# ✨VideoMaker✨

**VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models**

<a href='https://arxiv.org/abs/2412.19645'><img src='https://img.shields.io/badge/ArXiv-2412.19645-red'></a> <a href='https://wutao-cs.github.io/VideoMaker'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
## 🥳 Demo

Please see more demo videos on the [project page](https://wutao-cs.github.io/VideoMaker).
## 🔆 Abstract

Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also aligns closely with VDM's pre-trained knowledge. For feature injection, we devise a bidirectional interaction between subject features and generated content through spatial self-attention within VDM, ensuring better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.
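To make the feature-injection idea concrete, here is a minimal, illustrative PyTorch sketch, not the released code, of letting frame tokens and reference-image tokens attend to each other inside an ordinary spatial self-attention layer. All tensor names and shapes are our assumptions:

```python
# Illustrative only: a joint spatial self-attention over frame tokens and
# reference-image tokens, so each side attends to the other (the
# "bidirectional interaction" described above). Names/shapes are assumptions.
import torch
import torch.nn.functional as F

def spatial_attn_with_ref(frame_tokens, ref_tokens, to_q, to_k, to_v, num_heads):
    # frame_tokens: (B, N, C) tokens of one generated frame
    # ref_tokens:   (B, M, C) tokens of the reference image, produced by
    #               running it through the same VDM layers
    x = torch.cat([frame_tokens, ref_tokens], dim=1)        # (B, N+M, C)
    q, k, v = to_q(x), to_k(x), to_v(x)
    B, L, C = q.shape
    q = q.view(B, L, num_heads, C // num_heads).transpose(1, 2)
    k = k.view(B, L, num_heads, C // num_heads).transpose(1, 2)
    v = v.view(B, L, num_heads, C // num_heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)           # joint attention
    out = out.transpose(1, 2).reshape(B, L, C)
    return out[:, :frame_tokens.shape[1]]                   # keep frame tokens

# Tiny smoke test with random tensors
B, N, M, C, H = 1, 16, 8, 64, 4
proj = lambda: torch.nn.Linear(C, C)
print(spatial_attn_with_ref(torch.randn(B, N, C), torch.randn(B, M, C),
                            proj(), proj(), proj(), H).shape)  # (1, 16, 64)
```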
## 📋 TODO
- ✅ Release inference code and model weights
- ✅ Release arxiv paper
- ✅ Release gradio demo
- ⬜️ Release training code
## 😉 Pipeline
## 📦 Installation

```bash
pip install -r requirements.txt
```
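If you prefer an isolated environment, something like the following should work. The Python version is our assumption; the repository only ships `requirements.txt`:

```bash
# Optional isolated environment; python=3.10 is an assumption, not pinned by the repo
conda create -n videomaker python=3.10 -y
conda activate videomaker
pip install -r requirements.txt
```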
## 🛠️ Preparation

Download all pretrained models into the `./pretrain_model/` folder.

Prepare the pretrained Realistic_Vision_V5.1_noVAE weights:
```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/SG161222/Realistic_Vision_V5.1_noVAE
```
Prepare the pretrained AnimateDiff SD1.5 motion-adapter weights:

```bash
git clone https://huggingface.co/guoyww/animatediff-motion-adapter-v1-5-3
```
Prepare the pretrained VideoMaker weights:

```bash
git clone https://huggingface.co/SugerWu/VideoMaker
```
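After the three clones above, `./pretrain_model/` should look roughly like this (a sketch inferred from the inference commands below; the cloned repos contain additional files):

```text
pretrain_model/
├── Realistic_Vision_V5.1_noVAE/
├── animatediff-motion-adapter-v1-5-3/
└── VideoMaker/
    ├── human_pytorch_model.bin
    └── object_pytorch_model.bin
```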
## 📊 Inference

### 💫 Custom Human Video Generation
We recommend using Grounded-SAM-2 or SAM-2 to preprocess the input image so that only the facial area is retained in the reference image. We provide some preprocessed examples in `./examples/human`.
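If you already have a binary mask from Grounded-SAM-2 or SAM-2, a minimal sketch of this preprocessing step looks like the following. File names are placeholders, and filling the background with white is our assumption:

```python
# Minimal sketch: keep only the masked (face) region of a reference image.
# Assumes a binary mask already produced by Grounded-SAM-2/SAM-2.
import numpy as np
from PIL import Image

image = np.array(Image.open("examples/raw_face.png").convert("RGB"))
mask = np.array(Image.open("examples/raw_face_mask.png").convert("L")) > 127

# White out everything outside the mask (background color is an assumption).
masked = np.where(mask[..., None], image, 255).astype(np.uint8)
Image.fromarray(masked).save("examples/face_only.png")
```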
```bash
# single reference image, single prompt inference
python inference.py \
    --seed 1234 \
    --prompt 'A person wearing a Superman outfit.' \
    --image_path 'examples/barack_obama.png' \
    --weight_path './pretrain_model/VideoMaker/human_pytorch_model.bin'
```
```bash
# single reference image, multiple prompt inference
python inference.py \
    --seed 1234 \
    --prompt './prompts/example.txt' \
    --image_path 'examples/barack_obama.png' \
    --weight_path './pretrain_model/VideoMaker/human_pytorch_model.bin'
```
```bash
# multiple reference images, multiple prompt inference
python inference.py \
    --seed 1234 2048 \
    --prompt './prompts/example.txt' \
    --image_path 'examples/human/' \
    --weight_path './pretrain_model/VideoMaker/human_pytorch_model.bin'
```
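For the multi-prompt commands, we assume `./prompts/example.txt` is a plain-text file with one prompt per line, e.g.:

```text
A person wearing a Superman outfit.
A person playing an acoustic guitar.
```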
### 📷 Custom Object Video Generation

Due to the limitations of the VideoBooth dataset, we only support the following nine object categories: bear, car, cat, dog, elephant, horse, lion, panda, and tiger. We recommend using Grounded-SAM-2 to preprocess the input image so that only the main object is retained in the reference image.
```bash
# single reference image, single prompt inference
python inference.py \
    --seed 1234 \
    --prompt 'A horse running through a shallow stream.' \
    --image_path 'examples/object/horse1.jpg' \
    --weight_path './pretrain_model/VideoMaker/object_pytorch_model.bin'
```
## 🤗 Gradio Demo

To start a local Gradio demo, run one of the following commands:
```bash
# Custom Human Video Generation demo
python gradio_demo/human_app.py

# Custom Object Video Generation demo
python gradio_demo/object_app.py
```
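The actual demo apps live in `gradio_demo/`; for orientation, a stripped-down skeleton of this kind of app (with a placeholder `generate` function, not the repository's real one) looks like:

```python
# Hypothetical skeleton of a customized-video Gradio demo; the real logic
# lives in gradio_demo/human_app.py and gradio_demo/object_app.py.
import gradio as gr

def generate(reference_image, prompt, seed):
    # Placeholder: the real app would run VideoMaker inference here
    # and return the path to the generated video file.
    raise NotImplementedError

demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Image(type="filepath", label="Reference image (subject only)"),
        gr.Textbox(label="Prompt"),
        gr.Number(value=1234, label="Seed"),
    ],
    outputs=gr.Video(label="Generated video"),
)

if __name__ == "__main__":
    demo.launch()
```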
## 📭 Citation

If you find VideoMaker helpful to your research, please cite our paper:
```bibtex
@article{wu2024videomaker,
  title={Videomaker: Zero-shot customized video generation with the inherent force of video diffusion models},
  author={Wu, Tao and Zhang, Yong and Cun, Xiaodong and Qi, Zhongang and Pu, Junfu and Dou, Huanzhang and Zheng, Guangcong and Shan, Ying and Li, Xi},
  journal={arXiv preprint arXiv:2412.19645},
  year={2024}
}
```