FollowYourPose
[AAAI 2024] Follow-Your-Pose: This repo is the official implementation of "Follow-Your-Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos"
Yue Ma*, Yingqing He*, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, and Qifeng Chen
<a href='https://arxiv.org/abs/2304.01186'><img src='https://img.shields.io/badge/ArXiv-2304.01186-red'></a>
<a href='https://follow-your-pose.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
💃💃💃 Demo Video
https://github.com/mayuelala/FollowYourPose/assets/38033523/e021bce6-b9bd-474d-a35a-7ddff4ab8e75
💃💃💃 Abstract
<b>TL;DR: We tune a text-to-image model (e.g., Stable Diffusion) to generate character videos from a pose sequence and a text description.</b>
<details><summary>CLICK for full abstract</summary>Generating text-editable and pose-controllable character videos is in high demand for creating various digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and the pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only the keypoint-image pairs are used for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept-composition ability of the pre-trained T2I model. The code and models will be made publicly available.</details>
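The zero-initialized pose encoder mentioned in the abstract can be illustrated with a tiny NumPy sketch (not the actual model code): because its weights and bias start at zero, the pose branch contributes exactly nothing at the start of training, so the pretrained T2I features pass through unchanged until the encoder learns.

```python
import numpy as np

def conv2d_1x1(x, w, b):
    # 1x1 convolution: x has shape (C_in, H, W), w (C_out, C_in), b (C_out,)
    return np.tensordot(w, x, axes=([1], [0])) + b[:, None, None]

c_in, c_out = 4, 4
w = np.zeros((c_out, c_in))  # zero-initialized weights
b = np.zeros(c_out)          # zero-initialized bias

pose_feat = np.random.randn(c_in, 8, 8)   # toy pose feature map
residual = conv2d_1x1(pose_feat, w, b)
# residual is all zeros, so adding it to the frozen T2I features
# leaves the pretrained model's behavior untouched at step 0
```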
🕺🕺🕺 Changelog
<!-- A new option stores all the attentions on hard disk, which requires less RAM. -->
- [2024.03.15] 🔥 🔥 🔥 We release our second follower, Follow-Your-Click, the first framework to achieve regional image animation. Try it now! Please give us a star! ⭐️⭐️⭐️ 😄
- [2023.12.09] 🔥 The paper is accepted by AAAI 2024!
- [2023.08.30] 🔥 Release some new results!
- [2023.07.06] 🔥 Release a new version of the 浦源内容平台 (OpenXLab) demo! Thanks for the support of Shanghai AI Lab!
- [2023.04.12] 🔥 Release the local gradio demo; you can run it locally with just an A100/3090.
- [2023.04.11] 🔥 Release some cases in the huggingface demo.
- [2023.04.10] 🔥 Release a new version of the huggingface demo, which supports both raw video and skeleton video as input. Enjoy it!
- [2023.04.07] Release the first version of the huggingface demo. Enjoy the fun of following your pose! You need to download a skeleton video or make your own with mmpose. Additionally, a second version that takes raw video as input is coming.
- [2023.04.07] Release a colab notebook and update the requirements for installation!
- [2023.04.06] Release code, config and checkpoints!
- [2023.04.03] Release the paper and project page!
💃💃💃 HuggingFace Demo
<table class="center"> <tr> <td><img src="https://user-images.githubusercontent.com/38033523/231338219-94b54b10-3fdc-4bf5-9e07-0c1ff236793a.png"></td> <td><img src="https://user-images.githubusercontent.com/38033523/231337960-a30db639-2ecc-486f-8d95-3b3e9c2ed338.png"></td> </tr> </table>

🎤🎤🎤 Todo
- [X] Release the code, config and checkpoints for teaser
- [X] Colab
- [X] Hugging face gradio demo
- [ ] Release more applications
🍻🍻🍻 Setup Environment
Our method is trained with CUDA 11, accelerate and xformers on 8 A100 GPUs.
conda create -n fupose python=3.8
conda activate fupose
pip install -r requirements.txt
xformers is recommended on A100 GPUs to save memory and running time.
We found its installation to be unstable. You may try the following wheel:
wget https://github.com/ShivamShrirao/xformers-wheels/releases/download/4c06c79/xformers-0.0.15.dev0+4c06c79.d20221201-cp38-cp38-linux_x86_64.whl
pip install xformers-0.0.15.dev0+4c06c79.d20221201-cp38-cp38-linux_x86_64.whl
Our environment is similar to that of Tune-A-Video (official, unofficial). You may check those repos for more details.
💃💃💃 Training
We fixed a bug in Tune-A-Video and finetune Stable Diffusion v1.4 on 8 A100 GPUs. To fine-tune the text-to-image diffusion model for text-to-video generation, run this command:
TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch \
--multi_gpu --num_processes=8 --gpu_ids '0,1,2,3,4,5,6,7' \
train_followyourpose.py \
--config="configs/pose_train.yaml"
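The training entry point reads a YAML config. The authoritative schema is the configs/pose_train.yaml shipped with the repo; the fragment below is only an illustrative sketch in the style of Tune-A-Video-like configs, and every field name in it is an assumption to be checked against the real file.

```yaml
# Hypothetical sketch of a pose-training config; field names are
# illustrative, not copied from the actual configs/pose_train.yaml.
pretrained_model_path: "./checkpoints/stable-diffusion-v1-4"
output_dir: "./outputs/pose_train"
train_batch_size: 1
learning_rate: 3.0e-5
max_train_steps: 10000
```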
🕺🕺🕺 Inference
Once the training is done, run inference:
TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch \
--gpu_ids '0' \
txt2video.py \
--config="configs/pose_sample.yaml" \
--skeleton_path="./pose_example/vis_ikun_pose2.mov"
You can make the pose video with mmpose; we detect the skeleton with HRNet. You just need to run its video demo to obtain the pose video. Remember to replace the background with black.
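Whichever tool produces the skeleton video, the background must end up pure black. A minimal NumPy sketch of that clean-up step, assuming uint8 RGB frames and an arbitrary brightness cutoff (a video library such as OpenCV would supply the frames):

```python
import numpy as np

def blacken_background(frame, threshold=16):
    """Force near-black background pixels to exact black.

    frame: uint8 array of shape (H, W, 3). `threshold` is an assumed
    brightness cutoff below which a pixel counts as background.
    """
    mask = frame.max(axis=-1) < threshold  # True where no channel is bright
    out = frame.copy()
    out[mask] = 0
    return out

# one toy frame: a bright "skeleton" pixel next to a dim background pixel
frame = np.array([[[200, 10, 10], [5, 5, 5]]], dtype=np.uint8)
clean = blacken_background(frame)
# the dim pixel becomes pure black; the skeleton pixel is preserved
```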
💃💃💃 Local Gradio Demo
You can run the gradio demo locally; you only need an A100/3090.
python app.py
Then the demo runs on a local URL: http://0.0.0.0:Port
🕺🕺🕺 Weight
[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-4)
[FollowYourPose] We also provide our pretrained checkpoints on Hugging Face. You can download them and put them into the checkpoints folder to run inference with our models.
FollowYourPose
├── checkpoints
│ ├── followyourpose_checkpoint-1000
│ │ ├──...
│ ├── stable-diffusion-v1-4
│ │ ├──...
│ └── pose_encoder.pth
💃💃💃 Results
We show our results regarding various pose sequences and text prompts.
Note: the mp4 and gif files on this GitHub page are compressed. Please check our Project Page for the original mp4 video results.
<table class="center"> <tr> <td><img src="gif_results/new_result_0830/Trump_on_the_mountain.gif"></td> <td><img src="gif_results/new_result_0830/Trump_wear_yellow.gif"></td> <td><img src="gif_results/new_result_0830/astranaut_on_the_mountain.gif"></td> </tr> <tr> <td width=25% style="text-align:center;">"Trump, on the mountain"</td> <td width=25% style="text-align:center;">"man, on the mountain"</td> <td width=25% style="text-align:center;">"astronaut, on mountain"</td> </tr> </table>
