Hallo3
[CVPR 2025] Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer
📸 Showcase
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/3fc44086-bdbf-4a54-bfe3-62cfd9dfb191" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/ad5a87cf-b50e-48d6-af35-774e3b1713e7" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/78c7acc3-4fa2-447e-b77d-3462d411c81c" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/f62f2b6d-9846-40be-a976-56cc7d5a8a5b" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/42b6968e-c68a-4473-b773-406ccf5d90b1" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/015f1d6d-31a8-4454-b51a-5431d3c953c2" width="100%" controls autoplay loop></video> </td> </tr> </table>

Visit our project page to view more examples.
📰 News
- 2025/02/27: 🎉🎉🎉 Our paper has been accepted to CVPR 2025.
- 2025/01/27: 🎉🎉🎉 Released training data on HuggingFace. The collection comprises over 70 hours of talking-head videos, complemented by an additional 50 hours of dynamic video clips featuring vibrant foregrounds and backgrounds.
⚙️ Installation
- System requirements: Ubuntu 20.04/22.04, CUDA 12.1
- Tested GPUs: H100
Clone the code:

```bash
git clone https://github.com/fudan-generative-vision/hallo3
cd hallo3
```
Create a conda environment:

```bash
conda create -n hallo python=3.10
conda activate hallo
```
Install packages with pip:

```bash
pip install -r requirements.txt
```
In addition, ffmpeg is required:

```bash
apt-get install ffmpeg
```
📥 Download Pretrained Models
You can easily get all pretrained models required by inference from our HuggingFace repo.
Use huggingface-cli to download the models:

```bash
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models
```
Or you can download them separately from their source repos:

- hallo3: Our checkpoints.
- CogVideoX: the CogVideoX-5B-I2V pretrained model, consisting of a transformer and a 3D VAE.
- t5-v1_1-xxl: text encoder; download the text_encoder and tokenizer folders.
- audio_separator: the Kim_Vocal_2 MDX-Net vocal removal model.
- wav2vec: wav2vec2 audio-to-vector model from Facebook.
- insightface: 2D and 3D face analysis models, placed into `pretrained_models/face_analysis/models/` (thanks to deepinsight).
- face landmarker: face detection & mesh model from MediaPipe, placed into `pretrained_models/face_analysis/models/`.
Finally, these pretrained models should be organized as follows:

```text
./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- cogvideox-5b-i2v-sat/
|   |-- transformer/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- vae/
|       `-- 3d-vae.pt
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from MediaPipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- hallo3/
|   |-- 1/
|   |   `-- mp_rank_00_model_states.pt
|   `-- latest
|-- t5-v1_1-xxl/
|   |-- added_tokens.json
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
```
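Before running inference, it can help to confirm the download completed. The following is a minimal sketch (not part of the repo) that checks for one representative file per model directory from the tree above; the file list and default root path mirror this README and can be adjusted.

```python
import os

# One representative file per model directory, taken from the tree above.
REQUIRED = [
    "audio_separator/Kim_Vocal_2.onnx",
    "cogvideox-5b-i2v-sat/transformer/1/mp_rank_00_model_states.pt",
    "cogvideox-5b-i2v-sat/vae/3d-vae.pt",
    "face_analysis/models/face_landmarker_v2_with_blendshapes.task",
    "hallo3/1/mp_rank_00_model_states.pt",
    "t5-v1_1-xxl/model.safetensors.index.json",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]

def missing_files(root="./pretrained_models"):
    """Return the required files that are absent under `root`."""
    return [p for p in REQUIRED if not os.path.isfile(os.path.join(root, p))]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing pretrained files:\n" + "\n".join(missing))
    else:
        print("All pretrained models found.")
```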
🛠️ Prepare Inference Data
Hallo3 has a few simple requirements for inference input data:

- The reference image must have a 1:1 or 3:2 aspect ratio.
- The driving audio must be in WAV format.
- The audio must be in English, since our training datasets contain only this language.
- Ensure the vocals in the audio are clear; background music is acceptable.
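The two mechanical requirements (aspect ratio and WAV format) can be checked before launching a run. This is an illustrative, dependency-free sketch, not part of the repo; image decoding is omitted, so pass the pixel dimensions you already know (e.g. from `PIL.Image.open(path).size`), and the ratio is interpreted here as width:height.

```python
import wave
from fractions import Fraction

# Aspect ratios allowed by the README, interpreted as width:height.
ALLOWED_RATIOS = {Fraction(1, 1), Fraction(3, 2)}

def valid_aspect(width: int, height: int) -> bool:
    """True if width:height reduces to 1:1 or 3:2."""
    return Fraction(width, height) in ALLOWED_RATIOS

def is_wav(path: str) -> bool:
    """True if the file parses as a RIFF/WAVE file."""
    try:
        with wave.open(path, "rb"):
            return True
    except (wave.Error, EOFError, FileNotFoundError):
        return False
```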
🎮 Run Inference
Gradio UI
To run the Gradio UI, simply run hallo3/app.py:

```bash
python hallo3/app.py
```

Batch
Simply run scripts/inference_long_batch.sh:

```bash
bash scripts/inference_long_batch.sh ./examples/inference/input.txt ./output
```

Animation results will be saved to ./output. You can find more inference examples in the examples folder.
Training
Prepare data for training
Begin by downloading the training dataset from the HuggingFace Dataset Repo. It contains over 70 hours of talking-head videos focused on the speaker's face and speech, plus more than 50 hours of wild-scene clips from various real-world settings.
After downloading, unzip all the .tgz files and organize the data into the following directory structure:
```text
dataset_name/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   `-- 0003.mp4
`-- caption/
    |-- 0001.txt
    |-- 0002.txt
    `-- 0003.txt
```
You can use any dataset_name, but the videos and caption directories must be named as shown above.
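Since each video needs a matching caption, a quick consistency check on the layout above can save a failed run. This is an illustrative sketch (not part of the repo); `dataset_root` is whatever dataset directory name you chose.

```python
import os

def unpaired_videos(dataset_root: str):
    """Return video basenames under videos/ that lack a caption/*.txt file."""
    videos = {os.path.splitext(f)[0]
              for f in os.listdir(os.path.join(dataset_root, "videos"))
              if f.endswith(".mp4")}
    captions = {os.path.splitext(f)[0]
                for f in os.listdir(os.path.join(dataset_root, "caption"))
                if f.endswith(".txt")}
    return sorted(videos - captions)
```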
Next, process the videos with the following command:

```bash
bash scripts/data_preprocess.sh {dataset_name} {parallelism} {rank} {output_name}
```
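The `{parallelism}` and `{rank}` arguments suggest the preprocessing can be sharded across workers, one invocation per rank. A small launcher sketch under that assumption (the command layout mirrors the CLI above; nothing here is part of the repo):

```python
import subprocess

def preprocess_cmds(dataset_name: str, parallelism: int, output_name: str):
    """Build one data_preprocess.sh command per rank, matching the CLI above."""
    return [
        ["bash", "scripts/data_preprocess.sh",
         dataset_name, str(parallelism), str(rank), output_name]
        for rank in range(parallelism)
    ]

def run_all(dataset_name: str, parallelism: int, output_name: str):
    """Launch all ranks concurrently and wait for their exit codes."""
    procs = [subprocess.Popen(cmd)
             for cmd in preprocess_cmds(dataset_name, parallelism, output_name)]
    return [p.wait() for p in procs]
```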
Training
Update the data meta path settings in the configuration YAML files, configs/sft_s1.yaml and configs/sft_s2.yaml:
```yaml
# sft_s1.yaml
train_data: [
    "./data/output_name.json"
]

# sft_s2.yaml
train_data: [
    "./data/output_name.json"
]
```
Start training with the following commands:

```bash
# stage 1
bash scripts/finetune_multi_gpus_s1.sh
# stage 2
bash scripts/finetune_multi_gpus_s2.sh
```
📝 Citation
If you find our work useful for your research, please consider citing the paper:
```bibtex
@misc{cui2024hallo3,
  title={Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer},
  author={Jiahao Cui and Hui Li and Yun Zhan and Hanlin Shang and Kaihui Cheng and Yuqi Ma and Shan Mu and Hang
```