Hallo3
[CVPR 2025] Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer
📸 Showcase
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/3fc44086-bdbf-4a54-bfe3-62cfd9dfb191" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/ad5a87cf-b50e-48d6-af35-774e3b1713e7" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/78c7acc3-4fa2-447e-b77d-3462d411c81c" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/f62f2b6d-9846-40be-a976-56cc7d5a8a5b" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/42b6968e-c68a-4473-b773-406ccf5d90b1" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/015f1d6d-31a8-4454-b51a-5431d3c953c2" width="100%" controls autoplay loop></video> </td> </tr> </table>

Visit our project page to view more examples.
📰 News
- 2025/02/27: 🎉🎉🎉 Our paper has been accepted to CVPR 2025.
- 2025/01/27: 🎉🎉🎉 Released training data on HuggingFace. The collection comprises over 70 hours of talking-head videos, complemented by an additional 50 hours of dynamic video clips featuring vibrant foregrounds and backgrounds.
⚙️ Installation
- System requirements: Ubuntu 20.04/22.04, CUDA 12.1
- Tested GPUs: H100
Clone the code:

```bash
git clone https://github.com/fudan-generative-vision/hallo3
cd hallo3
```
Create a conda environment:

```bash
conda create -n hallo python=3.10
conda activate hallo
```
Install packages with pip:

```bash
pip install -r requirements.txt
```
In addition, ffmpeg is required:

```bash
apt-get install ffmpeg
```
📥 Download Pretrained Models
You can easily get all pretrained models required by inference from our HuggingFace repo.
Use huggingface-cli to download the models:

```bash
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models
```
Or you can download them separately from their source repos:

- hallo3: Our checkpoints.
- CogVideoX: the CogVideoX-5B-I2V pretrained model, consisting of a transformer and a 3D VAE.
- t5-v1_1-xxl: text encoder; download the text_encoder and tokenizer folders.
- audio_separator: the Kim_Vocal_2 MDX-Net vocal removal model.
- wav2vec: wav2vec2 audio-to-vector model from Facebook.
- insightface: 2D and 3D face analysis models, placed into `pretrained_models/face_analysis/models/` (thanks to deepinsight).
- face landmarker: face detection & mesh model from MediaPipe, placed into `pretrained_models/face_analysis/models/`.
Finally, these pretrained models should be organized as follows:

```text
./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- cogvideox-5b-i2v-sat/
|   |-- transformer/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- vae/
|       `-- 3d-vae.pt
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from MediaPipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- hallo3/
|   |-- 1/
|   |   `-- mp_rank_00_model_states.pt
|   `-- latest
|-- t5-v1_1-xxl/
|   |-- added_tokens.json
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
```
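Before running inference, it can help to confirm the download completed. The following is a minimal sketch (not part of the repo) that checks for one representative file per model directory from the tree above; the file list and default root path mirror this README and can be adjusted.

```python
import os

# One representative file per model directory, taken from the tree above.
REQUIRED = [
    "audio_separator/Kim_Vocal_2.onnx",
    "cogvideox-5b-i2v-sat/transformer/1/mp_rank_00_model_states.pt",
    "cogvideox-5b-i2v-sat/vae/3d-vae.pt",
    "face_analysis/models/face_landmarker_v2_with_blendshapes.task",
    "hallo3/1/mp_rank_00_model_states.pt",
    "t5-v1_1-xxl/model.safetensors.index.json",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]

def missing_files(root="./pretrained_models"):
    """Return the required files that are absent under `root`."""
    return [p for p in REQUIRED if not os.path.isfile(os.path.join(root, p))]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing pretrained files:\n" + "\n".join(missing))
    else:
        print("All pretrained models found.")
```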
🛠️ Prepare Inference Data
Hallo3 has a few simple requirements for inference input data:

- The reference image must have a 1:1 or 3:2 aspect ratio.
- The driving audio must be in WAV format.
- The audio must be in English, since our training datasets contain only this language.
- Ensure the vocals in the audio are clear; background music is acceptable.
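The two mechanical requirements (aspect ratio and WAV format) can be checked before launching a run. This is an illustrative, dependency-free sketch, not part of the repo; image decoding is omitted, so pass the pixel dimensions you already know (e.g. from `PIL.Image.open(path).size`), and the ratio is interpreted here as width:height.

```python
import wave
from fractions import Fraction

# Aspect ratios allowed by the README, interpreted as width:height.
ALLOWED_RATIOS = {Fraction(1, 1), Fraction(3, 2)}

def valid_aspect(width: int, height: int) -> bool:
    """True if width:height reduces to 1:1 or 3:2."""
    return Fraction(width, height) in ALLOWED_RATIOS

def is_wav(path: str) -> bool:
    """True if the file parses as a RIFF/WAVE file."""
    try:
        with wave.open(path, "rb"):
            return True
    except (wave.Error, EOFError, FileNotFoundError):
        return False
```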
🎮 Run Inference
Gradio UI
To run the Gradio UI, simply run hallo3/app.py:

```bash
python hallo3/app.py
```

Batch
Simply run scripts/inference_long_batch.sh:

```bash
bash scripts/inference_long_batch.sh ./examples/inference/input.txt ./output
```

Animation results will be saved to ./output. You can find more inference examples in the examples folder.
Training
Prepare data for training
Begin by downloading the training dataset from the HuggingFace Dataset Repo. It contains over 70 hours of talking-head videos focused on the speaker's face and speech, plus more than 50 hours of wild-scene clips from various real-world settings.
After downloading, unzip all the .tgz files and organize the data into the following directory structure:
```text
dataset_name/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   `-- 0003.mp4
`-- caption/
    |-- 0001.txt
    |-- 0002.txt
    `-- 0003.txt
```
You can use any dataset_name, but the videos and caption directories must be named as shown above.
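Since each video needs a matching caption, a quick consistency check on the layout above can save a failed run. This is an illustrative sketch (not part of the repo); `dataset_root` is whatever dataset directory name you chose.

```python
import os

def unpaired_videos(dataset_root: str):
    """Return video basenames under videos/ that lack a caption/*.txt file."""
    videos = {os.path.splitext(f)[0]
              for f in os.listdir(os.path.join(dataset_root, "videos"))
              if f.endswith(".mp4")}
    captions = {os.path.splitext(f)[0]
                for f in os.listdir(os.path.join(dataset_root, "caption"))
                if f.endswith(".txt")}
    return sorted(videos - captions)
```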
Next, process the videos with the following command:

```bash
bash scripts/data_preprocess.sh {dataset_name} {parallelism} {rank} {output_name}
```
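The `{parallelism}` and `{rank}` arguments suggest the preprocessing can be sharded across workers, one invocation per rank. A small launcher sketch under that assumption (the command layout mirrors the CLI above; nothing here is part of the repo):

```python
import subprocess

def preprocess_cmds(dataset_name: str, parallelism: int, output_name: str):
    """Build one data_preprocess.sh command per rank, matching the CLI above."""
    return [
        ["bash", "scripts/data_preprocess.sh",
         dataset_name, str(parallelism), str(rank), output_name]
        for rank in range(parallelism)
    ]

def run_all(dataset_name: str, parallelism: int, output_name: str):
    """Launch all ranks concurrently and wait for their exit codes."""
    procs = [subprocess.Popen(cmd)
             for cmd in preprocess_cmds(dataset_name, parallelism, output_name)]
    return [p.wait() for p in procs]
```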
Training
Update the data meta path settings in the configuration YAML files, configs/sft_s1.yaml and configs/sft_s2.yaml:
```yaml
# sft_s1.yaml
train_data: [
    "./data/output_name.json"
]

# sft_s2.yaml
train_data: [
    "./data/output_name.json"
]
```
Start training with the following commands:

```bash
# stage 1
bash scripts/finetune_multi_gpus_s1.sh
# stage 2
bash scripts/finetune_multi_gpus_s2.sh
```
📝 Citation
If you find our work useful for your research, please consider citing the paper:
```bibtex
@misc{cui2024hallo3,
  title={Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer},
  author={Jiahao Cui and Hui Li and Yun Zhan and Hanlin Shang and Kaihui Cheng and Yuqi Ma and Shan Mu and Hang
```