# Hallo2

[ICLR 2025] Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation
## 📸 Showcase
<table class="center"> <tr> <td style="text-align: center"><b>Taylor Swift Speech @ NYU (4K, 23 minutes)</b></td> <td style="text-align: center"><b>Johan Rockstrom Speech @ TED (4K, 18 minutes)</b></td> </tr> <tr> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/TailorSpeech.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/TailorSpeechGIF.gif"></a></td> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/TEDSpeech.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/TEDSpeechGIF.gif"></a></td> </tr> <tr> <td style="text-align: center"><b>Churchill's Iron Curtain Speech (4K, 4 minutes)</b></td> <td style="text-align: center"><b>An LLM Course from Stanford (4K, up to 1 hour)</b></td> </tr> <tr> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/DarkestHour.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/DarkestHour.gif"></a></td> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/LLMCourse.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/LLMCourseGIF.gif"></a></td> </tr> </table>

Visit our project page to view more cases.
## 📰 News
- **2025/01/23**: 🎉🎉🎉 Our paper has been accepted to ICLR 2025.
- **2024/10/16**: ✨✨✨ Source code and pretrained weights released.
- **2024/10/10**: 🎉🎉🎉 Paper submitted on arXiv.
## 📅️ Roadmap
| Status | Milestone                        |    ETA     |
| :----: | :------------------------------- | :--------: |
|   ✅   | Paper submitted on arXiv         | 2024-10-10 |
|   ✅   | Source code released on GitHub   | 2024-10-16 |
|   🚀   | Accelerate inference performance |    TBD     |
## 🔧️ Framework

## ⚙️ Installation
- System requirement: Ubuntu 20.04/Ubuntu 22.04, CUDA 11.8
- Tested GPUs: A100
Download the code:

```bash
git clone https://github.com/fudan-generative-vision/hallo2
cd hallo2
```
Create a conda environment:

```bash
conda create -n hallo python=3.10
conda activate hallo
```
Install packages with `pip`:

```bash
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
In addition, `ffmpeg` is required:

```bash
apt-get install ffmpeg
```
## 📥 Download Pretrained Models
You can easily get all pretrained models required for inference from our HuggingFace repo.
Use `huggingface-cli` to download the models:

```bash
cd $ProjectRootDir
pip install huggingface_hub
huggingface-cli download fudan-generative-ai/hallo2 --local-dir ./pretrained_models
```
Or you can download them separately from their source repos:
- hallo2: Our checkpoints, consisting of the denoising UNet, face locator, and image & audio projection modules.
- audio_separator: Kim_Vocal_2 MDX-Net vocal removal model. (Thanks to KimberleyJensen)
- insightface: 2D and 3D face analysis models, placed into `pretrained_models/face_analysis/models/`. (Thanks to deepinsight)
- face landmarker: Face detection & mesh model from mediapipe, placed into `pretrained_models/face_analysis/models/`.
- motion module: Motion module from AnimateDiff. (Thanks to guoyww)
- sd-vae-ft-mse: Weights intended to be used with the diffusers library. (Thanks to stabilityai)
- StableDiffusion V1.5: Initialized and fine-tuned from Stable-Diffusion-v1-2. (Thanks to runwayml)
- wav2vec: wav audio-to-vector model from Facebook.
- facelib: Pretrained face parsing models.
- realesrgan: Background upsampling model.
- CodeFormer: Pretrained CodeFormer model. Downloading it is optional, and only needed if you want to train our video super-resolution model from scratch.
Finally, these pretrained models should be organized as follows:

```text
./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- CodeFormer/
|   |-- codeformer.pth
|   `-- vqgan_code1024.pth
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from mediapipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- facelib/
|   |-- detection_mobilenet0.25_Final.pth
|   |-- detection_Resnet50_Final.pth
|   |-- parsing_parsenet.pth
|   |-- yolov5l-face.pth
|   `-- yolov5n-face.pth
|-- hallo2/
|   |-- net_g.pth
|   `-- net.pth
|-- motion_module/
|   `-- mm_sd_v15_v2.ckpt
|-- realesrgan/
|   `-- RealESRGAN_x2plus.pth
|-- sd-vae-ft-mse/
|   |-- config.json
|   `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5/
|   `-- unet/
|       |-- config.json
|       `-- diffusion_pytorch_model.safetensors
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
```
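Before running inference, it can save time to confirm the key checkpoint files from the layout above are actually in place. The sketch below is an illustrative helper, not part of the Hallo2 codebase; the file list is a subset of the tree shown above.

```python
# Sanity-check that key checkpoint files from the layout above exist.
# verify_models() is an illustrative helper, not part of the repo.
from pathlib import Path

EXPECTED_FILES = [
    "audio_separator/Kim_Vocal_2.onnx",
    "face_analysis/models/face_landmarker_v2_with_blendshapes.task",
    "facelib/parsing_parsenet.pth",
    "hallo2/net.pth",
    "hallo2/net_g.pth",
    "motion_module/mm_sd_v15_v2.ckpt",
    "realesrgan/RealESRGAN_x2plus.pth",
    "sd-vae-ft-mse/diffusion_pytorch_model.safetensors",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.safetensors",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]

def verify_models(root: str = "./pretrained_models") -> list:
    """Return the expected files that are missing under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]

if __name__ == "__main__":
    missing = verify_models()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoints found.")
```

Running it from the project root prints any checkpoint the download step failed to fetch.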
## 🛠️ Prepare Inference Data
Hallo has a few simple requirements for input data:
For the source image:
- It should be cropped to a square.
- The face should be the main focus, making up 50%-70% of the image.
- The face should be facing forward, with a rotation angle of less than 30° (no side profiles).
For the driving audio:
- It must be in WAV format.
- It must be in English since our training datasets are only in this language.
- Ensure the vocals are clear; background music is acceptable.
We have provided some samples for your reference.
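The requirements above can be checked up front with a small pre-flight script. This is a minimal sketch, not part of the repo: `check_source_image` works on plain numbers (you would supply the face-box ratio and yaw angle from your own face detector), and `check_driving_audio` only confirms the file parses as WAV.

```python
# Pre-flight checks mirroring the input requirements above.
# Both helpers are illustrative; the thresholds come from the
# guidelines, the functions themselves are not part of Hallo2.
import wave

def check_source_image(width: int, height: int,
                       face_area_ratio: float,
                       yaw_degrees: float) -> list:
    """Return a list of warnings for a source portrait."""
    warnings = []
    if width != height:
        warnings.append("image is not square")
    if not 0.5 <= face_area_ratio <= 0.7:
        warnings.append("face should cover 50%-70% of the image")
    if abs(yaw_degrees) >= 30:
        warnings.append("face rotation should be under 30 degrees")
    return warnings

def check_driving_audio(path: str) -> bool:
    """True if the file parses as a WAV file."""
    try:
        with wave.open(path, "rb"):
            return True
    except (wave.Error, OSError):
        return False
```

For example, `check_source_image(512, 512, 0.6, 10.0)` returns no warnings, while a 512x640 crop triggers the square-image warning.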
## 🎮 Run Inference
### Long-Duration Animation
Simply run `scripts/inference_long.py`, setting `source_image`, `driving_audio`, and `save_path` in the config file:

```bash
python scripts/inference_long.py --config ./configs/inference/long.yaml
```
Animation results will be saved at `save_path`. You can find more inference examples in the `examples` folder.
For more options:

```text
usage: inference_long.py [-h] [-c CONFIG] [--source_image SOURCE_I
```
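When animating several clips, the command above can be wrapped in a small driver. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, and only the `--config` flag shown in the usage above is used.

```python
# Wrap the documented inference command for repeated runs.
# build_command() is an illustrative helper, not part of Hallo2;
# it only uses the --config flag shown in the usage text above.
import subprocess

def build_command(config_path: str = "./configs/inference/long.yaml") -> list:
    """Assemble the inference command as an argument list."""
    return ["python", "scripts/inference_long.py", "--config", config_path]

def run_inference(config_path: str) -> int:
    """Run one inference pass and return its exit code."""
    return subprocess.run(build_command(config_path)).returncode
```

Each run still reads `source_image`, `driving_audio`, and `save_path` from the given config file, so one config per clip is the simplest batching scheme.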