# Hallo2

[ICLR 2025] Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation
## 📸 Showcase
<table class="center"> <tr> <td style="text-align: center"><b>Taylor Swift Speech @ NYU (4K, 23 minutes)</b></td> <td style="text-align: center"><b>Johan Rockstrom Speech @ TED (4K, 18 minutes)</b></td> </tr> <tr> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/TailorSpeech.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/TailorSpeechGIF.gif"></a></td> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/TEDSpeech.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/TEDSpeechGIF.gif"></a></td> </tr> <tr> <td style="text-align: center"><b>Churchill's Iron Curtain Speech (4K, 4 minutes)</b></td> <td style="text-align: center"><b>An LLM Course from Stanford (4K, up to 1 hour)</b></td> </tr> <tr> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/DarkestHour.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/DarkestHour.gif"></a></td> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/LLMCourse.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/LLMCourseGIF.gif"></a></td> </tr> </table>

Visit our project page to view more cases.
## 📰 News
- **2025/01/23**: 🎉🎉🎉 Our paper has been accepted to ICLR 2025.
- **2024/10/16**: ✨✨✨ Source code and pretrained weights released.
- **2024/10/10**: 🎉🎉🎉 Paper submitted on arXiv.
## 📅️ Roadmap
| Status | Milestone                        |    ETA     |
| :----: | :------------------------------- | :--------: |
|   ✅   | Paper submitted on arXiv         | 2024-10-10 |
|   ✅   | Source code released on GitHub   | 2024-10-16 |
|   🚀   | Accelerate inference performance |    TBD     |
## 🔧️ Framework

## ⚙️ Installation
- System requirement: Ubuntu 20.04/Ubuntu 22.04, CUDA 11.8
- Tested GPUs: A100
Download the code:

```bash
git clone https://github.com/fudan-generative-vision/hallo2
cd hallo2
```
Create a conda environment:

```bash
conda create -n hallo python=3.10
conda activate hallo
```
Install packages with `pip`:

```bash
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
In addition, `ffmpeg` is required:

```bash
apt-get install ffmpeg
```
## 📥 Download Pretrained Models
You can easily get all pretrained models required for inference from our HuggingFace repo.
Use `huggingface-cli` to download the models:

```bash
cd $ProjectRootDir
pip install huggingface_hub
huggingface-cli download fudan-generative-ai/hallo2 --local-dir ./pretrained_models
```
Or you can download them separately from their source repos:
- hallo2: Our checkpoints, consisting of the denoising UNet, face locator, and image & audio projection modules.
- audio_separator: Kim_Vocal_2 MDX-Net vocal removal model. (Thanks to KimberleyJensen)
- insightface: 2D and 3D face analysis models, placed into `pretrained_models/face_analysis/models/`. (Thanks to deepinsight)
- face landmarker: Face detection & mesh model from mediapipe, placed into `pretrained_models/face_analysis/models/`.
- motion module: Motion module from AnimateDiff. (Thanks to guoyww)
- sd-vae-ft-mse: Weights intended to be used with the diffusers library. (Thanks to stabilityai)
- StableDiffusion V1.5: Initialized and fine-tuned from Stable-Diffusion-v1-2. (Thanks to runwayml)
- wav2vec: wav audio-to-vector model from Facebook.
- facelib: Pretrained face parsing models.
- realesrgan: Background upsampling model.
- CodeFormer: Pretrained CodeFormer model. Downloading it is optional, and only needed if you want to train our video super-resolution model from scratch.
Finally, these pretrained models should be organized as follows:

```text
./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- CodeFormer/
|   |-- codeformer.pth
|   `-- vqgan_code1024.pth
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from mediapipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- facelib/
|   |-- detection_mobilenet0.25_Final.pth
|   |-- detection_Resnet50_Final.pth
|   |-- parsing_parsenet.pth
|   |-- yolov5l-face.pth
|   `-- yolov5n-face.pth
|-- hallo2/
|   |-- net_g.pth
|   `-- net.pth
|-- motion_module/
|   `-- mm_sd_v15_v2.ckpt
|-- realesrgan/
|   `-- RealESRGAN_x2plus.pth
|-- sd-vae-ft-mse/
|   |-- config.json
|   `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5/
|   `-- unet/
|       |-- config.json
|       `-- diffusion_pytorch_model.safetensors
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
```
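Before running inference, it can save time to confirm the key checkpoint files from the layout above are actually in place. The sketch below is an illustrative helper, not part of the Hallo2 codebase; the file list is a subset of the tree shown above.

```python
# Sanity-check that key checkpoint files from the layout above exist.
# verify_models() is an illustrative helper, not part of the repo.
from pathlib import Path

EXPECTED_FILES = [
    "audio_separator/Kim_Vocal_2.onnx",
    "face_analysis/models/face_landmarker_v2_with_blendshapes.task",
    "facelib/parsing_parsenet.pth",
    "hallo2/net.pth",
    "hallo2/net_g.pth",
    "motion_module/mm_sd_v15_v2.ckpt",
    "realesrgan/RealESRGAN_x2plus.pth",
    "sd-vae-ft-mse/diffusion_pytorch_model.safetensors",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.safetensors",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]

def verify_models(root: str = "./pretrained_models") -> list:
    """Return the expected files that are missing under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]

if __name__ == "__main__":
    missing = verify_models()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoints found.")
```

Running it from the project root prints any checkpoint the download step failed to fetch.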
## 🛠️ Prepare Inference Data
Hallo has a few simple requirements for input data:
For the source image:
- It should be cropped to a square.
- The face should be the main focus, making up 50%-70% of the image.
- The face should be facing forward, with a rotation angle of less than 30° (no side profiles).
For the driving audio:
- It must be in WAV format.
- It must be in English since our training datasets are only in this language.
- Ensure the vocals are clear; background music is acceptable.
We have provided some samples for your reference.
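The requirements above can be checked up front with a small pre-flight script. This is a minimal sketch, not part of the repo: `check_source_image` works on plain numbers (you would supply the face-box ratio and yaw angle from your own face detector), and `check_driving_audio` only confirms the file parses as WAV.

```python
# Pre-flight checks mirroring the input requirements above.
# Both helpers are illustrative; the thresholds come from the
# guidelines, the functions themselves are not part of Hallo2.
import wave

def check_source_image(width: int, height: int,
                       face_area_ratio: float,
                       yaw_degrees: float) -> list:
    """Return a list of warnings for a source portrait."""
    warnings = []
    if width != height:
        warnings.append("image is not square")
    if not 0.5 <= face_area_ratio <= 0.7:
        warnings.append("face should cover 50%-70% of the image")
    if abs(yaw_degrees) >= 30:
        warnings.append("face rotation should be under 30 degrees")
    return warnings

def check_driving_audio(path: str) -> bool:
    """True if the file parses as a WAV file."""
    try:
        with wave.open(path, "rb"):
            return True
    except (wave.Error, OSError):
        return False
```

For example, `check_source_image(512, 512, 0.6, 10.0)` returns no warnings, while a 512x640 crop triggers the square-image warning.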
## 🎮 Run Inference
### Long-Duration Animation
Simply run `scripts/inference_long.py`, setting `source_image`, `driving_audio`, and `save_path` in the config file:

```bash
python scripts/inference_long.py --config ./configs/inference/long.yaml
```
Animation results will be saved at `save_path`. You can find more inference examples in the `examples` folder.
For more options:

```text
usage: inference_long.py [-h] [-c CONFIG] [--source_image SOURCE_I
```
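When animating several clips, the command above can be wrapped in a small driver. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, and only the `--config` flag shown in the usage above is used.

```python
# Wrap the documented inference command for repeated runs.
# build_command() is an illustrative helper, not part of Hallo2;
# it only uses the --config flag shown in the usage text above.
import subprocess

def build_command(config_path: str = "./configs/inference/long.yaml") -> list:
    """Assemble the inference command as an argument list."""
    return ["python", "scripts/inference_long.py", "--config", config_path]

def run_inference(config_path: str) -> int:
    """Run one inference pass and return its exit code."""
    return subprocess.run(build_command(config_path)).returncode
```

Each run still reads `source_image`, `driving_audio`, and `save_path` from the given config file, so one config per clip is the simplest batching scheme.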