<h1 align='center'>Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation</h1> <div align='center'> <a href='https://github.com/cuijh26' target='_blank'>Jiahao Cui</a><sup>1*</sup>&emsp; <a href='https://github.com/crystallee-ai' target='_blank'>Hui Li</a><sup>1*</sup>&emsp; <a href='https://yoyo000.github.io/' target='_blank'>Yao Yao</a><sup>3</sup>&emsp; <a href='http://zhuhao.cc/home/' target='_blank'>Hao Zhu</a><sup>3</sup>&emsp; <a href='https://github.com/NinoNeumann' target='_blank'>Hanlin Shang</a><sup>1</sup>&emsp; <a href='https://github.com/Kaihui-Cheng' target='_blank'>Kaihui Cheng</a><sup>1</sup>&emsp; <a href='' target='_blank'>Hang Zhou</a><sup>2</sup>&emsp; </div> <div align='center'> <a href='https://sites.google.com/site/zhusiyucs/home' target='_blank'>Siyu Zhu</a><sup>1✉️</sup>&emsp; <a href='https://jingdongwang2017.github.io/' target='_blank'>Jingdong Wang</a><sup>2</sup>&emsp; </div> <div align='center'> <sup>1</sup>Fudan University&emsp; <sup>2</sup>Baidu Inc&emsp; <sup>3</sup>Nanjing University </div> <div align='Center'> <i><strong><a href='https://iclr.cc/Conferences/2025' target='_blank'>ICLR 2025</a></strong></i> </div> <br> <div align='center'> <a href='https://github.com/fudan-generative-vision/hallo2'><img src='https://img.shields.io/github/stars/fudan-generative-vision/hallo2?style=social'></a> <a href='https://fudan-generative-vision.github.io/hallo2/#/'><img src='https://img.shields.io/badge/Project-HomePage-Green'></a> <a href='https://arxiv.org/abs/2410.07718'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/fudan-generative-ai/hallo2'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow'></a> <a href='https://openbayes.com/console/public/tutorials/8KOlYWsdiY4'><img src='https://img.shields.io/badge/Demo-OpenBayes贝式计算-orange'></a> <a href='assets/wechat.jpeg'><img src='https://badges.aleen42.com/src/wechat.svg'></a> </div> <br>

📸 Showcase

<table class="center"> <tr> <td style="text-align: center"><b>Taylor Swift Speech @ NYU (4K, 23 minutes)</b></td> <td style="text-align: center"><b>Johan Rockstrom Speech @ TED (4K, 18 minutes)</b></td> </tr> <tr> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/TailorSpeech.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/TailorSpeechGIF.gif"></a></td> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/TEDSpeech.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/TEDSpeechGIF.gif"></a></td> </tr> <tr> <td style="text-align: center"><b>Churchill's Iron Curtain Speech (4K, 4 minutes)</b></td> <td style="text-align: center"><b>An LLM Course from Stanford (4K, up to 1 hour)</b></td> </tr> <tr> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/DarkestHour.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/DarkestHour.gif"></a></td> <td style="text-align: center"><a target="_blank" href="https://cdn.aondata.work/hallo2/videos/showcases/LLMCourse.mp4"><img src="https://cdn.aondata.work/hallo2/videos/showcases/gifs/LLMCourseGIF.gif"></a></td> </tr> </table>

Visit our project page to view more cases.

📰 News

  • 2025/01/23: 🎉🎉🎉 Our paper has been accepted to ICLR 2025.
  • 2024/10/16: ✨✨✨ Source code and pretrained weights released.
  • 2024/10/10: 🎉🎉🎉 Paper submitted to arXiv.

📅️ Roadmap

| Status | Milestone                           |    ETA     |
| :----: | :---------------------------------- | :--------: |
|   ✅   | Paper submitted to arXiv            | 2024-10-10 |
|   ✅   | Source code released on GitHub      | 2024-10-16 |
|   🚀   | Accelerate inference performance    |    TBD     |

🔧️ Framework

*(Framework overview diagram)*

⚙️ Installation

  • System requirement: Ubuntu 20.04/Ubuntu 22.04, CUDA 11.8
  • Tested GPUs: A100

Download the codes:

  git clone https://github.com/fudan-generative-vision/hallo2
  cd hallo2

Create conda environment:

  conda create -n hallo python=3.10
  conda activate hallo

Install packages with pip:

  pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
  pip install -r requirements.txt
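
After installing, you can sanity-check the pinned versions with the standard library. This is a generic sketch, not part of the repo; `check_pins` is a hypothetical helper:

```python
from importlib import metadata

# Versions pinned by the pip install step above.
EXPECTED = {"torch": "2.2.2", "torchvision": "0.17.2", "torchaudio": "2.2.2"}

def check_pins(expected=EXPECTED) -> dict:
    """Map each package name to (expected version, installed version or None)."""
    result = {}
    for pkg, want in expected.items():
        try:
            have = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        result[pkg] = (want, have)
    return result

if __name__ == "__main__":
    for pkg, (want, have) in check_pins().items():
        status = "OK" if have == want else f"expected {want}, found {have}"
        print(f"{pkg}: {status}")
```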

In addition, ffmpeg is required:

  apt-get install ffmpeg

📥 Download Pretrained Models

You can easily get all pretrained models required for inference from our HuggingFace repo.

Use huggingface-cli to download the models:

cd $ProjectRootDir
pip install huggingface_hub
huggingface-cli download fudan-generative-ai/hallo2 --local-dir ./pretrained_models

Or you can download them separately from their source repos.

Finally, these pretrained models should be organized as follows:

./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- CodeFormer/
|   |-- codeformer.pth
|   `-- vqgan_code1024.pth
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from mediapipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- facelib
|   |-- detection_mobilenet0.25_Final.pth
|   |-- detection_Resnet50_Final.pth
|   |-- parsing_parsenet.pth
|   |-- yolov5l-face.pth
|   `-- yolov5n-face.pth
|-- hallo2
|   |-- net_g.pth
|   `-- net.pth
|-- motion_module/
|   `-- mm_sd_v15_v2.ckpt
|-- realesrgan
|   `-- RealESRGAN_x2plus.pth
|-- sd-vae-ft-mse/
|   |-- config.json
|   `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5/
|   `-- unet/
|       |-- config.json
|       `-- diffusion_pytorch_model.safetensors
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
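
To confirm the download produced the layout above, a small standard-library check can help. This is a sketch, not code from the repo; `find_missing` and the (partial) file list are illustrative, drawn from the tree shown above:

```python
import os

# A few representative files from the tree above; extend as needed.
REQUIRED_FILES = [
    "audio_separator/Kim_Vocal_2.onnx",
    "CodeFormer/codeformer.pth",
    "hallo2/net.pth",
    "motion_module/mm_sd_v15_v2.ckpt",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]

def find_missing(root: str, required=REQUIRED_FILES) -> list:
    """Return the relative paths under `root` that do not exist as files."""
    return [rel for rel in required
            if not os.path.isfile(os.path.join(root, rel))]

if __name__ == "__main__":
    missing = find_missing("./pretrained_models")
    if missing:
        print("Missing files:", *missing, sep="\n  ")
    else:
        print("All checked pretrained models are in place.")
```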

🛠️ Prepare Inference Data

Hallo2 has a few simple requirements for input data:

For the source image:

  1. It should be cropped to a square.
  2. The face should be the main focus, making up 50%-70% of the image.
  3. The face should be facing forward, with a rotation angle of less than 30° (no side profiles).
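
The square-crop requirement above comes down to simple arithmetic. Here is a generic sketch (not code from the repo) that computes a centered square crop box from an image's dimensions:

```python
def center_square_box(width: int, height: int):
    """Return (left, top, right, bottom) of the largest centered square crop."""
    side = min(width, height)          # the square's side length
    left = (width - side) // 2         # horizontal offset to center the crop
    top = (height - side) // 2         # vertical offset to center the crop
    return (left, top, left + side, top + side)

# Example: a 1920x1080 frame yields a centered 1080x1080 crop box.
# center_square_box(1920, 1080) -> (420, 0, 1500, 1080)
```

The resulting box can be passed directly to an image library's crop call (e.g. Pillow's `Image.crop` takes exactly this `(left, top, right, bottom)` tuple).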

For the driving audio:

  1. It must be in WAV format.
  2. It must be in English since our training datasets are only in this language.
  3. Ensure the vocals are clear; background music is acceptable.
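
Whether a file meets the WAV requirement can be verified with Python's standard `wave` module before running inference. `describe_wav` is a hypothetical helper, not part of Hallo2:

```python
import wave

def describe_wav(path: str) -> dict:
    """Open a WAV file and return its basic parameters.

    Raises wave.Error (or EOFError) if the file is not a valid WAV file.
    """
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_rate": wf.getframerate(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }
```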

We have provided some samples for your reference.

🎮 Run Inference

Long-Duration Animation

Simply run scripts/inference_long.py after setting source_image, driving_audio, and save_path in the config file:

python scripts/inference_long.py --config ./configs/inference/long.yaml
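
For reference, a minimal config sketch with the three fields mentioned above. The field names come from the text; the example paths are placeholders, and any other keys in the real configs/inference/long.yaml are omitted here:

```yaml
# Minimal sketch of an inference config; the real long.yaml has more keys.
source_image: ./examples/reference_images/1.jpg   # square, front-facing portrait
driving_audio: ./examples/driving_audios/1.wav    # English speech in WAV format
save_path: ./output/long/                         # where results are written
```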

Animation results will be saved at save_path. You can find more inference examples in the examples folder.

For more options:

usage: inference_long.py [-h] [-c CONFIG] [--source_image SOURCE_I
