# OmniAvatar
Qijun Gan · Ruizi Yang · Jianke Zhu · Shaofei Xue · Steven Hoi
Zhejiang University, Alibaba Group
<div align="center"> <a href="https://omni-avatar.github.io/"><img src="https://img.shields.io/badge/Project-OmniAvatar-blue.svg"></a>   <a href="http://arxiv.org/abs/2506.18866"><img src="https://img.shields.io/badge/Arxiv-2506.18866-b31b1b.svg?logo=arXiv"></a>   <a href="https://huggingface.co/OmniAvatar/OmniAvatar-14B"><img src="https://img.shields.io/badge/🤗-OmniAvatar-red.svg"></a>   <a href="https://huggingface.co/spaces/alexnasa/OmniAvatar"><img src="https://img.shields.io/badge/🤗-HF%20Demo-yellow.svg"></a> </div>
## 🔥 Latest News!!
- July 2, 2025: We released the model weights for Wan 1.3B!
- June 24, 2025: We released the inference code and model weights!
## Quickstart
### 🛠️ Installation

Clone the repo:

```shell
git clone https://github.com/Omni-Avatar/OmniAvatar
cd OmniAvatar
```
Install dependencies:

```shell
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# Optional: install flash_attn to accelerate attention computation
pip install flash_attn
```
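As a quick sanity check (our suggestion, not part of the original instructions), you can confirm that the CUDA build of PyTorch is installed and sees a GPU:

```shell
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```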
### 🧱 Model Download
| Models                | Download Link   | Notes                                |
|-----------------------|-----------------|--------------------------------------|
| Wan2.1-T2V-14B        | 🤗 Huggingface  | Base model for 14B                   |
| OmniAvatar model 14B  | 🤗 Huggingface  | Our LoRA and audio condition weights |
| Wan2.1-T2V-1.3B       | 🤗 Huggingface  | Base model for 1.3B                  |
| OmniAvatar model 1.3B | 🤗 Huggingface  | Our LoRA and audio condition weights |
| Wav2Vec               | 🤗 Huggingface  | Audio encoder                        |
Download models using huggingface-cli:

```shell
mkdir pretrained_models
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
```
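These commands fetch the 14B models. For the 1.3B variant, the analogous downloads should look like the following; the repo IDs are inferred from the model table above, so verify them on Hugging Face before relying on them:

```shell
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./pretrained_models/Wan2.1-T2V-1.3B
huggingface-cli download OmniAvatar/OmniAvatar-1.3B --local-dir ./pretrained_models/OmniAvatar-1.3B
```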
File structure (samples for 14B):

```
OmniAvatar
├── pretrained_models
│   ├── Wan2.1-T2V-14B
│   │   └── ...
│   ├── OmniAvatar-14B
│   │   ├── config.json
│   │   └── pytorch_model.pt
│   └── wav2vec2-base-960h
│       └── ...
```
### 🔑 Inference

```shell
# 480p only for now

# 14B
torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference.yaml --input_file examples/infer_samples.txt

# 1.3B
torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference_1.3B.yaml --input_file examples/infer_samples.txt
```
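Hyperparameters can also be overridden from the command line via `--hp=` (the same mechanism used in the multi-GPU example under Tips). As an illustrative sketch using values from the recommended ranges below, not official defaults, a single-GPU 14B run might look like:

```shell
torchrun --standalone --nproc_per_node=1 scripts/inference.py \
    --config configs/inference.yaml \
    --input_file examples/infer_samples.txt \
    --hp=guidance_scale=4.5,audio_scale=3,num_steps=25
```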
### 💡 Tips

- You can control the character's behavior through the prompt in `examples/infer_samples.txt`; each line follows the format `[prompt]@@[img_path]@@[audio_path]` (see the sample input line after these tips). The recommended range for the prompt and audio CFG scale is [4-6]. You can increase the audio CFG scale for more consistent lip-sync.
- To control prompt guidance and audio guidance separately, set `audio_scale=3`; `guidance_scale` then controls only the prompts.
- To speed up inference, the recommended `num_steps` range is [20-50]; more steps bring higher quality. To use multi-GPU inference, just set `sp_size=$GPU_NUM`. To use TeaCache, set `tea_cache_l1_thresh=0.14`; the recommended range is [0.05-0.15].
- To reduce GPU memory usage, set `use_fsdp=True` and `num_persistent_param_in_dit`. An example command:

```shell
torchrun --standalone --nproc_per_node=8 scripts/inference.py --config configs/inference.yaml --input_file examples/infer_samples.txt --hp=sp_size=8,max_tokens=30000,guidance_scale=4.5,overlap_frame=13,num_steps=25,use_fsdp=True,tea_cache_l1_thresh=0.14,num_persistent_param_in_dit=7000000000
```
We present a detailed benchmark table below; the model was tested on an A800 GPU.
|model_size|torch_dtype|GPU_NUM|use_fsdp|num_persistent_param_in_dit|Speed|Required VRAM|
|-|-|-|-|-|-|-|
|14B|torch.bfloat16|1|False|None (unlimited)|16.0s/it|36G|
|14B|torch.bfloat16|1|False|7*10**9 (7B)|19.4s/it|21G|
|14B|torch.bfloat16|1|False|0|22.1s/it|8G|
|14B|torch.bfloat16|4|True|None (unlimited)|4.8s/it|14.3G|
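For example, a sketch of the low-memory single-GPU configuration from the table (the ~8 GB row), combining flags already shown in this README:

```shell
torchrun --standalone --nproc_per_node=1 scripts/inference.py \
    --config configs/inference.yaml \
    --input_file examples/infer_samples.txt \
    --hp=num_persistent_param_in_dit=0
```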
We train the 14B model with up to 30,000 tokens for 480p videos. We found that using more tokens at inference can also give good results; you can try 60,000 or 80,000. `overlap_frame` can be set to 1 or 13; 13 gives more coherent generation, but error propagation is more severe.
- ❕ Prompts are also very important. The recommended structure is `[Description of first frame] - [Description of human behavior] - [Description of background (optional)]`.
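As a concrete illustration, a line in `examples/infer_samples.txt` combining the `@@` format with this prompt structure might look like the following (the image and audio paths are hypothetical placeholders, not files known to ship with the repo):

```
A woman with long brown hair faces the camera - she speaks and gestures naturally with her hands - in a softly lit studio@@examples/images/woman.png@@examples/audios/speech.wav
```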
## 🧩 Community Works

We ❤️ contributions from the open-source community! If your work has improved OmniAvatar, please let us know, or e-mail ganqijun@zju.edu.cn directly. We are happy to reference your project for everyone's convenience. 🥸 Have fun!
## 🔗 Citation

If you find this repository useful, please consider giving it a star ⭐ and a citation:
```bibtex
@misc{gan2025omniavatar,
      title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation},
      author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
      year={2025},
      eprint={2506.18866},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.18866},
}
```
## Acknowledgments
Thanks to Wan2.1, FantasyTalking and DiffSynth-Studio for open-sourcing their models and code, which provided valuable references and support for this project. Their contributions to the open-source community are truly appreciated.