StableAvatar
We present StableAvatar, the first end-to-end video diffusion transformer, which synthesizes infinite-length high-quality audio-driven avatar videos without any post-processing, conditioned on a reference image and audio.
<a href='https://francis-rings.github.io/StableAvatar'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2508.08248'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/FrancisRing/StableAvatar/tree/main'><img src='https://img.shields.io/badge/HuggingFace-Model-orange'></a> <a href='https://huggingface.co/spaces/YinmingHuang/StableAvatar'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a> <a href='https://www.youtube.com/watch?v=6lhvmbzvv3Y'><img src='https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube'></a> <a href='https://www.bilibili.com/video/BV1hUt9z4EoQ'><img src='https://img.shields.io/badge/Bilibili-Watch-blue?style=flat-square&logo=bilibili'></a>
StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation <br/> Shuyuan Tu<sup>1</sup>, Yueming Pan<sup>3</sup>, Yinming Huang<sup>1</sup>, Xintong Han<sup>4</sup>, Zhen Xing<sup>1</sup>, Qi Dai<sup>2</sup>, Chong Luo<sup>2</sup>, Zuxuan Wu<sup>1</sup>, Yu-Gang Jiang<sup>1</sup> <br/> [<sup>1</sup>Fudan University; <sup>2</sup>Microsoft Research Asia; <sup>3</sup>Xi'an Jiaotong University; <sup>4</sup>Tencent Inc]
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/d7eca208-6a14-46af-b337-fb4d2b66ba8d" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/b15784b1-c013-4126-a764-10c844341a4e" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/87faa5c1-a118-4a03-a071-45f18e87e6a0" width="320" controls loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/531eb413-8993-4f8f-9804-e3c5ec5794d4" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/cdc603e2-df46-4cf8-a14e-1575053f996f" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/7022dc93-f705-46e5-b8fc-3a3fb755795c" width="320" controls loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/0ba059eb-ff6f-4d94-80e6-f758c613b737" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/03e6c1df-85c6-448d-b40d-aacb8add4e45" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/90b78154-dda0-4eaa-91fd-b5485b718a7f" width="320" controls loop></video> </td> </tr> </table> <p style="text-align: justify;"> <span>Audio-driven avatar videos generated by StableAvatar, showing its power to synthesize <b>infinite-length</b> and <b>ID-preserving videos</b>. 
All videos are <b>directly synthesized by StableAvatar without the use of any face-related post-processing tools</b>, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.</span> </p> <p align="center"> <video src="https://github.com/user-attachments/assets/90691318-311e-40b9-9bd9-62db83ab1492" width="768" autoplay loop muted playsinline></video> <br/> <span>Comparison results between StableAvatar and state-of-the-art (SOTA) audio-driven avatar video generation models highlight the superior performance of StableAvatar in delivering <b>infinite-length, high-fidelity, identity-preserving avatar animation</b>.</span> </p>
Overview
<p align="center"> <img src="assets/figures/framework.jpg" alt="model architecture" width="1280"/> <br/> <i>Overview of the StableAvatar framework.</i> </p>
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main obstacle preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then injected directly into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, so the latent distribution of subsequent segments gradually drifts away from the optimal distribution. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism that further enhances audio synchronization by leveraging the diffusion model's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.
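To make the idea of fusing overlapping windows over time more concrete, here is a minimal, illustrative sketch. It is not the StableAvatar implementation: the linear ramp weights, the function names `ramp_weights` and `fuse_windows`, and the use of plain floats as stand-ins for per-frame latents are all assumptions made for illustration. The README does not describe the actual weighting scheme.

```python
from typing import List, Sequence


def ramp_weights(length: int, overlap: int) -> List[float]:
    """Per-frame weights that ramp up over the first `overlap` frames and down over the last."""
    weights = []
    for i in range(length):
        rise = (i + 1) / (overlap + 1) if i < overlap else 1.0
        fall = (length - i) / (overlap + 1) if i >= length - overlap else 1.0
        weights.append(min(rise, fall))
    return weights


def fuse_windows(windows: Sequence[Sequence[float]], stride: int) -> List[float]:
    """Blend equally sized, overlapping windows into one sequence by weighted averaging."""
    length = len(windows[0])
    overlap = length - stride  # frames shared by consecutive windows
    total = stride * (len(windows) - 1) + length
    acc = [0.0] * total   # weighted sum of values per output frame
    norm = [0.0] * total  # sum of weights per output frame
    weights = ramp_weights(length, overlap)
    for k, window in enumerate(windows):
        start = k * stride
        for i, value in enumerate(window):
            acc[start + i] += weights[i] * value
            norm[start + i] += weights[i]
    return [a / n for a, n in zip(acc, norm)]
```

In this sketch, frames covered by two windows are a weighted average in which the earlier window's influence fades out as the later window's fades in, which avoids a hard cut at window boundaries.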
News
- [2025-10-12]: 🔥 We fixed the inconsistency between multi-GPU and single-GPU inference. The patch has been merged; please update to the latest version and give it a try! 🙌
- [2025-9-8]: 🔥 We are thrilled to release an interesting 🔥🔥 brand new demo 🔥🔥! The generated videos can be seen on YouTube and Bilibili.
- [2025-8-29]: 🔥 The StableAvatar public demo is now live on Hugging Face Spaces. (Note: due to the long video generation time, the demo is currently accessible to <b>Hugging Face Pro</b> users only.)
- [2025-8-18]: 🔥 StableAvatar can run on ComfyUI in just 10 steps, making it 3x faster. Thanks @smthemex for the contribution.
- [2025-8-16]: 🔥 We released the finetuning code and the LoRA training/finetuning code! Other code will be made public as soon as possible. Stay tuned!
- [2025-8-15]: 🔥 StableAvatar can run on a Gradio interface. Thanks @gluttony-10 for the contribution!
- [2025-8-15]: 🔥 StableAvatar can run on ComfyUI. Thanks @smthemex for the contribution.
- [2025-8-13]: 🔥 Added changes to run StableAvatar on the new Blackwell series Nvidia chips, including the RTX 6000 Pro.
- [2025-8-11]: 🔥 The project page, code, technical report and a basic model checkpoint are released. Further LoRA training code, the evaluation dataset and StableAvatar-pro will be released very soon. Stay tuned!
🛠️ To-Do List
- [x] StableAvatar-1.3B-basic
- [x] Inference Code
- [x] Data Pre-Processing Code (Audio Extraction)
- [x] Data Pre-Processing Code (Vocal Separation)
- [x] Training Code
- [x] Full Finetuning Code
- [x] LoRA Training Code
- [x] LoRA Finetuning Code
- [ ] Inference Code with Audio Native Guidance
- [ ] StableAvatar-pro
🔑 Quickstart
The basic model checkpoint (based on Wan2.1-1.3B) supports generating <b>infinite-length videos at a resolution of 480x832, 832x480, or 512x512</b>. If you run out of memory, reduce the number of animated frames or the output resolution.
🧱 Environment setup
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Optional: install flash_attn to accelerate attention computation
pip install flash_attn
🧱 Environment setup for Blackwell series chips
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
# Optional: install flash_attn to accelerate attention computation
pip install flash_attn
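After either setup, it can help to confirm which versions actually got installed. The following is a small sketch using only the standard library; the helper names `installed_versions` and `mismatches` are hypothetical and not part of the repository. The pins in the example are copied from the CUDA 12.4 command above and should be adjusted for the Blackwell setup.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def mismatches(expected: Dict[str, str], found: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Return packages that are missing or whose version does not start with the expected pin."""
    return {
        name: found.get(name)
        for name, pin in expected.items()
        if found.get(name) is None or not found[name].startswith(pin)
    }


if __name__ == "__main__":
    # Pins taken from the CUDA 12.4 install command; change them for Blackwell chips.
    pins = {"torch": "2.6.0", "torchvision": "0.21.0"}
    print(mismatches(pins, installed_versions(pins)))
```

The prefix comparison is deliberate: wheels from the PyTorch index typically report a local version suffix such as `2.6.0+cu124`, which an exact string comparison would flag as a mismatch.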
🧱 Download weights
If you encounter connection issues with Hugging Face, you can use the mirror endpoint by setting the following environment variable: export HF_ENDPOINT=https://hf-mirror.com
Please download weights manually as follows:
pip install "huggingface_hub[cli]"
cd StableAvatar
mkdir checkpoints
huggingface-cli download FrancisRing/StableAvatar --local-dir ./checkpoints
All the weights should be placed in the checkpoints directory. The overall file structure of this project should be organized as follows:
StableAvatar/
├── accelerate_config
├── deepspeed_config
├── examples
├── wan
├── checkpoints
│   ├── Kim_Vocal_2.onnx
│   ├── wav2vec2-base-960h
│   ├── Wan2.1-Fun-V1.1-1.3B-InP
│   └── StableAvatar-1.3B
├── inference.py
├── inference.sh
├── train_1B_square.py
├── train_1B_square.sh
├── train_1B_vec_rec.py
├── train_1B_vec_rec.sh
├── audio_extractor.py
├── vocal_seperator.py
└── requirements.txt
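A quick way to confirm the download landed where the tree above expects it is a small check like the one below. This is a sketch rather than a script shipped with the repository: the function name `missing_checkpoints` is hypothetical, and the expected entries are simply copied from the file tree.

```python
from pathlib import Path
from typing import List

# Entries copied from the file tree above.
EXPECTED_CHECKPOINTS = (
    "Kim_Vocal_2.onnx",
    "wav2vec2-base-960h",
    "Wan2.1-Fun-V1.1-1.3B-InP",
    "StableAvatar-1.3B",
)


def missing_checkpoints(project_root: str) -> List[str]:
    """Return the expected entries that are absent from <project_root>/checkpoints."""
    checkpoints = Path(project_root) / "checkpoints"
    return [name for name in EXPECTED_CHECKPOINTS if not (checkpoints / name).exists()]


if __name__ == "__main__":
    # Run from the StableAvatar project root.
    missing = missing_checkpoints(".")
    print("All checkpoints found." if not missing else f"Missing: {missing}")
```

The check only looks at top-level names; it does not verify the contents of each model directory.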
🧱 Audio Extraction
Given the target video file (.mp4), you can use the following command to obtain the corresponding audio file (.wav):
python audio_extractor.py --video_path="path/test/video.mp4" --saved_audio_path="path/test/audio.wav"
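If you have a folder of videos, the same command can be generated for each file. The sketch below only builds the argument lists, reusing the `--video_path` and `--saved_audio_path` flags shown above; the helper name `build_extract_commands` and the directory layout are assumptions for illustration.

```python
import sys
from pathlib import Path
from typing import List


def build_extract_commands(video_dir: str, audio_dir: str) -> List[List[str]]:
    """Build one audio_extractor.py command per .mp4 file found in video_dir."""
    commands = []
    for video in sorted(Path(video_dir).glob("*.mp4")):
        # Each output keeps the video's base name with a .wav extension.
        audio = Path(audio_dir) / (video.stem + ".wav")
        commands.append([
            sys.executable, "audio_extractor.py",
            f"--video_path={video}",
            f"--saved_audio_path={audio}",
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the project root, so that a failure on one video stops the batch instead of passing silently.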
🧱 Vocal Separation
Because noisy background music may degrade the performance of StableAvatar to some extent, you can further separate the vocals from the audio file for better lip synchronization. Given the path to an audio file (.wav), you can run the following command to extract the co
