LiveAvatar
Implementation of "Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length"
<a href="https://arxiv.org/abs/2512.04677"><img src="https://img.shields.io/badge/arXiv-2512.04677-b31b1b.svg?style=for-the-badge" alt="arXiv"></a> <a href="https://huggingface.co/papers/2512.04677"><img src="https://img.shields.io/badge/🤗%20Daily%20Paper-ff9d00?style=for-the-badge" alt="Daily Paper"></a> <a href="https://huggingface.co/Quark-Vision/Live-Avatar"><img src="https://img.shields.io/badge/Hugging%20Face-Model-ffbd45?style=for-the-badge&logo=huggingface&logoColor=white" alt="HuggingFace"></a> <a href="https://github.com/Alibaba-Quark/LiveAvatar"><img src="https://img.shields.io/badge/Github-Code-black?style=for-the-badge&logo=github" alt="Github"></a> <a href="https://liveavatar.github.io/"><img src="https://img.shields.io/badge/Project-Page-blue?style=for-the-badge&logo=googlechrome&logoColor=white" alt="Project Page"></a>
<div align="center">

TL;DR: Live Avatar is an algorithm–system co-designed framework that enables real-time, streaming, infinite-length interactive avatar video generation. Powered by a 14B-parameter diffusion model, it achieves 45 FPS on multiple H800 GPUs with 4-step sampling and supports Block-wise Autoregressive processing for 10,000+ second streaming videos.
<strong>👀 More Demos:</strong> <br> 🤖 Human-AI Conversation | ♾️ Infinite Video | 🎭 Diverse Characters | 🎬 Animated Tech Explanation <br> <a href="https://liveavatar.github.io/"> <strong>👉 Click Here to Visit Project Page! 🌐</strong> </a> <br>
</div>

✨ Highlights
- ⚡ Real-time Streaming Interaction - Achieve 45 FPS real-time streaming with low latency
- ♾️ Infinite-length Autoregressive Generation - Support 10,000+ second continuous video generation
- 🎨 Strong Generalization - Generalizes well across cartoon characters, singing, and other diverse scenarios
📰 News
- [2026.1.20] 🚀 Major performance breakthrough (v1.1)! FP8 quantization enables inference on 48GB GPUs, while advanced compilation and cuDNN attention boost speed to roughly 2.5x peak FPS and 3x average FPS, achieving a stable 45+ FPS on multiple H800 GPUs. Share your results on different GPUs! Inference fixes also bring noticeable quality improvements, significantly surpassing the teacher model on qualitative metrics.
- [2025.12.16] 🎉 LiveAvatar has reached 1,000+ stars on GitHub! Thank you to the community for the incredible support! ⭐
- [2025.12.12] 🚀 We released the single-GPU inference code. No need for a 5×H800 (house-priced) server; a single GPU with 80GB VRAM is enough.
- [2025.12.08] 🚀 We released the real-time inference code and the model weights.
- [2025.12.08] 🎉 LiveAvatar won Hugging Face #1 Paper of the Day!
- [2025.12.04] 🏃♂️ We committed to open-sourcing the code in early December.
- [2025.12.04] 🔥 We released the paper and the demo website.
📑 Todo List
🌟 Early December (core code release)
- ✅ Release the paper
- ✅ Release the demo website
- ✅ Release checkpoints on Hugging Face
- ✅ Release Gradio Web UI
- ✅ Experimental real-time streaming inference on at least H800 GPUs
- ✅ Distribution-matching distillation to 4 steps
- ✅ Timestep-forcing pipeline parallelism
⚙️ Later updates
- ✅ Inference code supporting single GPU (offline generation)
- ✅ Multi-character support
- ✅ Inference Acceleration Stage 1 (RoPE optimization, compilation, LoRA merge)
- ✅ Streaming-VAE integration
- ✅ Inference Acceleration Stage 2 (further compilation, FP8, cuDNN attention)
- ⬜ UI integration for easy streaming interaction
- ⬜ TTS integration
- ⬜ Training code
- ⬜ LiveAvatar v1.2
🛠️ Installation
Please follow the steps below to set up the environment.
1. Create Environment
```bash
conda create -n liveavatar python=3.10 -y
conda activate liveavatar
```
2. Install CUDA Dependencies (optional)
```bash
conda install nvidia/label/cuda-12.4.1::cuda -y
conda install -c nvidia/label/cuda-12.4.1 cudatoolkit -y
```
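If the toolkit installs correctly, nvcc inside the activated environment should report a 12.4 release; a quick check:

```bash
# Confirm the conda-provided CUDA toolkit is the one visible in this environment.
nvcc --version
```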
3. Install PyTorch & Flash Attention
```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128

# If you are using the NVIDIA Hopper architecture (H800/H200, etc.), FlashAttention 3 is recommended for a significant speedup:
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch280 --extra-index-url https://download.pytorch.org/whl/cu128

# Otherwise, use FlashAttention 2:
pip install flash-attn==2.8.3 --no-build-isolation
```
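If you are unsure whether your GPU is Hopper-class, its compute capability is a quick way to choose between the two wheels (Hopper reports 9.0). A minimal sketch, assuming a driver recent enough to support the compute_cap query field:

```bash
# Decide between FlashAttention 3 (Hopper, compute capability 9.x) and FlashAttention 2.
cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0)
if [ "${cc%%.*}" -ge 9 ]; then
  echo "Compute capability ${cc}: Hopper-class GPU, FlashAttention 3 recommended."
else
  echo "Compute capability ${cc}: install FlashAttention 2 instead."
fi
```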
4. Install Python Requirements
```bash
pip install -r requirements.txt
```
5. Install FFMPEG
```bash
apt-get update && apt-get install -y ffmpeg
```
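As a final sanity check, you can confirm that PyTorch sees the GPU and that ffmpeg is on the PATH; this only exercises the dependencies installed above, nothing specific to LiveAvatar:

```bash
# Verify the core dependencies are importable and CUDA is visible.
python -c "import torch; print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())"
ffmpeg -version | head -n 1
```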
📥 Download Models
Please download the pretrained checkpoints from the links below and place them in the ./ckpt/ directory.
| Model Component | Description | Link |
| :--- | :--- | :---: |
| Wan2.2-S2V-14B | Base model | [🤗 Huggingface](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B) |
| LiveAvatar | Our LoRA model | [🤗 Huggingface](https://huggingface.co/Quark-Vision/Live-Avatar) |
```bash
# If you are in mainland China, run this first: export HF_ENDPOINT=https://hf-mirror.com
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-S2V-14B --local-dir ./ckpt/Wan2.2-S2V-14B
huggingface-cli download Quark-Vision/Live-Avatar --local-dir ./ckpt/LiveAvatar
```
After downloading, your directory structure should look like this:
```
ckpt/
├── Wan2.2-S2V-14B/                 # Base model
│   ├── config.json
│   ├── diffusion_pytorch_model-*.safetensors
│   └── ...
└── LiveAvatar/                     # Our LoRA model
    ├── liveavatar.safetensors
    └── ...
```
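Before moving on to inference, a quick listing can confirm that the two checkpoints landed in the expected locations (file names taken from the tree above):

```bash
# Both paths should exist if the downloads above completed successfully.
ls ckpt/Wan2.2-S2V-14B/config.json
ls ckpt/LiveAvatar/liveavatar.safetensors
```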
🚀 Inference
Real-time Inference with TPP
💡 Currently, this command can run on GPUs with at least 80GB of VRAM.
```bash
# CLI Inference
bash infinite_inference_multi_gpu.sh

# Gradio Web UI
bash gradio_multi_gpu.sh
```
💡 The model generates videos from audio input combined with a reference image and an optional text prompt.
💡 The `size` parameter sets the area of the generated video; the aspect ratio follows that of the original input image.
💡 The `--num_clip` parameter controls the number of video clips generated, which is useful for quick previews with shorter generation time.
💡 Currently, our TPP pipeline requires five GPUs for inference. We plan to develop a 3-step version that can be deployed on a 4-GPU cluster, and to integrate the LightX2V VAE component; this integration will eliminate the dependency on additional single-GPU VAE parallelism and support 4-step inference within a 4-GPU setup.
💡 Compilation (`ENABLE_COMPILE`): Enabling compilation causes a long wait during the first inference while the model compiles, but subsequent runs see significant performance improvements. This is highly valuable for streaming long videos. If you just want to quickly run a few test cases, we recommend disabling it by setting `export ENABLE_COMPILE=false` in your inference script (see the example below).
💡 FP8 Quantization (`ENABLE_FP8`): FP8 offers notable VRAM savings, enabling inference on 48GB GPUs, and also provides modest performance gains. Note that this may cause slight quality degradation. You can enable it by setting `export ENABLE_FP8=true` in your inference script.
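For example, a quick test run might disable compilation and enable FP8 before launching the multi-GPU script. This is only a sketch built from the switches described above; check the inference script in your checkout for where these variables are read and what their defaults are:

```bash
# Quick-test configuration: skip the one-time compilation wait and trade a little
# quality for lower VRAM usage via FP8, then launch the streaming pipeline.
export ENABLE_COMPILE=false   # avoid the long first-run compilation for short tests
export ENABLE_FP8=true        # fits 48GB GPUs; may slightly degrade quality
bash infinite_inference_multi_gpu.sh
```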
Please visit our project page to see more examples and learn about the scenarios suitable for this model.
Single-GPU Inference
💡 This command can run on a single GPU with at least 80GB VRAM.
```bash
# CLI Inference
bash infinite_inference_single_gpu.sh

# Gradio Web UI
bash gradio_single_gpu.sh
```
💡 If you encounter OOM errors after multiple runs in the Gradio Web UI, try restarting the Gradio service to release the accumulated GPU memory.

