# ARTalk
ARTalk generates realistic 3D head motions (lip sync, blinking, expressions, head poses) from audio in ⚡ real-time ⚡.
## Installation
Clone the project:
```shell
git clone --recurse-submodules git@github.com:xg-chu/ARTalk.git
cd ARTalk
```
Build the environment:
```shell
conda env create -f environment.yml
conda activate ARTalk
```
Install the GAGAvatar module (optional, required for rendering realistic avatars). If it is not installed, set `load_gaga` to `False` when initializing `ARTAvatarInferEngine`.
```shell
git clone --recurse-submodules git@github.com:xg-chu/diff-gaussian-rasterization.git
pip install ./diff-gaussian-rasterization
rm -rf ./diff-gaussian-rasterization
```
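Before constructing the engine, you can detect at runtime whether the optional module was installed and set the flag accordingly. A minimal sketch, assuming the pip-installed package is importable as `diff_gaussian_rasterization` (check the actual module name in your environment):

```python
import importlib.util


def gaga_available() -> bool:
    """Return True if the diff-gaussian-rasterization package is importable.

    Assumption: the pip package exposes the module name
    'diff_gaussian_rasterization'; adjust if your install differs.
    """
    return importlib.util.find_spec("diff_gaussian_rasterization") is not None


# Pass the result as load_gaga when initializing the engine, e.g.:
# engine = ARTAvatarInferEngine(load_gaga=gaga_available())
print(gaga_available())
```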
Prepare resources:
```shell
bash ./build_resources.sh
```
## Quick Start Guide
### Using <a href="https://github.com/gradio-app/gradio">Gradio</a> Interface
We provide a simple Gradio demo to demonstrate ARTalk's capabilities.
You can generate videos by uploading audio, recording audio, or entering text:
<h1 align="left"><b> <picture> <source srcset="./assets/dark_artalk_gradio.jpg" media="(prefers-color-scheme: dark)" width="512"> <img src="./assets/light_artalk_gradio.jpg" alt="Adaptive Image" width="512"> </picture> </b></h1>

```shell
python inference.py --run_app
```

### Command Line Usage
ARTalk can be used via command line:
```shell
python inference.py -a your_audio_path --shape_id your_appearance --style_id your_style_motion --clip_length 750
```
- `--shape_id` can be specified with a mesh or with a tracked real avatar stored in `tracked.pt`.
- `--style_id` can be specified with the name of a `*.pt` file stored in `assets/style_motion`.
- `--clip_length` sets the maximum length of the rendered video and can be adjusted as needed; longer videos take more time to render.
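Since the tracked motions run at 50 frames per 2 seconds (25 fps), `--clip_length` is a frame count: `750` corresponds to roughly 30 seconds of video. A small hypothetical helper for the conversion, assuming the rendered video uses the same 25 fps rate:

```python
FPS = 25  # assumption: 50 frames span 2 seconds, matching the style motions


def seconds_to_clip_length(seconds: float, fps: int = FPS) -> int:
    """Convert a desired video duration in seconds to a --clip_length frame count."""
    return round(seconds * fps)


print(seconds_to_clip_length(30))  # 750, i.e. --clip_length 750 is about 30 s
```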
The file `tracked.pt` is generated using <a href="https://github.com/xg-chu/GAGAvatar/blob/main/inference.py">GAGAvatar/inference.py</a>. We have included several examples of tracked avatars for quick testing.
The style motions are tracked with the EMICA module in <a href="https://github.com/xg-chu/GAGAvatar_track">GAGAvatar_track</a>. Each contains 50×106-dimensional data: 50 consecutive frames spanning 2 seconds, where each frame's 106 dimensions consist of 100 expression codes and 6 pose codes (base + jaw). We have included several examples of tracked style motions.
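The 50×106 layout described above can be split into its expression and pose parts. A minimal sketch with NumPy, where a random tensor stands in for a real `*.pt` style motion loaded from `assets/style_motion`:

```python
import numpy as np

# Stand-in for a loaded style motion,
# e.g. torch.load("assets/style_motion/<name>.pt")
style_motion = np.random.randn(50, 106)  # 50 frames = 2 s of motion

expression = style_motion[:, :100]  # 100 expression codes per frame
pose = style_motion[:, 100:]        # 6 pose codes per frame (base + jaw)

print(expression.shape, pose.shape)  # (50, 100) (50, 6)
```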
## Training
This version modifies the VQVAE part compared to the paper version.
The training code has been released for reference. It is also similar to the <a href="https://github.com/xg-chu/UniLS">UniLS training code</a>.
## Hugging Face Dockerfile
To use the Dockerfile on Hugging Face, you need to change the Gradio server port accordingly.
## Acknowledgements
We thank <a href="https://www.linkedin.com/in/lars-traaholt-vågnes-432725130/">Lars Traaholt Vågnes</a> and <a href="https://emmanueliarussi.github.io">Emmanuel Iarussi</a> from <a href="https://www.simli.com">Simli</a> for the insightful discussions! 🤗
The ARTalk logo was designed by Caihong Ning.
Part of our work is built on FLAME. We also thank the following projects for sharing their great work:
- GAGAvatar: https://github.com/xg-chu/GAGAvatar
- GPAvatar: https://github.com/xg-chu/GPAvatar
- FLAME: https://flame.is.tue.mpg.de
- EMICA: https://github.com/radekd91/inferno
## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{chu2025artalk,
  title={ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model},
  author={Xuangeng Chu and Nabarun Goswami and Ziteng Cui and Hanqin Wang and Tatsuya Harada},
  year={2025},
  eprint={2502.20323},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.20323},
}
```
