MuseTalk
MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
<strong>MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling</strong>
Yue Zhang<sup>*</sup>, Zhizhou Zhong<sup>*</sup>, Minhao Liu<sup>*</sup>, Zhaokang Chen, Bin Wu<sup>†</sup>, Yubin Zeng, Chao Zhan, Junxin Huang, Yingjie He, Wenjiang Zhou (<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)
Lyra Lab, Tencent Music Entertainment
github huggingface space Technical report
We introduce MuseTalk, a real-time, high-quality lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., those generated by MuseV, as a complete virtual human solution.
🔥 Updates
We're excited to unveil MuseTalk 1.5. This version (1) adds perceptual loss, GAN loss, and sync loss to training, significantly boosting overall performance, and (2) uses a two-stage training strategy and a spatio-temporal data sampling approach to balance visual quality and lip-sync accuracy. Learn more details here. The inference code, training code, and model weights of MuseTalk 1.5 are all available now! 🚀
Overview
MuseTalk is a real-time, high-quality, audio-driven lip-syncing model trained in the latent space of ft-mse-vae, which
- modifies an unseen face according to the input audio, with a face region size of 256 x 256.
- supports audio in various languages, such as Chinese, English, and Japanese.
- supports real-time inference at 30fps+ on an NVIDIA Tesla V100.
- supports modifying the center point of the proposed face region, which SIGNIFICANTLY affects generation results.
- provides a checkpoint trained on the HDTF dataset and a private dataset.
News
- [04/05/2025] :mega: We are excited to announce that the training code is now open-sourced! You can now train your own MuseTalk model using our provided training scripts and configurations.
- [03/28/2025] We are thrilled to announce the release of our 1.5 version. This version is a significant improvement over the 1.0 version, with enhanced clarity, identity consistency, and precise lip-speech synchronization. We have updated the technical report with more details.
- [10/18/2024] We release the technical report. Our report details a superior model to the open-source L1 loss version. It includes GAN and perceptual losses for improved clarity, and sync loss for enhanced performance.
- [04/17/2024] We release a pipeline that utilizes MuseTalk for real-time inference.
- [04/16/2024] Release Gradio demo on HuggingFace Spaces (thanks to HF team for their community grant)
- [04/02/2024] Release MuseTalk project and pretrained models.
Model
MuseTalk was trained in latent space, where the images were encoded by a frozen VAE. The audio was encoded by a frozen whisper-tiny model. The architecture of the generation network was borrowed from the UNet of stable-diffusion-v1-4, where the audio embeddings were fused with the image embeddings by cross-attention.
Note that although we use an architecture very similar to that of Stable Diffusion, MuseTalk is distinct in that it is NOT a diffusion model. Instead, MuseTalk operates by inpainting in the latent space in a single step.
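To make the cross-attention fusion described above more concrete, here is a minimal sketch in plain Python. It is not MuseTalk's implementation: it assumes a single attention head, toy 2-dimensional embeddings, and identity projections in place of the learned query/key/value weights, and the names `cross_attention`, `image_tokens`, and `audio_tokens` are hypothetical. Image latent tokens act as queries, and audio embeddings act as keys and values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, audio_tokens):
    """Single-head cross-attention with identity projections (illustrative only).

    Each image token (query) attends over all audio tokens (keys/values) and
    is replaced by the attention-weighted average of the audio tokens.
    """
    d = len(audio_tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in image_tokens:
        # Scaled dot-product score between this query and every audio key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in audio_tokens]
        weights = softmax(scores)
        # Weighted sum of the audio values, one output dimension at a time.
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, audio_tokens)) for j in range(d)]
        )
    return fused


# Toy inputs: three image tokens and two audio tokens, all 2-dimensional.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
audio_tokens = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(image_tokens, audio_tokens)
```

In a real UNet the attention output is usually projected and added back to the image features through a residual connection; the sketch omits both to keep the core weighting step visible.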
Cases
<table> <tr> <td width="33%">Input Video
https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107
https://github.com/user-attachments/assets/1ce3e850-90ac-4a31-a45f-8dfa4f2960ac
https://github.com/user-attachments/assets/fa3b13a1-ae26-4d1d-899e-87435f8d22b3
https://github.com/user-attachments/assets/15800692-39d1-4f4c-99f2-aef044dc3251
https://github.com/user-attachments/assets/a843f9c9-136d-4ed4-9303-4a7269787a60
https://github.com/user-attachments/assets/6eb4e70e-9e19-48e9-85a9-bbfa589c5fcb
</td> <td width="33%">MuseTalk 1.0
https://github.com/user-attachments/assets/c04f3cd5-9f77-40e9-aafd-61978380d0ef
https://github.com/user-attachments/assets/2051a388-1cef-4c1d-b2a2-3c1ceee5dc99
https://github.com/user-attachments/assets/b5f56f71-5cdc-4e2e-a519-454242000d32
https://github.com/user-attachments/assets/a5843835-04ab-4c31-989f-0995cfc22f34
https://github.com/user-attachments/assets/3dc7f1d7-8747-4733-bbdd-97874af0c028
https://github.com/user-attachments/assets/3c78064e-faad-4637-83ae-28452a22b09a
</td> <td width="33%">MuseTalk 1.5
https://github.com/user-attachments/assets/999a6f5b-61dd-48e1-b902-bb3f9cbc7247
https://github.com/user-attachments/assets/d26a5c9a-003c-489d-a043-c9a331456e75
https://github.com/user-attachments/assets/471290d7-b157-4cf6-8a6d-7e899afa302c
https://github.com/user-attachments/assets/1ee77c4c-8c70-4add-b6db-583a12faa7dc
https://github.com/user-attachments/assets/370510ea-624c-43b7-bbb0-ab5333e0fcc4
https://github.com/user-attachments/assets/b011ece9-a332-4bc1-b8b7-ef6e383d7bde
</td> </tr> </table>
TODO:
- [x] trained models and inference codes.
- [x] Huggingface Gradio demo.
- [x] codes for real-time inference.
- [x] technical report.
- [x] a better model with updated technical report.
- [x] realtime inference code for 1.5 version.
- [x] training and data preprocessing codes.
- [ ] always welcome to submit issues and PRs to improve this repository! 😊
Getting Started
We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:
Third party integration
Thanks to the community for the third-party integrations, which make installation and use more convenient for everyone. Please note that we have not verified, maintained, or updated these third-party integrations. Refer to the respective project for specific results.
ComfyUI
Installation
To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
Build environment
We recommend Python 3.10 and CUDA 11.7. Set up your environment as follows:
conda create -n MuseTalk python==3.10
conda activate MuseTalk
Install PyTorch 2.0.1
Choose one of the following installation methods:
# Option 1: Using pip
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
# Option 2: Using conda
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
Install Dependencies
Install the remaining required packages:
pip install -r requirements.txt
Install MMLab Packages
Install the MMLab ecosystem packages:
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv==2.0.1"
mim install "mmdet==3.1.0"
mim install "mmpose==1.1.0"
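After these installation steps, it can help to confirm that the installed versions match the pins listed above. The following is a minimal sketch, not part of the MuseTalk repository: the helper name `check_pins` and the `PINNED` table are hypothetical, and it assumes the packages are registered under the distribution names shown. It reads package metadata only, so it does not import torch or require a GPU.

```python
from importlib import metadata

# Version pins copied from the installation commands above.
PINNED = {
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "torchaudio": "2.0.2",
    "mmcv": "2.0.1",
    "mmdet": "3.1.0",
    "mmpose": "1.1.0",
}


def check_pins(pins=PINNED):
    """Return {package: installed_version_or_None} for each package off its pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed at all.
            mismatches[name] = None
            continue
        # Wheels from the PyTorch index carry a local tag such as "+cu118";
        # compare only the public part of the version string.
        if installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches
```

Calling `check_pins()` inside the activated environment returns an empty dict when every pin is satisfied; any entry in the result names a package that is missing (`None`) or installed at a different version.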
Setup FFmpeg
- Download the ffmpeg-static package
- Configure FFmpeg based on your operating system:
For Linux:
export FFMPEG_PATH=/path/to/ffmpeg
# Example:
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
For Windows:
Add the ffmpeg-xxx\bin directory to your system's PATH environment variable. Verify the installation by running ffmpeg -version in the command prompt - it should display the ffmpeg version information.
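The FFmpeg setup above can also be checked from Python before running inference. This is a minimal sketch under two assumptions: that `FFMPEG_PATH` names the directory containing the `ffmpeg` binary (as in the Linux example above), and that falling back to the regular `PATH` is acceptable. The helper name `find_ffmpeg` is hypothetical and is not part of MuseTalk.

```python
import os
import shutil


def find_ffmpeg(env=None):
    """Return the path to an ffmpeg executable, or None if none is found.

    Looks in the directory given by FFMPEG_PATH first (assumed to be a
    directory, not the binary itself), then falls back to PATH.
    """
    env = os.environ if env is None else env
    custom_dir = env.get("FFMPEG_PATH")
    if custom_dir:
        found = shutil.which("ffmpeg", path=custom_dir)
        if found:
            return found
    return shutil.which("ffmpeg", path=env.get("PATH"))
```

A `None` result means neither location contains an executable named `ffmpeg`, which is the same condition under which `ffmpeg -version` would fail in the shell.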
Download weights
You can download weights in two ways:
Option 1: Using Download Scripts
We provide two scripts for automatic downloading:
For Linux:
sh ./download_weights.sh
For Windows:
# Run the script
download_weights.bat
Option 2: Manual Download
You can also download the weights manually from the following links:
- Download our trained weights
- Download the weights of other components:
Finally, these weights should be organized in models as follows:
./models/
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── musetalkV15
│   ├── musetalk.json
│   └── unet.pth
├── syncnet
│   └── latentsync_syncnet.pt
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    ├── config.json
    ├── pytorch_model.bin
    └── preprocessor_config.json
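A quick way to confirm the layout above after downloading is to check each expected file. The sketch below is not part of the MuseTalk repository; the helper name `missing_weights` is hypothetical, and the file list is copied from the directory tree above. It only tests that the files exist, not that their contents are valid.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "musetalk/musetalk.json",
    "musetalk/pytorch_model.bin",
    "musetalkV15/musetalk.json",
    "musetalkV15/unet.pth",
    "syncnet/latentsync_syncnet.pt",
    "dwpose/dw-ll_ucoco_384.pth",
    "face-parse-bisent/79999_iter.pth",
    "face-parse-bisent/resnet18-5c106cde.pth",
    "sd-vae/config.json",
    "sd-vae/diffusion_pytorch_model.bin",
    "whisper/config.json",
    "whisper/pytorch_model.bin",
    "whisper/preprocessor_config.json",
]


def missing_weights(models_dir="./models"):
    """Return the expected weight files that are absent under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Running `missing_weights()` from the repository root returns an empty list when the layout is complete; otherwise it lists the relative paths that still need to be downloaded.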
Quickstart
Inference
We provide inference scripts for both versions of MuseTalk:
Prerequisites
Before running inference, please ensure ffmpeg is installed and accessible:
# Check ffmpeg installation
ffmpeg -version
If ffmpeg is not found, please install it first:
- Windows: Download from ffmpeg-static and add to PATH
- Linux:
sudo apt-get install ffmpeg
Normal Inference
Linux Environment
# MuseTalk 1.5 (Recommended)
sh inference.sh v1.5 normal
# MuseTalk 1.0
sh inference.sh v1.0 normal
Windows Environment