MultiTalk
[NeurIPS 2025] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Zhe Kong* · Feng Gao* · Yong Zhang<sup>✉</sup> · Zhuoliang Kang · Xiaoming Wei · Xunliang Cai
Guanying Chen · Wenhan Luo<sup>✉</sup>
<sup>*</sup>Equal Contribution <sup>✉</sup>Corresponding Authors
<a href='https://meigen-ai.github.io/multi-talk/'><img src='https://img.shields.io/badge/Project-Page-green'></a> <a href='https://arxiv.org/abs/2505.22647'><img src='https://img.shields.io/badge/Technique-Report-red'></a> <a href='https://huggingface.co/MeiGen-AI/MeiGen-MultiTalk'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
</div><p align="center"> <img src="assets/pipe.png"> </p>

TL;DR: MultiTalk is an audio-driven framework for multi-person conversational video generation. It enables the creation of videos featuring multi-person conversation 💬, singing 🎤, interaction control 👬, and cartoon characters 🙊.
Video Demos
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/e55952e6-e1b2-44a5-9887-a89307a378da" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/f0396c19-d459-42aa-9d78-34fdea10de18" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/3576fd04-3e5f-4933-ac7b-1c4e6a601379" width="320" controls loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/5589056e-3202-442d-a62a-2cad7a7ecb19" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/554bfbe7-0090-492c-94be-329f5e39e175" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/9e961f35-9413-4846-a806-8186d54061da" width="320" controls loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/342595ab-cf75-4872-8182-f20fe8c95611" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/6476f9f0-35e0-4484-91a4-8aa646aa994a" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/d8fc8e94-0cba-4c25-9f3a-a8d7e0a785e1" width="320" controls loop></video> </td> </tr> </table>

✨ Key Features
We propose MultiTalk, a novel framework for audio-driven multi-person conversational video generation. Given multi-stream audio, a reference image, and a prompt, MultiTalk generates a video whose interactions follow the prompt and whose lip motions stay aligned with the audio.
- 💬 Realistic Conversations: supports single- and multi-person generation
- 👥 Interactive Character Control: direct virtual humans via prompts
- 🎤 Generalization Performance: supports generation of cartoon characters and singing
- 📺 Resolution Flexibility: 480p & 720p output at arbitrary aspect ratios
- ⏱️ Long Video Generation: supports videos up to 15 seconds
🔥 Latest News
- Dec 16, 2025: 🚀 We released LongCat-Video-Avatar, a unified model for expressive, highly dynamic audio-driven character animation. It natively supports Audio-Text-to-Video, Audio-Text-Image-to-Video, and Video Continuation, and handles both single-stream and multi-stream audio inputs. The release includes the Technical Report, code, model weights, and project page.
- Aug 19, 2025: 🔥🔥 We released InfiniteTalk, a new paradigm for video dubbing. InfiniteTalk supports infinite-length video-to-video and image-to-video generation. Models, code, Gradio, and ComfyUI support have all been released.
- July 11, 2025: 🔥🔥 MultiTalk supports INT8 quantization and SageAttention2.2, and updates the CFG strategy (2 NFE per step) for FusionX LoRA.
- July 01, 2025: 🔥🔥 MultiTalk supports audio input via TTS, FusioniX and lightx2v LoRA acceleration (requiring only 4~8 steps), and Gradio.
- June 14, 2025: 🔥🔥 We released MultiTalk with support for multi-GPU inference, TeaCache acceleration, APG, and low-VRAM inference (enabling 480P video generation on a single RTX 4090). APG alleviates color error accumulation in long video generation; TeaCache speeds up inference by approximately 2~3x.
- June 9, 2025: 🔥🔥 We released the weights and inference code of MultiTalk.
- May 29, 2025: We released the technique report of MultiTalk.
- May 29, 2025: We released the project page of MultiTalk.
🌐 Community Works
- Wan2GP: thanks to deepbeepmeep for Wan2GP, which enables MultiTalk on very low-VRAM hardware (8 GB of VRAM) and combines it with the capabilities of VACE.
- Replicate: thanks to zsxkib for pushing MultiTalk to the Replicate platform; try it! See cog-MultiTalk for details.
- Gradio Demo: thanks to fffiloni for developing this Gradio demo on Hugging Face. See the issue for details.
- ComfyUI: thanks to kijai for integrating MultiTalk into ComfyUI-WanVideoWrapper. Rudra found that MultiTalk can be combined with Wanx T2V and VACE; see the issue.
- Google Colab: an example of inference on an A100, provided by Braffolk.
📑 Todo List
- [x] Release the technical report
- [x] Inference
- [x] Checkpoints
- [x] Multi-GPU Inference
- [ ] Inference acceleration
- [x] TeaCache
- [x] int8 quantization
- [ ] LCM distillation
- [ ] Sparse Attention
- [x] Run with very low VRAM
- [x] TTS integration
- [x] Gradio demo
- [ ] ComfyUI
- [ ] 1.3B model
Quick Start
🛠️ Installation
1. Create a conda environment and install PyTorch and xformers
```bash
conda create -n multitalk python=3.10
conda activate multitalk
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U xformers==0.0.28 --index-url https://download.pytorch.org/whl/cu121
```
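As a quick sanity check after the installs above, a small stdlib-only snippet (ours, not part of the repo) can confirm the packages are importable:

```python
import importlib.util

def check_env(required=("torch", "torchvision", "torchaudio", "xformers")):
    """Map each required package name to whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in required}

# On a working setup, every value should be True
print(check_env())
```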
2. Flash-attn installation:
```bash
pip install misaki[en]
pip install ninja
pip install psutil
pip install packaging
pip install flash_attn==2.7.4.post1
```
3. Other dependencies
```bash
pip install -r requirements.txt
conda install -c conda-forge librosa
```
4. FFmpeg installation
```bash
conda install -c conda-forge ffmpeg
```
or
```bash
sudo yum install ffmpeg ffmpeg-devel
```
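To confirm the FFmpeg binary is visible from Python, a minimal stdlib check (the `has_binary` helper is ours, not part of the repo):

```python
import shutil

def has_binary(name: str) -> bool:
    """True if an executable with this name is on the PATH."""
    return shutil.which(name) is not None

print("ffmpeg found:", has_binary("ffmpeg"))
```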
🧱 Model Preparation
1. Model Download
| Models | Download Link | Notes |
|--------|---------------|-------|
| Wan2.1-I2V-14B-480P | 🤗 Huggingface | Base model |
| chinese-wav2vec2-base | 🤗 Huggingface | Audio encoder |
| Kokoro-82M | 🤗 Huggingface | TTS weights |
| MeiGen-MultiTalk | 🤗 Huggingface | Our audio condition weights |
Download models using huggingface-cli:
```bash
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download hexgrad/Kokoro-82M --local-dir ./weights/Kokoro-82M
huggingface-cli download MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk
```
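The same downloads can also be scripted with the huggingface_hub Python API instead of the CLI. This is a sketch, not part of the official repo: `snapshot_download` is the real huggingface_hub call, while `plan_downloads`, `download_all`, and the `dry_run` flag are our own helpers:

```python
def plan_downloads():
    """List (repo_id, local_dir) pairs mirroring the CLI commands above."""
    return [
        ("Wan-AI/Wan2.1-I2V-14B-480P", "./weights/Wan2.1-I2V-14B-480P"),
        ("TencentGameMate/chinese-wav2vec2-base", "./weights/chinese-wav2vec2-base"),
        ("hexgrad/Kokoro-82M", "./weights/Kokoro-82M"),
        ("MeiGen-AI/MeiGen-MultiTalk", "./weights/MeiGen-MultiTalk"),
    ]

def download_all(dry_run=True):
    """Download every repo, or just return the plan when dry_run=True."""
    plan = plan_downloads()
    if not dry_run:
        # Deferred import so the plan can be inspected without huggingface_hub installed
        from huggingface_hub import snapshot_download
        for repo_id, local_dir in plan:
            snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return plan

print(download_all(dry_run=True))
```

Note that the CLI sequence above additionally fetches `model.safetensors` for chinese-wav2vec2-base from revision `refs/pr/1`; with the API, that single extra file would use `hf_hub_download` with a `revision` argument.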
