
<div align="center"> <p align="center"> <img src="assets/logo2.jpeg" alt="MultiTalk" width="240"/> </p> <h1>Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation (NeurIPS 2025)</h1>

Zhe Kong* · Feng Gao* · Yong Zhang<sup>†</sup> · Zhuoliang Kang · Xiaoming Wei · Xunliang Cai

Guanying Chen · Wenhan Luo<sup>†</sup>

<sup>*</sup>Equal Contribution <sup>†</sup>Corresponding Authors

<a href='https://meigen-ai.github.io/multi-talk/'><img src='https://img.shields.io/badge/Project-Page-green'></a> <a href='https://arxiv.org/abs/2505.22647'><img src='https://img.shields.io/badge/Technique-Report-red'></a> <a href='https://huggingface.co/MeiGen-AI/MeiGen-MultiTalk'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>

</div>

TL;DR: MultiTalk is an audio-driven framework for multi-person conversational video generation. It enables video creation of multi-person conversations 💬, singing 🎤, interaction control 👬, and cartoon characters 🙊.

<p align="center"> <img src="assets/pipe.png"> </p>

Video Demos

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/e55952e6-e1b2-44a5-9887-a89307a378da" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/f0396c19-d459-42aa-9d78-34fdea10de18" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/3576fd04-3e5f-4933-ac7b-1c4e6a601379" width="320" controls loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/5589056e-3202-442d-a62a-2cad7a7ecb19" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/554bfbe7-0090-492c-94be-329f5e39e175" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/9e961f35-9413-4846-a806-8186d54061da" width="320" controls loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/342595ab-cf75-4872-8182-f20fe8c95611" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/6476f9f0-35e0-4484-91a4-8aa646aa994a" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/d8fc8e94-0cba-4c25-9f3a-a8d7e0a785e1" width="320" controls loop></video> </td> </tr> </table>

✨ Key Features

We propose MultiTalk, a novel framework for audio-driven multi-person conversational video generation. Given a multi-stream audio input, a reference image, and a prompt, MultiTalk generates a video whose interactions follow the prompt, with consistent lip motions aligned to the audio.

  • 💬 Realistic Conversations - Support single- & multi-person generation
  • 👥 Interactive Character Control - Direct virtual humans via prompts
  • 🎤 Generalization Performance - Support the generation of cartoon characters and singing
  • 📺 Resolution Flexibility - 480p & 720p output at arbitrary aspect ratios
  • ⏱️ Long Video Generation - Support video generation up to 15 seconds
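The inputs listed above (multi-stream audio, a reference image, and a prompt) are typically bundled into a single conditioning specification. A minimal sketch of what such an input could look like as JSON; the field names (`cond_image`, `cond_audio`) are illustrative assumptions, not necessarily the project's exact schema:

```python
import json

# Hypothetical input spec for a two-person conversation clip.
# Field names are illustrative; consult the repo's example JSON
# files for the actual schema.
spec = {
    "prompt": "Two people chat at a cafe table, facing each other.",
    "cond_image": "examples/ref_two_people.png",   # reference image
    "cond_audio": {                                # one stream per speaker
        "person1": "examples/speaker1.wav",
        "person2": "examples/speaker2.wav",
    },
}

# Serialize to the JSON file an inference script would consume,
# then read it back to confirm it round-trips.
text = json.dumps(spec, indent=2)
loaded = json.loads(text)
print(len(loaded["cond_audio"]))  # number of audio streams
```

Each entry under `cond_audio` corresponds to one speaker's audio stream, which is what makes the multi-person setting "multi-stream".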

🔥 Latest News

  • Dec 16, 2025: 🚀 We are excited to announce the release of LongCat-Video-Avatar, a unified model that delivers expressive and highly dynamic audio-driven character animation, supporting native tasks including Audio-Text-to-Video, Audio-Text-Image-to-Video, and Video Continuation with seamless compatibility for both single-stream and multi-stream audio inputs. The release includes our Technical Report, code, model weights, and project page.
  • Aug 19, 2025: 🔥🔥 We released InfiniteTalk, a novel new paradigm for video dubbing. InfiniteTalk supports infinite-length video-to-video generation and image-to-video generation. Models, code, gradio, and comfyui have all been released.
  • July 11, 2025: 🔥🔥 MultiTalk supports INT8 quantization and SageAttention2.2, and updates the CFG strategy (2 NFE per step) for FusionX LoRA.
  • July 01, 2025: 🔥🔥 MultiTalk supports input audios with TTS, FusioniX and lightx2v LoRA acceleration (requires only 4~8 steps), and Gradio.
  • June 14, 2025: 🔥🔥 We release MultiTalk with support for multi-GPU inference, teacache acceleration, APG and low-VRAM inference (enabling 480P video generation on a single RTX 4090). APG is used to alleviate the color error accumulation in long video generation. TeaCache is capable of increasing speed by approximately 2~3x.
  • June 9, 2025: 🔥🔥 We released the weights and inference code of MultiTalk.
  • May 29, 2025: We released the Technique-Report of MultiTalk.
  • May 29, 2025: We released the project page of MultiTalk.

🌐 Community Works

  • Wan2GP: thanks to deepbeepmeep for providing Wan2GP, which enables MultiTalk on very low-VRAM hardware (8 GB of VRAM) and combines it with the capabilities of Vace.
  • Replicate: thanks to zsxkib for pushing MultiTalk to the Replicate platform; try it! Please refer to cog-MultiTalk for details.
  • Gradio Demo: thanks to fffiloni for developing this Gradio demo on Hugging Face. Please refer to the issue for details.
  • ComfyUI: thanks to kijai for integrating MultiTalk into ComfyUI-WanVideoWrapper. Rudra found something interesting: MultiTalk can be combined with Wanx T2V and VACE, as described in the issue.
  • Google Colab: an example for inference on an A100, provided by Braffolk.

📑 Todo List

  • [x] Release the technical report
  • [x] Inference
  • [x] Checkpoints
  • [x] Multi-GPU Inference
  • [ ] Inference acceleration
    • [x] TeaCache
    • [x] int8 quantization
    • [ ] LCM distillation
    • [ ] Sparse Attention
  • [x] Run with very low VRAM
  • [x] TTS integration
  • [x] Gradio demo
  • [ ] ComfyUI
  • [ ] 1.3B model

Quick Start

🛠️Installation

1. Create a conda environment and install PyTorch and xformers

conda create -n multitalk python=3.10
conda activate multitalk
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U xformers==0.0.28 --index-url https://download.pytorch.org/whl/cu121
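The pins above are exact versions; a tiny stdlib-only helper (illustrative, not part of the repo) to split a `pkg==version` pin, e.g. for comparing against `torch.__version__` after installation:

```python
def parse_pin(requirement: str) -> tuple[str, str]:
    """Split an exact pip pin like 'torch==2.4.1' into (name, version)."""
    name, _, version = requirement.partition("==")
    return name.strip(), version.strip()

print(parse_pin("torch==2.4.1"))      # ('torch', '2.4.1')
print(parse_pin("xformers==0.0.28"))  # ('xformers', '0.0.28')
```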

2. Flash-attn installation:

pip install misaki[en]
pip install ninja 
pip install psutil 
pip install packaging 
pip install flash_attn==2.7.4.post1

3. Other dependencies

pip install -r requirements.txt
conda install -c conda-forge librosa

4. FFmpeg installation

conda install -c conda-forge ffmpeg

or

sudo yum install ffmpeg ffmpeg-devel
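Either route should leave an `ffmpeg` binary on your `PATH`; a quick stdlib-only check (illustrative) before running inference:

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg binary is discoverable on PATH."""
    return shutil.which("ffmpeg") is not None

print(ffmpeg_available())
```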

🧱Model Preparation

1. Model Download

| Models | Download Link | Notes |
| ------------- | ------------- | ------------- |
| Wan2.1-I2V-14B-480P | 🤗 Huggingface | Base model |
| chinese-wav2vec2-base | 🤗 Huggingface | Audio encoder |
| Kokoro-82M | 🤗 Huggingface | TTS weights |
| MeiGen-MultiTalk | 🤗 Huggingface | Our audio condition weights |

Download models using huggingface-cli:

huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download hexgrad/Kokoro-82M --local-dir ./weights/Kokoro-82M
huggingface-cli download MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk
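The commands above follow a simple convention: each repo downloads into `./weights/` under the repo's short name. A small helper (illustrative, not part of the repo) that derives the `--local-dir` target from a repo id:

```python
def local_dir(repo_id: str, root: str = "./weights") -> str:
    """Map a Hugging Face repo id (owner/name) to the local download
    directory used by the huggingface-cli commands above."""
    return f"{root}/{repo_id.split('/')[-1]}"

for repo in [
    "Wan-AI/Wan2.1-I2V-14B-480P",
    "TencentGameMate/chinese-wav2vec2-base",
    "hexgrad/Kokoro-82M",
    "MeiGen-AI/MeiGen-MultiTalk",
]:
    print(local_dir(repo))  # e.g. ./weights/Wan2.1-I2V-14B-480P
```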