
<div align="center"> <p align="center"> <img src="assets/logo2.jpeg" alt="MultiTalk" width="240"/> </p> <h1>Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation (NeurIPS 2025)</h1>

Zhe Kong* · Feng Gao* · Yong Zhang<sup>†</sup> · Zhuoliang Kang · Xiaoming Wei · Xunliang Cai

Guanying Chen · Wenhan Luo<sup>†</sup>

<sup>*</sup>Equal Contribution <sup>†</sup>Corresponding Authors

<a href='https://meigen-ai.github.io/multi-talk/'><img src='https://img.shields.io/badge/Project-Page-green'></a> <a href='https://arxiv.org/abs/2505.22647'><img src='https://img.shields.io/badge/Technique-Report-red'></a> <a href='https://huggingface.co/MeiGen-AI/MeiGen-MultiTalk'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>

</div>

TL;DR: MultiTalk is an audio-driven framework for multi-person conversational video generation. It enables video creation of multi-person conversations 💬, singing 🎤, interaction control 👬, and cartoon characters 🙊.

<p align="center"> <img src="assets/pipe.png"> </p>

Video Demos

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/e55952e6-e1b2-44a5-9887-a89307a378da" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/f0396c19-d459-42aa-9d78-34fdea10de18" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/3576fd04-3e5f-4933-ac7b-1c4e6a601379" width="320" controls loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/5589056e-3202-442d-a62a-2cad7a7ecb19" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/554bfbe7-0090-492c-94be-329f5e39e175" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/9e961f35-9413-4846-a806-8186d54061da" width="320" controls loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/342595ab-cf75-4872-8182-f20fe8c95611" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/6476f9f0-35e0-4484-91a4-8aa646aa994a" width="320" controls loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/d8fc8e94-0cba-4c25-9f3a-a8d7e0a785e1" width="320" controls loop></video> </td> </tr> </table>

✨ Key Features

We propose MultiTalk, a novel framework for audio-driven multi-person conversational video generation. Given a multi-stream audio input, a reference image, and a prompt, MultiTalk generates a video whose interactions follow the prompt, with consistent lip motions aligned to the audio.

  • 💬 Realistic Conversations - Support single- & multi-person generation
  • 👥 Interactive Character Control - Direct virtual humans via prompts
  • 🎤 Generalization Performance - Support the generation of cartoon characters and singing
  • 📺 Resolution Flexibility - 480p & 720p output at arbitrary aspect ratios
  • ⏱️ Long Video Generation - Support video generation up to 15 seconds
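The inputs listed above (multi-stream audio, a reference image, and a prompt) are typically bundled into a single conditioning specification. A minimal sketch of what such an input could look like as JSON; the field names (`cond_image`, `cond_audio`) are illustrative assumptions, not necessarily the project's exact schema:

```python
import json

# Hypothetical input spec for a two-person conversation clip.
# Field names are illustrative; consult the repo's example JSON
# files for the actual schema.
spec = {
    "prompt": "Two people chat at a cafe table, facing each other.",
    "cond_image": "examples/ref_two_people.png",   # reference image
    "cond_audio": {                                # one stream per speaker
        "person1": "examples/speaker1.wav",
        "person2": "examples/speaker2.wav",
    },
}

# Serialize to the JSON file an inference script would consume,
# then read it back to confirm it round-trips.
text = json.dumps(spec, indent=2)
loaded = json.loads(text)
print(len(loaded["cond_audio"]))  # number of audio streams
```

Each entry under `cond_audio` corresponds to one speaker's audio stream, which is what makes the multi-person setting "multi-stream".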

🔥 Latest News

  • Dec 16, 2025: 🚀 We are excited to announce the release of LongCat-Video-Avatar, a unified model that delivers expressive and highly dynamic audio-driven character animation, supporting native tasks including Audio-Text-to-Video, Audio-Text-Image-to-Video, and Video Continuation with seamless compatibility for both single-stream and multi-stream audio inputs. The release includes our Technical Report, code, model weights, and project page.
  • Aug 19, 2025: 🔥🔥 We released InfiniteTalk, a novel new paradigm for video dubbing. InfiniteTalk supports infinite-length video-to-video generation and image-to-video generation. Models, code, gradio, and comfyui have all been released.
  • July 11, 2025: 🔥🔥 MultiTalk supports INT8 quantization and SageAttention2.2, and updates the CFG strategy (2 NFE per step) for FusionX LoRA.
  • July 01, 2025: 🔥🔥 MultiTalk supports input audios with TTS, FusioniX and lightx2v LoRA acceleration (requires only 4~8 steps), and Gradio.
  • June 14, 2025: 🔥🔥 We release MultiTalk with support for multi-GPU inference, teacache acceleration, APG and low-VRAM inference (enabling 480P video generation on a single RTX 4090). APG is used to alleviate the color error accumulation in long video generation. TeaCache is capable of increasing speed by approximately 2~3x.
  • June 9, 2025: 🔥🔥 We released the weights and inference code of MultiTalk.
  • May 29, 2025: We released the Technique-Report of MultiTalk.
  • May 29, 2025: We released the project page of MultiTalk.

🌐 Community Works

  • Wan2GP: thanks to deepbeepmeep for providing Wan2GP, which enables MultiTalk on very low-VRAM hardware (8 GB of VRAM) and combines it with the capabilities of Vace.
  • Replicate: thanks to zsxkib for pushing MultiTalk to the Replicate platform; try it! Please refer to cog-MultiTalk for details.
  • Gradio Demo: thanks to fffiloni for developing this Gradio demo on Hugging Face. Please refer to the issue for details.
  • ComfyUI: thanks to kijai for integrating MultiTalk into ComfyUI-WanVideoWrapper. Rudra found something interesting: MultiTalk can be combined with Wanx T2V and VACE, as described in the issue.
  • Google Colab: an example for inference on an A100, provided by Braffolk.

📑 Todo List

  • [x] Release the technical report
  • [x] Inference
  • [x] Checkpoints
  • [x] Multi-GPU Inference
  • [ ] Inference acceleration
    • [x] TeaCache
    • [x] int8 quantization
    • [ ] LCM distillation
    • [ ] Sparse Attention
  • [x] Run with very low VRAM
  • [x] TTS integration
  • [x] Gradio demo
  • [ ] ComfyUI
  • [ ] 1.3B model

Quick Start

🛠️Installation

1. Create a conda environment and install PyTorch and xformers

conda create -n multitalk python=3.10
conda activate multitalk
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U xformers==0.0.28 --index-url https://download.pytorch.org/whl/cu121
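The pins above are exact versions; a tiny stdlib-only helper (illustrative, not part of the repo) to split a `pkg==version` pin, e.g. for comparing against `torch.__version__` after installation:

```python
def parse_pin(requirement: str) -> tuple[str, str]:
    """Split an exact pip pin like 'torch==2.4.1' into (name, version)."""
    name, _, version = requirement.partition("==")
    return name.strip(), version.strip()

print(parse_pin("torch==2.4.1"))      # ('torch', '2.4.1')
print(parse_pin("xformers==0.0.28"))  # ('xformers', '0.0.28')
```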

2. Flash-attn installation:

pip install misaki[en]
pip install ninja 
pip install psutil 
pip install packaging 
pip install flash_attn==2.7.4.post1

3. Other dependencies

pip install -r requirements.txt
conda install -c conda-forge librosa

4. FFmpeg installation

conda install -c conda-forge ffmpeg

or

sudo yum install ffmpeg ffmpeg-devel
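Either route should leave an `ffmpeg` binary on your `PATH`; a quick stdlib-only check (illustrative) before running inference:

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg binary is discoverable on PATH."""
    return shutil.which("ffmpeg") is not None

print(ffmpeg_available())
```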

🧱Model Preparation

1. Model Download

| Models | Download Link | Notes |
| ------------- | ------------- | ------------- |
| Wan2.1-I2V-14B-480P | 🤗 Huggingface | Base model |
| chinese-wav2vec2-base | 🤗 Huggingface | Audio encoder |
| Kokoro-82M | 🤗 Huggingface | TTS weights |
| MeiGen-MultiTalk | 🤗 Huggingface | Our audio condition weights |

Download models using huggingface-cli:

huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download hexgrad/Kokoro-82M --local-dir ./weights/Kokoro-82M
huggingface-cli download MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk
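The commands above follow a simple convention: each repo downloads into `./weights/` under the repo's short name. A small helper (illustrative, not part of the repo) that derives the `--local-dir` target from a repo id:

```python
def local_dir(repo_id: str, root: str = "./weights") -> str:
    """Map a Hugging Face repo id (owner/name) to the local download
    directory used by the huggingface-cli commands above."""
    return f"{root}/{repo_id.split('/')[-1]}"

for repo in [
    "Wan-AI/Wan2.1-I2V-14B-480P",
    "TencentGameMate/chinese-wav2vec2-base",
    "hexgrad/Kokoro-82M",
    "MeiGen-AI/MeiGen-MultiTalk",
]:
    print(local_dir(repo))  # e.g. ./weights/Wan2.1-I2V-14B-480P
```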