
WavChat

A Survey of Spoken Dialogue Models (60 pages)


<div align='center'> <img src="https://cdn.jsdelivr.net/gh/MYJOKERML/imgbed/matebook14/image-20241111160012489.png" alt="image-20241111160012489" style="zoom: 30%;" /> </div>

🚀Quick Start

  1. Introduction
  2. Overall
  3. Representations of Spoken Dialogue Models
  4. Training Paradigm of Spoken Dialogue Models
  5. Streaming, Duplex, and Interaction
  6. Training Resources and Evaluation
  7. Cite

🔥What's new

  • 2024.11.22: We release WavChat (a ~60-page survey of spoken dialogue models) on arXiv! 🎉
  • 2024.08.31: We release WavTokenizer on arXiv.

Introduction

This is the official repository for the paper WavChat: A Survey of Spoken Dialogue Models.

<div align='center'> <img src="https://cdn.jsdelivr.net/gh/MYJOKERML/imgbed/matebook14/image-20241112151419833.png" alt="img1-paper-list" style="zoom: 20%;" />

Figure 1: The timeline of existing spoken dialogue models in recent years.

</div>

Abstract

Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. In the broader context of multimodal models, the speech modality offers a direct interface for human-computer interaction, enabling natural communication between AI and users. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capabilities. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we first compile existing spoken dialogue systems in chronological order and categorize them into cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as speech representation, training paradigms, streaming, duplex communication, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems.
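The three-tier cascaded paradigm contrasted in the abstract can be sketched as a simple loop. The component names below (`transcribe`, `respond`, `synthesize`) are hypothetical stubs standing in for real ASR, LLM, and TTS systems, chosen only to make the data flow concrete:

```python
def transcribe(audio: bytes) -> str:
    """ASR tier: speech in, text out (stubbed for illustration)."""
    return audio.decode("utf-8")  # pretend the waveform is its own transcript

def respond(history: list[str], user_text: str) -> str:
    """LLM tier: text dialogue in, text reply out (stubbed)."""
    return f"You said: {user_text}"

def synthesize(text: str) -> bytes:
    """TTS tier: text in, speech out (stubbed)."""
    return text.encode("utf-8")

def cascaded_turn(history: list[str], user_audio: bytes) -> bytes:
    """One dialogue turn through the ASR -> LLM -> TTS cascade.

    This makes the survey's point visible: only text crosses the tier
    boundaries, so paralinguistic information (style, timbre, emotion)
    is discarded at the ASR step and cannot inform the reply.
    """
    user_text = transcribe(user_audio)
    reply_text = respond(history, user_text)
    history.extend([user_text, reply_text])
    return synthesize(reply_text)

reply = cascaded_turn([], b"hello")
# reply -> b"You said: hello"
```

End-to-end models, by contrast, pass speech representations through the whole stack rather than collapsing to text between stages, which is what enables the style-aware, low-latency behavior described above.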

Overall

1. The organization of this survey

<div align='center'> <img src="https://cdn.jsdelivr.net/gh/MYJOKERML/imgbed/matebook14/img4-framework-v7_00.png" alt="WavChat framework" style="zoom:16%;" />

Figure 2: Organization of this survey.

</div>

2. General classification of spoken dialogue systems

<div align='center'> <img src="https://cdn.jsdelivr.net/gh/MYJOKERML/imgbed/matebook14/image-20241112151732341.png" alt="img2-method" style="zoom: 30%;" />

Figure 3: A general overview of current spoken dialogue systems.

</div>

3. Key capabilities of spoken dialogue systems

<div align='center'> <img src="https://cdn.jsdelivr.net/gh/MYJOKERML/imgbed/matebook14/image-20241111165006367.png" alt="image-20241111165006367" style="zoom: 25%;" />

Figure 4: An overview of the spoken dialogue systems' nine ideal capabilities.

</div>

4. Publicly Available Spoken Dialogue Models

<div align='center'> <table> <thead> <tr> <th>Model</th> <th>URL</th> </tr> </thead> <tbody> <tr> <td>AudioGPT</td> <td><a href="https://github.com/AIGC-Audio/AudioGPT">https://github.com/AIGC-Audio/AudioGPT</a></td> </tr> <tr> <td>SpeechGPT</td> <td><a href="https://github.com/0nutation/SpeechGPT">https://github.com/0nutation/SpeechGPT</a></td> </tr> <tr> <td>Freeze-Omni</td> <td><a href="https://github.com/VITA-MLLM/Freeze-Omni">https://github.com/VITA-MLLM/Freeze-Omni</a></td> </tr> <tr> <td>Baichuan-Omni</td> <td><a href="https://github.com/westlake-baichuan-mllm/bc-omni">https://github.com/westlake-baichuan-mllm/bc-omni</a></td> </tr> <tr> <td>GLM-4-Voice</td> <td><a href="https://github.com/THUDM/GLM-4-Voice">https://github.com/THUDM/GLM-4-Voice</a></td> </tr> <tr> <td>Mini-Omni</td> <td><a href="https://github.com/gpt-omni/mini-omni">https://github.com/gpt-omni/mini-omni</a></td> </tr> <tr> <td>Mini-Omni2</td> <td><a href="https://github.com/gpt-omni/mini-omni2">https://github.com/gpt-omni/mini-omni2</a></td> </tr> <tr> <td>FunAudioLLM</td> <td><a href="https://github.com/FunAudioLLM">https://github.com/FunAudioLLM</a></td> </tr> <tr> <td>Qwen-Audio</td> <td><a href="https://github.com/QwenLM/Qwen-Audio">https://github.com/QwenLM/Qwen-Audio</a></td> </tr> <tr> <td>Qwen2-Audio</td> <td><a href="https://github.com/QwenLM/Qwen2-Audio">https://github.com/QwenLM/Qwen2-Audio</a></td> </tr> <tr> <td>LLaMA3.1</td> <td><a href="https://www.llama.com">https://www.llama.com</a></td> </tr> <tr> <td>Audio Flamingo</td> <td><a href="https://github.com/NVIDIA/audio-flamingo">https://github.com/NVIDIA/audio-flamingo</a></td> </tr> <tr> <td>Ultravox</td> <td><a href="https://github.com/fixie-ai/ultravox">https://github.com/fixie-ai/ultravox</a></td> </tr> <tr> <td>Spirit LM</td> <td><a href="https://github.com/facebookresearch/spiritlm">https://github.com/facebookresearch/spiritlm</a></td> </tr> <tr> <td>dGSLM</td> <td><a 
href="https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/dgslm">https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/dgslm</a></td> </tr> <tr> <td>Spoken-LLM</td> <td><a href="https://arxiv.org/abs/2305.11000">https://arxiv.org/abs/2305.11000</a></td> </tr> <tr> <td>LLaMA-Omni</td> <td><a href="https://github.com/ictnlp/LLaMA-Omni">https://github.com/ictnlp/LLaMA-Omni</a></td> </tr> <tr> <td>Moshi</td> <td><a href="https://github.com/kyutai-labs/moshi">https://github.com/kyutai-labs/moshi</a></td> </tr> <tr> <td>SALMONN</td> <td><a href="https://github.com/bytedance/SALMONN">https://github.com/bytedance/SALMONN</a></td> </tr> <tr> <td>LTU-AS</td> <td><a href="https://github.com/YuanGongND/ltu">https://github.com/YuanGongND/ltu</a></td> </tr> <tr> <td>VITA</td> <td><a href="https://github.com/VITA-MLLM/VITA">https://github.com/VITA-MLLM/VITA</a></td> </tr> <tr> <td>SpeechGPT-Gen</td> <td><a href="https://github.com/0nutation/SpeechGPT">https://github.com/0nutation/SpeechGPT</a></td> </tr> <tr> <td>WavLLM</td> <td><a href="https://github.com/microsoft/SpeechT5/tree/main/WavLLM">https://github.com/microsoft/SpeechT5/tree/main/WavLLM</a></td> </tr> <tr> <td>Westlake-Omni</td> <td><a href="https://github.com/xinchen-ai/Westlake-Omni">https://github.com/xinchen-ai/Westlake-Omni</a></td> </tr> <tr> <td>MooER-Omni</td> <td><a href="https://github.com/MooreThreads/MooER">https://github.com/MooreThreads/MooER</a></td> </tr> <tr> <td>Hertz-dev</td> <td><a href="https://github.com/Standard-Intelligence/hertz-dev">https://github.com/Standard-Intelligence/hertz-dev</a></td> </tr> <tr> <td>Fish-Agent</td> <td><a href="https://github.com/fishaudio/fish-speech">https://github.com/fishaudio/fish-speech</a></td> </tr> <tr> <td>SpeechGPT2</td> <td><a href="https://0nutation.github.io/SpeechGPT2.github.io/">https://0nutation.github.io/SpeechGPT2.github.io/</a></td> </tr> </tbody> </table>

Table 1: The list of publicly available spoken dialogue models and their URLs.

</div>

Representations of Spoken Dialogue Models

In the section Representations of Spoken Dialogue Models, we provide insights into how to represent the data in a spoken dialogue model for better understanding and generation of speech. The choice of representation method directly affects how well the model can understand and generate speech.
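As a toy illustration of the discrete (token-based) representation family this section surveys, the sketch below quantizes speech frames against a small codebook, turning a waveform into a sequence of token IDs that a language model could consume. The codebook, frame size, and signal are invented for illustration and are not taken from any specific tokenizer such as WavTokenizer:

```python
import math

def frame_signal(samples: list[float], frame_size: int) -> list[list[float]]:
    """Split a waveform into non-overlapping frames, dropping any remainder."""
    n = len(samples) - len(samples) % frame_size
    return [samples[i:i + frame_size] for i in range(0, n, frame_size)]

def nearest_code(frame: list[float], codebook: list[list[float]]) -> int:
    """Index of the codebook vector closest to the frame (L2 distance)."""
    def dist(a: list[float], b: list[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))

def tokenize(samples: list[float], codebook: list[list[float]],
             frame_size: int) -> list[int]:
    """Map a waveform to a sequence of discrete token IDs."""
    return [nearest_code(f, codebook) for f in frame_signal(samples, frame_size)]

def detokenize(tokens: list[int], codebook: list[list[float]]) -> list[float]:
    """Lossy reconstruction: concatenate the chosen codebook vectors."""
    return [x for t in tokens for x in codebook[t]]

# A toy 4-entry codebook of 2-sample "frames" (purely illustrative).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
signal = [0.9, 1.1, -0.1, 0.0, 0.8, -0.9]
tokens = tokenize(signal, codebook, frame_size=2)
# tokens -> [1, 0, 3]: the waveform is now a short sequence of discrete units
```

Real neural codecs learn the codebook jointly with an encoder and decoder and use far larger vocabularies, but the trade-off is the same one discussed in this section: discrete tokens are convenient for LLM-style modeling, while continuous representations preserve more acoustic detail.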
