
StreamSpeech

StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.


Authors: Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng*

Code for ACL 2024 paper "StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning".

<p align="center" width="100%"> <img src="./assets/streamspeech.png" alt="StreamSpeech" style="width: 70%; min-width: 300px; display: block; margin: auto;"> </p> <p align="center"> 🎧 Listen to <a href="https://ictnlp.github.io/StreamSpeech-site/">StreamSpeech's translated speech</a> 🎧 </p>

💡Highlight:

  1. StreamSpeech achieves SOTA performance on both offline and simultaneous speech-to-speech translation.
  2. StreamSpeech performs streaming ASR, simultaneous speech-to-text translation and simultaneous speech-to-speech translation via an "All in One" seamless model.
  3. StreamSpeech can present intermediate results (i.e., ASR or translation results) during simultaneous translation, offering a more comprehensive low-latency communication experience.

🔥News

  • [2025.06.17] We are excited to extend the "All-in-One" feature of StreamSpeech to more general multimodal interactions by developing Stream-Omni. 👉 Refer to the paper, code & demo, and model for more details.

    • Stream-Omni is a GPT-4o-like language-vision-speech chatbot that simultaneously supports interactions across any combination of text, vision, and speech modalities.
    • Stream-Omni can simultaneously produce intermediate textual results (e.g., ASR transcriptions and model responses) during speech interactions, like the advanced voice service of GPT-4o.
  • [2024.06.17] Added a Web GUI demo; you can now try StreamSpeech in your local browser.

  • [2024.06.05] Paper, code, models and demo of StreamSpeech are available!

⭐Features

Support 8 Tasks

  • Offline: Speech Recognition (ASR)✅, Speech-to-Text Translation (S2TT)✅, Speech-to-Speech Translation (S2ST)✅, Speech Synthesis (TTS)✅
  • Simultaneous: Streaming ASR✅, Simultaneous S2TT✅, Simultaneous S2ST✅, Real-time TTS✅ under any latency (with one model)

GUI Demo

https://github.com/ictnlp/StreamSpeech/assets/34680227/4d9bdabf-af66-4320-ae7d-0f23e721cd71

<p align="center"> Simultaneously provide ASR, translation, and synthesis results via a seamless model </p>

Case

Speech Input: example/wavs/common_voice_fr_17301936.mp3

Transcription (ground truth): jai donc lexpérience des années passées jen dirai un mot tout à lheure

Translation (ground truth): i therefore have the experience of the passed years i'll say a few words about that later

| StreamSpeech | Simultaneous | Offline |
| ------------ | ------------ | ------- |
| Speech Recognition | jai donc expérience des années passé jen dirairai un mot tout à lheure | jai donc lexpérience des années passé jen dirairai un mot tout à lheure |
| Speech-to-Text Translation | i therefore have an experience of last years i will tell a word later | so i have the experience in the past years i'll say a word later |
| Speech-to-Speech Translation | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/ed41ba13-353b-489b-acfa-85563d0cc2cb' width="30%"/> | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/ca482ba6-76da-4619-9dfd-24aa2eb3339a' width="30%"/> |
| Text-to-Speech Synthesis (incrementally synthesize speech word by word) | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/294f1310-eace-4914-be30-5cd798e8592e' width="30%"/> | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/52854163-7fc5-4622-a5a6-c133cbd99e58' width="30%"/> |

⚙Requirements

  • Python == 3.10, PyTorch == 2.0.1. Install fairseq and SimulEval:

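    # Install fairseq in editable mode.
    cd fairseq
    pip install --editable ./ --no-build-isolation
    # Return to the repo root (assumes fairseq/ and SimulEval/ are sibling directories).
    cd ..
    cd SimulEval
    pip install --editable ./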
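
The pinned versions above can be checked before installing anything else. The following is a minimal sketch, not part of the StreamSpeech repo: the helper names `python_ok`, `torch_ok` and `installed_version` are hypothetical, and ignoring local build tags such as `+cu118` is an assumption about how strictly the PyTorch pin should be read.

```python
import sys
from importlib import metadata

# Versions stated in the Requirements list above.
REQUIRED_PYTHON = (3, 10)
REQUIRED_TORCH = "2.0.1"


def python_ok(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """Return True when the interpreter's major.minor matches `required`."""
    return tuple(version_info[:2]) == tuple(required)


def torch_ok(installed, required=REQUIRED_TORCH):
    """Return True when an installed version string matches `required`.

    Local build tags such as "+cu118" are ignored (an assumption).
    """
    return installed is not None and installed.split("+")[0] == required


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python 3.10:", python_ok())
    print("PyTorch 2.0.1:", torch_ok(installed_version("torch")))
```

Running the file as a script prints one line per requirement, so a mismatch is visible before the fairseq build starts.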
    

🚀Quick Start

1. Model Download

(1) StreamSpeech Models

| Language | UnitY | StreamSpeech (offline) | StreamSpeech (simultaneous) |
| -------- | ----- | ---------------------- | --------------------------- |
| Fr-En | unity.fr-en.pt [Huggingface] [Baidu] | streamspeech.offline.fr-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.fr-en.pt [Huggingface] [Baidu] |
| Es-En | unity.es-en.pt [Huggingface] [Baidu] | streamspeech.offline.es-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.es-en.pt [Huggingface] [Baidu] |
| De-En | unity.de-en.pt [Huggingface] [Baidu] | streamspeech.offline.de-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.de-en.pt [Huggingface] [Baidu] |

(2) Unit-based HiFi-GAN Vocoder

| Unit config | Unit size | Vocoder language | Dataset | Model |
| ----------------- | --------- | ---------------- | -------- | ------------ |
| mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config |

2. Prepare Data and Config (only for test/inference)

(1) Config Files

Replace /data/zhangshaolei/StreamSpeech in configs/fr-en/config_gcmvn.yaml and configs/fr-en/config_mtl_asr_st_ctcst.yaml with the local path of your StreamSpeech repo.
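
The replacement can also be scripted instead of edited by hand. The sketch below is one possible way to do it, not a script shipped with the repo: `rewrite_root` and `patch_configs` are hypothetical helper names, and the code assumes the placeholder path appears as a plain string in both files.

```python
from pathlib import Path

# Placeholder path that ships in the config files (from the step above).
OLD_ROOT = "/data/zhangshaolei/StreamSpeech"

# The two Fr-En config files named in the step above.
CONFIG_FILES = [
    "configs/fr-en/config_gcmvn.yaml",
    "configs/fr-en/config_mtl_asr_st_ctcst.yaml",
]


def rewrite_root(text, new_root, old_root=OLD_ROOT):
    """Return `text` with every occurrence of `old_root` replaced by `new_root`."""
    return text.replace(old_root, new_root)


def patch_configs(repo_root, config_files=CONFIG_FILES):
    """Rewrite the placeholder path in each config file under `repo_root`.

    Returns the list of relative paths that were actually changed.
    """
    repo_root = Path(repo_root).resolve()
    changed = []
    for rel in config_files:
        path = repo_root / rel
        original = path.read_text(encoding="utf-8")
        updated = rewrite_root(original, str(repo_root))
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(rel)
    return changed
```

Calling `patch_configs(".")` from the root of the cloned repo rewrites both files in place. A second call returns an empty list, since the placeholder is already gone.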

(2) Test Data

Prepare test data following the SimulEval format. The example/ directory provides a sample:

  • wav_list.txt: Each line records the path of a source speech.
  • target.txt: Each line records the reference text, e.g., target translation or source transcription (used to calculate the metrics).
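
The two files are aligned line by line, so it is convenient to write them together. The following is a minimal sketch under that assumption; `write_simuleval_inputs` is a hypothetical helper, not part of SimulEval or StreamSpeech.

```python
from pathlib import Path


def write_simuleval_inputs(pairs, out_dir):
    """Write wav_list.txt and target.txt so that line i of each file matches.

    `pairs` is an iterable of (wav_path, reference_text) tuples.
    Returns the paths of the two files that were written.
    """
    pairs = list(pairs)
    for wav, ref in pairs:
        # A newline inside an entry would break the line-by-line alignment.
        if "\n" in str(wav) or "\n" in ref:
            raise ValueError("paths and references must be single-line")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    wav_list = out_dir / "wav_list.txt"
    target = out_dir / "target.txt"
    wav_list.write_text("".join(f"{wav}\n" for wav, _ in pairs), encoding="utf-8")
    target.write_text("".join(f"{ref}\n" for _, ref in pairs), encoding="utf-8")
    return wav_list, target
```

For the case shown earlier, the single pair would be `("example/wavs/common_voice_fr_17301936.mp3", "i therefore have the experience of the passed years i'll say a few words about that later")` when evaluating translation. For ASR, the reference would be the source transcription instead.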

3. Inference with SimulEval

Run these scripts to perform inference with StreamSpeech on streaming ASR, simultaneous S2TT and simultaneous S2ST.

--source-segment-size: sets the chunk size in milliseconds; it can be set to any value to control the latency.
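
To see how the chunk size relates to latency, the sketch below splits one utterance into fixed-size source chunks. It only illustrates the arithmetic and is not SimulEval's implementation; `chunk_boundaries_ms` is a hypothetical helper.

```python
import math


def chunk_boundaries_ms(duration_ms, segment_size_ms):
    """Return (start, end) times in ms of each source chunk for one utterance."""
    if segment_size_ms <= 0:
        raise ValueError("segment_size_ms must be positive")
    n_chunks = math.ceil(duration_ms / segment_size_ms)
    return [
        # The last chunk is clipped to the end of the utterance.
        (i * segment_size_ms, min((i + 1) * segment_size_ms, duration_ms))
        for i in range(n_chunks)
    ]


# A 1000 ms utterance read in 320 ms chunks:
print(chunk_boundaries_ms(1000, 320))
# → [(0, 320), (320, 640), (640, 960), (960, 1000)]
```

A smaller chunk size means the model receives new source audio more often, which gives it more opportunities to emit output early. A larger chunk size approaches offline behavior.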

<details> <summary>Simultaneo
