# StreamSpeech
Authors: Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng*
Code for ACL 2024 paper "StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning".
<p align="center" width="100%"> <img src="./assets/streamspeech.png" alt="StreamSpeech" style="width: 70%; min-width: 300px; display: block; margin: auto;"> </p> <p align="center"> 🎧 Listen to <a href="https://ictnlp.github.io/StreamSpeech-site/">StreamSpeech's translated speech</a> 🎧 </p>

💡Highlight:
- StreamSpeech achieves SOTA performance on both offline and simultaneous speech-to-speech translation.
- StreamSpeech performs streaming ASR, simultaneous speech-to-text translation and simultaneous speech-to-speech translation via an "All in One" seamless model.
- StreamSpeech can present intermediate results (i.e., ASR or translation results) during simultaneous translation, offering a more comprehensive low-latency communication experience.
## 🔥 News

- **[2025.06.17]** We are excited to extend the "All-in-One" feature of StreamSpeech to more general multimodal interaction by developing Stream-Omni. 👉 Refer to the paper, code & demo, and model for more details.
  - Stream-Omni is a GPT-4o-like language-vision-speech chatbot that simultaneously supports interaction across any combination of text, vision, and speech modalities.
  - Stream-Omni can simultaneously produce intermediate textual results (e.g., ASR transcriptions and model responses) during speech interaction, like the advanced voice mode of GPT-4o.
- **[2024.06.17]** Web GUI demo added; you can now experience StreamSpeech in your local browser.
- **[2024.06.05]** Paper, code, models, and demo of StreamSpeech are available!
## ⭐ Features

### Support 8 Tasks
- Offline: Speech Recognition (ASR)✅, Speech-to-Text Translation (S2TT)✅, Speech-to-Speech Translation (S2ST)✅, Speech Synthesis (TTS)✅
- Simultaneous: Streaming ASR✅, Simultaneous S2TT✅, Simultaneous S2ST✅, Real-time TTS✅ under any latency (with one model)
### GUI Demo
https://github.com/ictnlp/StreamSpeech/assets/34680227/4d9bdabf-af66-4320-ae7d-0f23e721cd71
<p align="center"> Simultaneously provide ASR, translation, and synthesis results via a seamless model </p>

### Case
Speech Input: example/wavs/common_voice_fr_17301936.mp3
Transcription (ground truth): jai donc lexpérience des années passées jen dirai un mot tout à lheure
Translation (ground truth): i therefore have the experience of the passed years i'll say a few words about that later
| StreamSpeech | Simultaneous | Offline |
| --- | --- | --- |
| Speech Recognition | jai donc expérience des années passé jen dirairai un mot tout à lheure | jai donc lexpérience des années passé jen dirairai un mot tout à lheure |
| Speech-to-Text Translation | i therefore have an experience of last years i will tell a word later | so i have the experience in the past years i'll say a word later |
| Speech-to-Speech Translation | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/ed41ba13-353b-489b-acfa-85563d0cc2cb' width="30%"/> | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/ca482ba6-76da-4619-9dfd-24aa2eb3339a' width="30%"/> |
| Text-to-Speech Synthesis (incrementally synthesize speech word by word) | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/294f1310-eace-4914-be30-5cd798e8592e' width="30%"/> | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/52854163-7fc5-4622-a5a6-c133cbd99e58' width="30%"/> |
## ⚙ Requirements
- Python == 3.10, PyTorch == 2.0.1; install fairseq & SimulEval from the bundled directories:

```shell
cd fairseq
pip install --editable ./ --no-build-isolation
cd ../SimulEval
pip install --editable ./
```
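To confirm the editable installs succeeded, a quick sanity check can be run. This is a sketch of our own; it assumes `fairseq` and `simuleval` are the import names of the two packages:

```python
import importlib


def check_install(packages=("fairseq", "simuleval")) -> list[str]:
    """Return the names of required packages that fail to import."""
    missing = []
    for name in packages:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

An empty return value means both packages are importable from the current environment.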
## 🚀 Quick Start

### 1. Model Download

#### (1) StreamSpeech Models
| Language | UnitY | StreamSpeech (offline) | StreamSpeech (simultaneous) |
| --- | --- | --- | --- |
| Fr-En | unity.fr-en.pt [Huggingface] [Baidu] | streamspeech.offline.fr-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.fr-en.pt [Huggingface] [Baidu] |
| Es-En | unity.es-en.pt [Huggingface] [Baidu] | streamspeech.offline.es-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.es-en.pt [Huggingface] [Baidu] |
| De-En | unity.de-en.pt [Huggingface] [Baidu] | streamspeech.offline.de-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.de-en.pt [Huggingface] [Baidu] |
#### (2) Unit-based HiFi-GAN Vocoder
| Unit config | Unit size | Vocoder language | Dataset | Model |
| --- | --- | --- | --- | --- |
| mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config |
### 2. Prepare Data and Config (only for test/inference)

#### (1) Config Files
Replace `/data/zhangshaolei/StreamSpeech` in the files `configs/fr-en/config_gcmvn.yaml` and `configs/fr-en/config_mtl_asr_st_ctcst.yaml` with the local path of your StreamSpeech repo.
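If you prefer not to edit the YAML files by hand, the substitution can be scripted. This is a hypothetical helper (not part of the repo) that rewrites the placeholder path in any list of config files:

```python
from pathlib import Path

# Placeholder path that ships in the provided config files.
PLACEHOLDER = "/data/zhangshaolei/StreamSpeech"


def point_configs_at(repo_root: str, config_paths: list[str]) -> None:
    """Replace the placeholder path in each config file with repo_root."""
    for path in map(Path, config_paths):
        text = path.read_text(encoding="utf-8")
        path.write_text(text.replace(PLACEHOLDER, repo_root), encoding="utf-8")
```

For example, `point_configs_at("/home/me/StreamSpeech", ["configs/fr-en/config_gcmvn.yaml", "configs/fr-en/config_mtl_asr_st_ctcst.yaml"])` updates both files in place.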
#### (2) Test Data

Prepare the test data following the SimulEval format. `example/` provides an example:

- `wav_list.txt`: each line records the path of a source speech file.
- `target.txt`: each line records the reference text, e.g., the target translation or source transcription (used to compute the metrics).
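As an illustration of the expected layout, here is a hypothetical helper (not shipped with StreamSpeech) that writes a matching `wav_list.txt` / `target.txt` pair from a directory of audio files and a dict of reference texts; the two files must stay line-aligned:

```python
from pathlib import Path


def write_simuleval_inputs(wav_dir: str, references: dict[str, str], out_dir: str) -> int:
    """Write wav_list.txt (one audio path per line) and target.txt
    (the matching reference text per line), in the same order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wavs = sorted(Path(wav_dir).glob("*.mp3")) + sorted(Path(wav_dir).glob("*.wav"))
    with open(out / "wav_list.txt", "w") as wf, open(out / "target.txt", "w") as tf:
        for wav in wavs:
            wf.write(str(wav.resolve()) + "\n")
            tf.write(references[wav.stem] + "\n")
    return len(wavs)
```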
### 3. Inference with SimulEval

Run the following scripts to run inference with StreamSpeech on streaming ASR, simultaneous S2TT, and simultaneous S2ST.
- `--source-segment-size`: set the chunk size (in milliseconds) to any value to control the latency.
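To see what this flag controls, here is a minimal sketch of fixed-size segmentation (our own illustration, not SimulEval's implementation): the source audio is delivered to the agent in chunks of `--source-segment-size` milliseconds, so a smaller value means the model sees, and can start translating, the audio sooner.

```python
def segment_stream(samples: list[float], sample_rate: int, segment_ms: int):
    """Yield successive fixed-size chunks of audio, as a streaming agent
    would receive them; smaller segment_ms -> lower latency, more updates."""
    step = int(sample_rate * segment_ms / 1000)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]
```

For example, one second of 16 kHz audio with a 320 ms segment size arrives as three full 5120-sample chunks plus one 640-sample remainder.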