# StreamSpeech
Authors: Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng*
Code for ACL 2024 paper "StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning".
<p align="center" width="100%"> <img src="./assets/streamspeech.png" alt="StreamSpeech" style="width: 70%; min-width: 300px; display: block; margin: auto;"> </p> <p align="center"> 🎧 Listen to <a href="https://ictnlp.github.io/StreamSpeech-site/">StreamSpeech's translated speech</a> 🎧 </p>

💡Highlight:
- StreamSpeech achieves SOTA performance on both offline and simultaneous speech-to-speech translation.
- StreamSpeech performs streaming ASR, simultaneous speech-to-text translation and simultaneous speech-to-speech translation via an "All in One" seamless model.
- StreamSpeech can present intermediate results (i.e., ASR or translation results) during simultaneous translation, offering a more comprehensive low-latency communication experience.
## 🔥 News

- **[2025.06.17]** We are excited to extend the "All-in-One" feature of StreamSpeech to more general multimodal interaction by developing Stream-Omni. 👉 Refer to the paper, code & demo, and model for more details.
  - Stream-Omni is a GPT-4o-like language-vision-speech chatbot that simultaneously supports interaction across any combination of text, vision, and speech modalities.
  - Stream-Omni can simultaneously produce intermediate textual results (e.g., ASR transcriptions and model responses) during speech interaction, like the advanced voice mode of GPT-4o.
- **[2024.06.17]** Web GUI demo added; you can now experience StreamSpeech in your local browser.
- **[2024.06.05]** Paper, code, models, and demo of StreamSpeech are available!
## ⭐ Features

### Support 8 Tasks
- Offline: Speech Recognition (ASR)✅, Speech-to-Text Translation (S2TT)✅, Speech-to-Speech Translation (S2ST)✅, Speech Synthesis (TTS)✅
- Simultaneous: Streaming ASR✅, Simultaneous S2TT✅, Simultaneous S2ST✅, Real-time TTS✅ under any latency (with one model)
### GUI Demo
https://github.com/ictnlp/StreamSpeech/assets/34680227/4d9bdabf-af66-4320-ae7d-0f23e721cd71
<p align="center"> Simultaneously provide ASR, translation, and synthesis results via a seamless model </p>

### Case
Speech Input: example/wavs/common_voice_fr_17301936.mp3
Transcription (ground truth): jai donc lexpérience des années passées jen dirai un mot tout à lheure
Translation (ground truth): i therefore have the experience of the passed years i'll say a few words about that later
| StreamSpeech | Simultaneous | Offline |
| --- | --- | --- |
| Speech Recognition | jai donc expérience des années passé jen dirairai un mot tout à lheure | jai donc lexpérience des années passé jen dirairai un mot tout à lheure |
| Speech-to-Text Translation | i therefore have an experience of last years i will tell a word later | so i have the experience in the past years i'll say a word later |
| Speech-to-Speech Translation | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/ed41ba13-353b-489b-acfa-85563d0cc2cb' width="30%"/> | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/ca482ba6-76da-4619-9dfd-24aa2eb3339a' width="30%"/> |
| Text-to-Speech Synthesis (incrementally synthesize speech word by word) | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/294f1310-eace-4914-be30-5cd798e8592e' width="30%"/> | <video src='https://github.com/zhangshaolei1998/StreamSpeech_dev/assets/34680227/52854163-7fc5-4622-a5a6-c133cbd99e58' width="30%"/> |
## ⚙ Requirements
- Python == 3.10, PyTorch == 2.0.1; install fairseq & SimulEval from the bundled directories:

```shell
cd fairseq
pip install --editable ./ --no-build-isolation
cd ../SimulEval
pip install --editable ./
```
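To confirm the editable installs succeeded, a quick sanity check can be run. This is a sketch of our own; it assumes `fairseq` and `simuleval` are the import names of the two packages:

```python
import importlib


def check_install(packages=("fairseq", "simuleval")) -> list[str]:
    """Return the names of required packages that fail to import."""
    missing = []
    for name in packages:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

An empty return value means both packages are importable from the current environment.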
## 🚀 Quick Start

### 1. Model Download

#### (1) StreamSpeech Models
| Language | UnitY | StreamSpeech (offline) | StreamSpeech (simultaneous) |
| --- | --- | --- | --- |
| Fr-En | unity.fr-en.pt [Huggingface] [Baidu] | streamspeech.offline.fr-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.fr-en.pt [Huggingface] [Baidu] |
| Es-En | unity.es-en.pt [Huggingface] [Baidu] | streamspeech.offline.es-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.es-en.pt [Huggingface] [Baidu] |
| De-En | unity.de-en.pt [Huggingface] [Baidu] | streamspeech.offline.de-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.de-en.pt [Huggingface] [Baidu] |
#### (2) Unit-based HiFi-GAN Vocoder
| Unit config | Unit size | Vocoder language | Dataset | Model |
| --- | --- | --- | --- | --- |
| mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config |
### 2. Prepare Data and Config (only for test/inference)

#### (1) Config Files
Replace `/data/zhangshaolei/StreamSpeech` in the files `configs/fr-en/config_gcmvn.yaml` and `configs/fr-en/config_mtl_asr_st_ctcst.yaml` with the local path of your StreamSpeech repo.
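If you prefer not to edit the YAML files by hand, the substitution can be scripted. This is a hypothetical helper (not part of the repo) that rewrites the placeholder path in any list of config files:

```python
from pathlib import Path

# Placeholder path that ships in the provided config files.
PLACEHOLDER = "/data/zhangshaolei/StreamSpeech"


def point_configs_at(repo_root: str, config_paths: list[str]) -> None:
    """Replace the placeholder path in each config file with repo_root."""
    for path in map(Path, config_paths):
        text = path.read_text(encoding="utf-8")
        path.write_text(text.replace(PLACEHOLDER, repo_root), encoding="utf-8")
```

For example, `point_configs_at("/home/me/StreamSpeech", ["configs/fr-en/config_gcmvn.yaml", "configs/fr-en/config_mtl_asr_st_ctcst.yaml"])` updates both files in place.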
#### (2) Test Data

Prepare the test data following the SimulEval format. `example/` provides an example:

- `wav_list.txt`: each line records the path of a source speech file.
- `target.txt`: each line records the reference text, e.g., the target translation or source transcription (used to compute the metrics).
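As an illustration of the expected layout, here is a hypothetical helper (not shipped with StreamSpeech) that writes a matching `wav_list.txt` / `target.txt` pair from a directory of audio files and a dict of reference texts; the two files must stay line-aligned:

```python
from pathlib import Path


def write_simuleval_inputs(wav_dir: str, references: dict[str, str], out_dir: str) -> int:
    """Write wav_list.txt (one audio path per line) and target.txt
    (the matching reference text per line), in the same order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wavs = sorted(Path(wav_dir).glob("*.mp3")) + sorted(Path(wav_dir).glob("*.wav"))
    with open(out / "wav_list.txt", "w") as wf, open(out / "target.txt", "w") as tf:
        for wav in wavs:
            wf.write(str(wav.resolve()) + "\n")
            tf.write(references[wav.stem] + "\n")
    return len(wavs)
```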
### 3. Inference with SimulEval

Run the following scripts to run inference with StreamSpeech on streaming ASR, simultaneous S2TT, and simultaneous S2ST.
- `--source-segment-size`: set the chunk size (in milliseconds) to any value to control the latency.
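To see what this flag controls, here is a minimal sketch of fixed-size segmentation (our own illustration, not SimulEval's implementation): the source audio is delivered to the agent in chunks of `--source-segment-size` milliseconds, so a smaller value means the model sees, and can start translating, the audio sooner.

```python
def segment_stream(samples: list[float], sample_rate: int, segment_ms: int):
    """Yield successive fixed-size chunks of audio, as a streaming agent
    would receive them; smaller segment_ms -> lower latency, more updates."""
    step = int(sample_rate * segment_ms / 1000)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]
```

For example, one second of 16 kHz audio with a 320 ms segment size arrives as three full 5120-sample chunks plus one 640-sample remainder.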