# SpeechT5
Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
Unified-modal speech-text pre-training for spoken language processing:
- SpeechT5 (ACL 2022): SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing
- Speech2C (INTERSPEECH 2022): Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
- YiTrans (IWSLT 2022): The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task
- SpeechUT (EMNLP 2022): SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
- SpeechLM (IEEE/ACM TASLP): SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
- Speech2S (ICASSP 2023): Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation
- Prosody-SpeechT5 (ICASSP 2023): Prosody-aware SpeechT5 for Expressive Neural TTS
- VATLM (IEEE Transactions on Multimedia): VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
- VALL-E X (Arxiv 2023): Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
- VioLA (Arxiv 2023): VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
- WavLLM (Arxiv 2024): WavLLM: Towards Robust and Adaptive Speech Large Language Model

<!-- Model introductions, evaluation results, and model inference instructions are located in the corresponding folders. The source code is [https://github.com/microsoft/SpeechT5/tree/main/ModelName]. -->
## Update
- April, 2024: WavLLM Arxiv.
- March, 2024: SpeechLM was accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- May, 2023: VioLA Arxiv.
- May, 2023: VATLM was accepted by IEEE Transactions on Multimedia.
- March, 2023: VALL-E X Arxiv and Demo.
- February, 2023: Speech2S and Prosody-SpeechT5 were accepted by ICASSP 2023.
- [HuggingFace Integration] February, 2023: SpeechT5 models are on HuggingFace.
- [Model Release] November, 2022: VATLM models are released.
- November, 2022: VATLM Arxiv.
- November, 2022: Speech2S Arxiv.
- [Model Release] October, 2022: SpeechUT models are released.
- October, 2022: SpeechUT was accepted by EMNLP 2022.
- [Model Release] October, 2022: SpeechLM models are released.
- September, 2022: SpeechLM Arxiv.
- [Evaluation] June, 2022: The end-to-end ST system YiTrans achieved top results on IWSLT 2022 shared tasks.
- June, 2022: Speech2C was accepted by INTERSPEECH 2022.
- [Model Release] May, 2022: Speech2C models are released.
- [Model Release] April, 2022: SpeechT5 models are released.
- March, 2022: Speech2C Arxiv.
- February, 2022: SpeechT5 was accepted by ACL 2022.
- October, 2021: SpeechT5 Arxiv.
## Pre-Trained Models
| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| :------: | :----------------------------------------------: | :-----------------: | :-----: |
| SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | - | HuggingFace<br /> Google Drive |
| SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | 100 hrs LibriSpeech | HuggingFace<br /> Google Drive |
| SpeechT5 Large | 60k hrs Libri-Light + LibriSpeech LM Dataset | - | Google Drive |
| Speech2C | 960 hrs LibriSpeech | - | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 10 hrs LibriSpeech | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | - | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-H Base | 960 hrs LibriSpeech + 40M Text | - | Google Drive |
| SpeechLM-H Base | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-De CoVoST-2 | Azure Storage |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Ca CoVoST-2 | Azure Storage |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Ar CoVoST-2 | Azure Storage |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Tr CoVoST-2 | Azure Storage |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | - | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | 960 hrs LibriSpeech | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-De CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-Ca CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-Ar CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text |
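Since the SpeechT5 checkpoints are available on HuggingFace (see the February 2023 update above), a minimal text-to-speech sketch using the `transformers` integration might look like the following. The Hub ids `microsoft/speecht5_tts` and `microsoft/speecht5_hifigan` come from that integration; the zero speaker embedding is a placeholder assumption, not part of this repository's recipes.

```python
# Hedged quick-start sketch for the HuggingFace SpeechT5 TTS checkpoints.
# Requires `pip install transformers torch`; weights download on first use.

TTS_CHECKPOINT = "microsoft/speecht5_tts"
VOCODER_CHECKPOINT = "microsoft/speecht5_hifigan"

def synthesize(text: str):
    """Return a 1-D waveform tensor synthesized from `text`."""
    # Imports are kept inside the function so the sketch stays import-light
    # until synthesis is actually requested.
    import torch
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained(TTS_CHECKPOINT)
    model = SpeechT5ForTextToSpeech.from_pretrained(TTS_CHECKPOINT)
    vocoder = SpeechT5HifiGan.from_pretrained(VOCODER_CHECKPOINT)

    inputs = processor(text=text, return_tensors="pt")
    # Real use would pass x-vector speaker embeddings from a
    # speaker-verification model; a zero vector of the expected
    # size (512) is a neutral placeholder.
    speaker_embeddings = torch.zeros(1, 512)
    return model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# Example (downloads the checkpoints on first run):
#   waveform = synthesize("SpeechT5 speaks.")
```

For the fine-tuned ASR checkpoint in the table, the analogous classes in `transformers` are `SpeechT5Processor` and `SpeechT5ForSpeechToText` with the `microsoft/speecht5_asr` Hub id.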
