SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing

Unified-modal speech-text pre-training for spoken language processing:

SpeechT5 (ACL 2022): SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

Speech2C (INTERSPEECH 2022): Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

YiTrans (IWSLT 2022): The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task

SpeechUT (EMNLP 2022): SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

SpeechLM (IEEE/ACM TASLP): SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Speech2S (ICASSP 2023): Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Prosody-SpeechT5 (ICASSP 2023): Prosody-aware SpeechT5 for Expressive Neural TTS

VATLM (IEEE Transactions on Multimedia): VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

VALL-E X (arXiv 2023): Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

VioLA (arXiv 2023): VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

WavLLM (arXiv 2024): WavLLM: Towards Robust and Adaptive Speech Large Language Model

<!-- Model introductions, evaluation results, and model inference instructions are located in the corresponding folders. The source code is [https://github.com/microsoft/SpeechT5/tree/main/ModelName]. -->

Update

  • April, 2024: WavLLM arXiv.
  • March, 2024: SpeechLM was accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  • May, 2023: VioLA arXiv.
  • May, 2023: VATLM was accepted by IEEE Transactions on Multimedia.
  • March, 2023: VALL-E X arXiv and Demo.
  • February, 2023: Speech2S and Prosody-SpeechT5 were accepted by ICASSP 2023.
  • [HuggingFace Integration] February, 2023: SpeechT5 models are on HuggingFace.
  • [Model Release] November, 2022: VATLM models are released.
  • November, 2022: VATLM arXiv.
  • November, 2022: Speech2S arXiv.
  • [Model Release] October, 2022: SpeechUT models are released.
  • October, 2022: SpeechUT was accepted by EMNLP 2022.
  • [Model Release] October, 2022: SpeechLM models are released.
  • September, 2022: SpeechLM arXiv.
  • [Evaluation] June, 2022: The end-to-end ST system YiTrans achieved top results on IWSLT 2022 shared tasks.
  • June, 2022: Speech2C was accepted by INTERSPEECH 2022.
  • [Model Release] May, 2022: Speech2C models are released.
  • [Model Release] April, 2022: SpeechT5 models are released.
  • March, 2022: Speech2C arXiv.
  • February, 2022: SpeechT5 was accepted by ACL 2022.
  • October, 2021: SpeechT5 arXiv.

Pre-Trained Models

| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| :------: | :----------------------------------------------: | :-----------------: | :-----: |
| SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | - | HuggingFace<br /> Google Drive |
| SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | 100 hrs LibriSpeech | HuggingFace<br /> Google Drive |
| SpeechT5 Large | 60k hrs Libri-Light + LibriSpeech LM Dataset | - | Google Drive |
| Speech2C | 960 hrs LibriSpeech | - | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 10 hrs LibriSpeech | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | - | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-H Base | 960 hrs LibriSpeech + 40M Text | - | Google Drive |
| SpeechLM-H Base | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-De CoVoST-2 | [Azure Storage] |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Ca CoVoST-2 | [Azure Storage] |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Ar CoVoST-2 | [Azure Storage] |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Tr CoVoST-2 | [Azure Storage] |
| SpeechLM-P Large | 60k hrs Libri-Light + 40M Text | - | Google Drive |
| SpeechLM-P Large | 60k hrs Libri-Light + 40M Text | 960 hrs LibriSpeech | Google Drive |
| SpeechLM-P Large | 60k hrs Libri-Light + 40M Text | En-De CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs Libri-Light + 40M Text | En-Ca CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs Libri-Light + 40M Text | En-Ar CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs Libri-Light + 40M Text |
