# SpeechT5
Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
Unified-modal speech-text pre-training for spoken language processing:
- SpeechT5 (ACL 2022): SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing
- Speech2C (INTERSPEECH 2022): Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
- YiTrans (IWSLT 2022): The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task
- SpeechUT (EMNLP 2022): SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
- SpeechLM (IEEE/ACM TASLP): SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
- Speech2S (ICASSP 2023): Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation
- Prosody-SpeechT5 (ICASSP 2023): Prosody-aware SpeechT5 for Expressive Neural TTS
- VATLM (IEEE Transactions on Multimedia): VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
- VALL-E X (Arxiv 2023): Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
- VioLA (Arxiv 2023): VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
- WavLLM (Arxiv 2024): WavLLM: Towards Robust and Adaptive Speech Large Language Model

<!-- Model introductions, evaluation results, and model inference instructions are located in the corresponding folders. The source code is [https://github.com/microsoft/SpeechT5/tree/main/ModelName]. -->
## Update
- April, 2024: WavLLM Arxiv.
- March, 2024: SpeechLM was accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- May, 2023: VioLA Arxiv.
- May, 2023: VATLM was accepted by IEEE Transactions on Multimedia.
- March, 2023: VALL-E X Arxiv and Demo.
- February, 2023: Speech2S and Prosody-SpeechT5 were accepted by ICASSP 2023.
- [HuggingFace Integration] February, 2023: SpeechT5 models are on HuggingFace.
- [Model Release] November, 2022: VATLM models are released.
- November, 2022: VATLM Arxiv.
- November, 2022: Speech2S Arxiv.
- [Model Release] October, 2022: SpeechUT models are released.
- October, 2022: SpeechUT was accepted by EMNLP 2022.
- [Model Release] October, 2022: SpeechLM models are released.
- September, 2022: SpeechLM Arxiv.
- [Evaluation] June, 2022: The end-to-end ST system YiTrans achieved top results on IWSLT 2022 shared tasks.
- June, 2022: Speech2C was accepted by INTERSPEECH 2022.
- [Model Release] May, 2022: Speech2C models are released.
- [Model Release] April, 2022: SpeechT5 models are released.
- March, 2022: Speech2C Arxiv.
- February, 2022: SpeechT5 was accepted by ACL 2022.
- October, 2021: SpeechT5 Arxiv.
## Pre-Trained Models
| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| :------: | :----------------------------------------------: | :-----------------: | :-----: |
| SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | - | HuggingFace<br /> Google Drive |
| SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | 100 hrs LibriSpeech | HuggingFace<br /> Google Drive |
| SpeechT5 Large | 60k hrs Libri-Light + LibriSpeech LM Dataset | - | Google Drive |
| Speech2C | 960 hrs LibriSpeech | - | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 10 hrs LibriSpeech | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | - | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-H Base | 960 hrs LibriSpeech + 40M Text | - | Google Drive |
| SpeechLM-H Base | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-De CoVoST-2 | Azure Storage |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Ca CoVoST-2 | Azure Storage |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Ar CoVoST-2 | Azure Storage |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Tr CoVoST-2 | Azure Storage |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | - | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | 960 hrs LibriSpeech | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-De CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-Ca CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-Ar CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text |
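Since the SpeechT5 checkpoints are available on HuggingFace (see the February 2023 update above), a minimal text-to-speech sketch using the `transformers` integration might look like the following. The Hub ids `microsoft/speecht5_tts` and `microsoft/speecht5_hifigan` come from that integration; the zero speaker embedding is a placeholder assumption, not part of this repository's recipes.

```python
# Hedged quick-start sketch for the HuggingFace SpeechT5 TTS checkpoints.
# Requires `pip install transformers torch`; weights download on first use.

TTS_CHECKPOINT = "microsoft/speecht5_tts"
VOCODER_CHECKPOINT = "microsoft/speecht5_hifigan"

def synthesize(text: str):
    """Return a 1-D waveform tensor synthesized from `text`."""
    # Imports are kept inside the function so the sketch stays import-light
    # until synthesis is actually requested.
    import torch
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained(TTS_CHECKPOINT)
    model = SpeechT5ForTextToSpeech.from_pretrained(TTS_CHECKPOINT)
    vocoder = SpeechT5HifiGan.from_pretrained(VOCODER_CHECKPOINT)

    inputs = processor(text=text, return_tensors="pt")
    # Real use would pass x-vector speaker embeddings from a
    # speaker-verification model; a zero vector of the expected
    # size (512) is a neutral placeholder.
    speaker_embeddings = torch.zeros(1, 512)
    return model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# Example (downloads the checkpoints on first run):
#   waveform = synthesize("SpeechT5 speaks.")
```

For the fine-tuned ASR checkpoint in the table, the analogous classes in `transformers` are `SpeechT5Processor` and `SpeechT5ForSpeechToText` with the `microsoft/speecht5_asr` Hub id.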
