# ESPnet: end-to-end speech processing toolkit
|system/pytorch ver.|2.5.1|2.7.1|2.8.0|2.9.1|
| :---- | :---: | :---: | :---: | :---: |
|ubuntu/python3.10/pip| | | | |
|ubuntu/python3.12/pip| | | | |
|ubuntu/python3.10/conda| | | | |
|debian12/python3.10/conda| | | | |
|windows/python3.10/pip| | | | |
|macos/python3.10/pip| | | | |
|macos/python3.10/conda| | | | |
Docs | Example (ESPnet2) | Docker | Notebook
ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.
## Tutorial Series
- 2019 Tutorial at Interspeech
- 2021 Tutorial at CMU
- 2022 Tutorial at CMU
  - Usage of ESPnet (ASR as an example)
  - Add new models/tasks to ESPnet
## Key Features

### Kaldi-style complete recipe
- Support numbers of `ASR` recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, Gigaspeech, etc.)
- Support numbers of `TTS` recipes in a similar manner to the ASR recipe (LJSpeech, LibriTTS, M-AILABS, etc.)
- Support numbers of `ST` recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
- Support numbers of `MT` recipes (IWSLT'14, IWSLT'16, the above ST recipes, etc.)
- Support numbers of `SLU` recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
- Support numbers of `SE/SS` recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
- Support voice conversion recipe (VCC2020 baseline)
- Support speaker diarization recipe (mini_librispeech, librimix)
- Support singing voice synthesis recipe (ofuton_p_utagoe_db, opencpop, m4singer, etc.)
### ASR: Automatic Speech Recognition
- State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
- Hybrid CTC/attention based end-to-end ASR
  - Fast/accurate training with CTC/attention multitask training
  - CTC/attention joint decoding to boost monotonic alignment decoding
  - Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer, Branchformer, or E-Branchformer
  - Decoder: RNN (LSTM/GRU), Transformer, or S4
  - Attention: Flash Attention, dot product, location-aware attention, variants of multi-head
  - Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  - Batch GPU decoding
  - Data augmentation
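The multitask training above interpolates a CTC loss on the encoder outputs with the attention decoder's cross-entropy. A minimal PyTorch sketch of the combined objective, using toy tensors; `lam` is a hypothetical interpolation weight playing the role of ESPnet's CTC weight, not the toolkit's actual API:

```python
import torch
import torch.nn as nn

# Toy dimensions; a real model infers these from the data.
vocab, T_enc, T_out, batch = 10, 50, 12, 2
lam = 0.3  # hypothetical CTC weight in [0, 1]

enc_out = torch.randn(T_enc, batch, vocab).log_softmax(-1)  # encoder frames
dec_out = torch.randn(batch, T_out, vocab)                  # decoder logits
targets = torch.randint(1, vocab, (batch, T_out))           # label ids (0 = blank)

ctc = nn.CTCLoss(blank=0)
att = nn.CrossEntropyLoss(label_smoothing=0.1)

loss_ctc = ctc(enc_out, targets,
               torch.full((batch,), T_enc), torch.full((batch,), T_out))
loss_att = att(dec_out.reshape(-1, vocab), targets.reshape(-1))

# Multitask objective: L = lam * L_ctc + (1 - lam) * L_att
loss = lam * loss_ctc + (1 - lam) * loss_att
```

Joint decoding applies the same kind of interpolation at search time, combining CTC and attention scores for each partial hypothesis to favor monotonic alignments.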
- Transducer based end-to-end ASR
  - Architecture:
  - Search algorithms:
    - Greedy search constrained to one emission per timestep.
    - Default beam search algorithm [Graves, 2012] without prefix search.
    - Alignment-Length Synchronous decoding [Saon et al., 2020].
    - Time Synchronous Decoding [Saon et al., 2020].
    - N-step Constrained beam search modified from [Kim et al., 2020].
    - modified Adaptive Expansion Search based on [Kim et al., 2021] and NSC.
  - Features:
    - Unified interface for offline and streaming speech recognition.
    - Multi-task learning with various auxiliary losses:
      - Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
      - Decoder: cross-entropy w/ label smoothing.
    - Transfer learning with an acoustic model and/or language model.
    - Training with FastEmit regularization method [Yu et al., 2021].

  Please refer to the tutorial page for complete documentation.
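As an illustration of the first search algorithm, greedy transducer decoding with at most one emission per timestep can be sketched as below. The `joint` callable is a hypothetical stand-in for a trained joint network, not ESPnet's actual interface:

```python
import torch

def greedy_transducer(enc, joint, blank=0):
    """Greedy transducer search: at most one non-blank emission per frame.

    enc: (T, H) encoder frames; joint(frame, prefix) -> (V,) logits.
    """
    hyp = []
    for t in range(enc.size(0)):
        logits = joint(enc[t], hyp)
        k = int(logits.argmax())
        if k != blank:      # emit one label, then advance time regardless
            hyp.append(k)
    return hyp

# Toy "joint network": a linear projection of the frame (ignores the prefix).
torch.manual_seed(0)
V, H, T = 6, 4, 8
W = torch.randn(V, H)
enc = torch.randn(T, H)
hyp = greedy_transducer(enc, lambda frame, prefix: W @ frame)
```

The beam search variants listed above relax the one-emission constraint and keep multiple partial hypotheses per step instead of the single argmax.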
- CTC segmentation
- Non-autoregressive model based on Mask-CTC
- ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
- Wav2Vec2.0 pre-trained model as Encoder, imported from FairSeq.
- Self-supervised learning representations as features, using upstream models in S3PRL in frontend.
  - Set `frontend` to `s3prl`
  - Select any upstream model by setting `frontend_conf` to the corresponding name.
- Transfer learning:
  - Easy usage and transfers from models previously trained by your group, or models from the [ESPnet Hugging Face repository](https://huggingface.co/espn)
