# ESPnet: end-to-end speech processing toolkit
|system/pytorch ver.|2.5.1|2.7.1|2.8.0|2.9.1|
| :---- | :---: | :---: | :---: | :---: |
|ubuntu/python3.10/pip| | | | |
|ubuntu/python3.12/pip| | | | |
|ubuntu/python3.10/conda| | | | |
|debian12/python3.10/conda| | | | |
|windows/python3.10/pip| | | | |
|macos/python3.10/pip| | | | |
|macos/python3.10/conda| | | | |
Docs | Example (ESPnet2) | Docker | Notebook
ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.
## Tutorial Series
- 2019 Tutorial at Interspeech
- 2021 Tutorial at CMU
- 2022 Tutorial at CMU
  - Usage of ESPnet (ASR as an example)
  - Add new models/tasks to ESPnet
## Key Features

### Kaldi-style complete recipe
- Support numbers of `ASR` recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, Gigaspeech, etc.)
- Support numbers of `TTS` recipes in a similar manner to the ASR recipe (LJSpeech, LibriTTS, M-AILABS, etc.)
- Support numbers of `ST` recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
- Support numbers of `MT` recipes (IWSLT'14, IWSLT'16, the above ST recipes, etc.)
- Support numbers of `SLU` recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
- Support numbers of `SE/SS` recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
- Support voice conversion recipe (VCC2020 baseline)
- Support speaker diarization recipe (mini_librispeech, librimix)
- Support singing voice synthesis recipe (ofuton_p_utagoe_db, opencpop, m4singer, etc.)
### ASR: Automatic Speech Recognition
- State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
- Hybrid CTC/attention based end-to-end ASR
  - Fast/accurate training with CTC/attention multitask training
  - CTC/attention joint decoding to boost monotonic alignment decoding
  - Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer, Branchformer, or E-Branchformer
  - Decoder: RNN (LSTM/GRU), Transformer, or S4
  - Attention: Flash Attention, dot product, location-aware attention, variants of multi-head
  - Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  - Batch GPU decoding
  - Data augmentation
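The multitask training above interpolates a CTC loss on the encoder outputs with the attention decoder's cross-entropy. A minimal PyTorch sketch of the combined objective, using toy tensors; `lam` is a hypothetical interpolation weight playing the role of ESPnet's CTC weight, not the toolkit's actual API:

```python
import torch
import torch.nn as nn

# Toy dimensions; a real model infers these from the data.
vocab, T_enc, T_out, batch = 10, 50, 12, 2
lam = 0.3  # hypothetical CTC weight in [0, 1]

enc_out = torch.randn(T_enc, batch, vocab).log_softmax(-1)  # encoder frames
dec_out = torch.randn(batch, T_out, vocab)                  # decoder logits
targets = torch.randint(1, vocab, (batch, T_out))           # label ids (0 = blank)

ctc = nn.CTCLoss(blank=0)
att = nn.CrossEntropyLoss(label_smoothing=0.1)

loss_ctc = ctc(enc_out, targets,
               torch.full((batch,), T_enc), torch.full((batch,), T_out))
loss_att = att(dec_out.reshape(-1, vocab), targets.reshape(-1))

# Multitask objective: L = lam * L_ctc + (1 - lam) * L_att
loss = lam * loss_ctc + (1 - lam) * loss_att
```

Joint decoding applies the same kind of interpolation at search time, combining CTC and attention scores for each partial hypothesis to favor monotonic alignments.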
- Transducer based end-to-end ASR
  - Architecture:
  - Search algorithms:
    - Greedy search constrained to one emission per timestep.
    - Default beam search algorithm [Graves, 2012] without prefix search.
    - Alignment-Length Synchronous decoding [Saon et al., 2020].
    - Time Synchronous Decoding [Saon et al., 2020].
    - N-step Constrained beam search modified from [Kim et al., 2020].
    - modified Adaptive Expansion Search based on [Kim et al., 2021] and NSC.
  - Features:
    - Unified interface for offline and streaming speech recognition.
    - Multi-task learning with various auxiliary losses:
      - Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
      - Decoder: cross-entropy w/ label smoothing.
    - Transfer learning with an acoustic model and/or language model.
    - Training with FastEmit regularization method [Yu et al., 2021].

  Please refer to the tutorial page for complete documentation.
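As an illustration of the first search algorithm, greedy transducer decoding with at most one emission per timestep can be sketched as below. The `joint` callable is a hypothetical stand-in for a trained joint network, not ESPnet's actual interface:

```python
import torch

def greedy_transducer(enc, joint, blank=0):
    """Greedy transducer search: at most one non-blank emission per frame.

    enc: (T, H) encoder frames; joint(frame, prefix) -> (V,) logits.
    """
    hyp = []
    for t in range(enc.size(0)):
        logits = joint(enc[t], hyp)
        k = int(logits.argmax())
        if k != blank:      # emit one label, then advance time regardless
            hyp.append(k)
    return hyp

# Toy "joint network": a linear projection of the frame (ignores the prefix).
torch.manual_seed(0)
V, H, T = 6, 4, 8
W = torch.randn(V, H)
enc = torch.randn(T, H)
hyp = greedy_transducer(enc, lambda frame, prefix: W @ frame)
```

The beam search variants listed above relax the one-emission constraint and keep multiple partial hypotheses per step instead of the single argmax.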
- CTC segmentation
- Non-autoregressive model based on Mask-CTC
- ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
- Wav2Vec2.0 pre-trained model as Encoder, imported from FairSeq.
- Self-supervised learning representations as features, using upstream models in S3PRL in frontend.
  - Set `frontend` to `s3prl`
  - Select any upstream model by setting `frontend_conf` to the corresponding name.
- Transfer learning:
  - Easy usage and transfers from models previously trained by your group, or models from the [ESPnet Hugging Face repository](https://huggingface.co/espn)
