<div align="left"><img src="doc/image/espnet_logo1.png" width="550"/></div>

ESPnet: end-to-end speech processing toolkit

|system/pytorch ver.|2.5.1|2.7.1|2.8.0|2.9.1|
| :---- | :---: | :---: | :---: | :---: |
|ubuntu/python3.10/pip|ci on ubuntu|ci on ubuntu|ci on ubuntu|ci on ubuntu|
|ubuntu/python3.12/pip|ci on ubuntu|ci on ubuntu|ci on ubuntu|ci on ubuntu|
|ubuntu/python3.10/conda||ci on debian12|||
|debian12/python3.10/conda||ci on debian12|||
|windows/python3.10/pip||ci on windows|||
|macos/python3.10/pip||ci on macos|||
|macos/python3.10/conda||ci on macos|||

<div align="center">

Docs | Example (ESPnet2) | Docker | Notebook

</div>

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and more. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.
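
As a rough illustration of what "Kaldi-style data processing" implies in practice, the sketch below writes the three plain-text index files that a Kaldi-style data directory typically contains: `wav.scp` (utterance ID to audio path), `text` (utterance ID to transcript), and `utt2spk` (utterance ID to speaker ID). The helper name `write_kaldi_data_dir`, the tuple layout, and the example paths are assumptions made for this sketch and are not part of ESPnet.

```python
import tempfile
from pathlib import Path


def write_kaldi_data_dir(data_dir, utterances):
    """Write wav.scp, text, and utt2spk.

    `utterances` holds (utt_id, spk_id, wav_path, transcript) tuples.
    This helper is illustrative only; it is not an ESPnet function.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    # Kaldi tools expect each file sorted by utterance ID (the first column).
    rows = sorted(utterances, key=lambda row: row[0])
    files = {
        "wav.scp": [f"{utt} {wav}" for utt, _, wav, _ in rows],
        "text": [f"{utt} {txt}" for utt, _, _, txt in rows],
        "utt2spk": [f"{utt} {spk}" for utt, spk, _, _ in rows],
    }
    for name, lines in files.items():
        (data_dir / name).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return data_dir


# Hypothetical utterances; IDs are prefixed with the speaker ID so that
# sorting by utterance also groups utterances by speaker.
example = [
    ("spk2-utt1", "spk2", "/path/to/spk2_utt1.wav", "GOOD MORNING"),
    ("spk1-utt1", "spk1", "/path/to/spk1_utt1.wav", "HELLO WORLD"),
]

with tempfile.TemporaryDirectory() as tmp:
    out = write_kaldi_data_dir(Path(tmp) / "train", example)
    print((out / "wav.scp").read_text(encoding="utf-8"))
```

A real recipe usually also needs `spk2utt`, which is normally derived from `utt2spk`, and the recipe's own data preparation script remains the authoritative source for the expected layout.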

Tutorial Series

Key Features

Kaldi-style complete recipe

  • Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, Gigaspeech, etc.)
  • Supports a number of TTS recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
  • Supports a number of MT recipes (IWSLT'14, IWSLT'16, the above ST recipes, etc.)
  • Supports a number of SLU recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
  • Supports a number of SE/SS recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
  • Supports a voice conversion recipe (VCC2020 baseline)
  • Supports speaker diarization recipes (mini_librispeech, librimix)
  • Supports singing voice synthesis recipes (ofuton_p_utagoe_db, opencpop, m4singer, etc.)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention based end-to-end ASR
    • Fast/accurate training with CTC/attention multitask training
    • CTC/attention joint decoding to boost monotonic alignment decoding
    • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer, Branchformer, or E-Branchformer
    • Decoder: RNN (LSTM/GRU), Transformer, or S4
  • Attention: Flash Attention, Dot product, location-aware attention, variants of multi-head
  • Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  • Batch GPU decoding
  • Data augmentation
  • Transducer based end-to-end ASR
    • Architecture:
      • Custom encoder supporting RNNs, Conformer, Branchformer (w/ variants), 1D Conv / TDNN.
      • Decoder w/ parameters shared across blocks supporting RNN, stateless w/ 1D Conv, MEGA, and RWKV.
      • Pre-encoder: VGG2L or Conv2D available.
    • Search algorithms:
    • Features:
      • Unified interface for offline and streaming speech recognition.
      • Multi-task learning with various auxiliary losses:
        • Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
        • Decoder: cross-entropy w/ label smoothing.
      • Transfer learning with an acoustic model and/or language model.
      • Training with FastEmit regularization method [Yu et al., 2021].

    Please refer to the tutorial page for complete documentation.
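
For the hybrid CTC/attention items listed earlier in this section, the general idea behind multitask training and joint decoding can be sketched as a weighted interpolation of the two branches. The code below is a simplified illustration and not ESPnet's implementation: the function names, the `ctc_weight` default, and the hypothesis dictionaries are assumptions made here, and the decoding part rescores complete hypotheses, whereas a real joint decoder combines CTC prefix scores with attention scores inside beam search.

```python
def multitask_loss(loss_ctc, loss_att, ctc_weight=0.3):
    """Interpolate the CTC and attention losses with a weight in [0, 1]."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att


def joint_score(logp_ctc, logp_att, ctc_weight=0.3, logp_lm=0.0, lm_weight=0.0):
    """Combine log-probabilities from the CTC, attention, and optional LM branches."""
    return (
        ctc_weight * logp_ctc
        + (1.0 - ctc_weight) * logp_att
        + lm_weight * logp_lm
    )


def pick_best(hypotheses, ctc_weight=0.3):
    """Return the hypothesis with the highest joint score (rescoring only)."""
    return max(
        hypotheses,
        key=lambda h: joint_score(h["logp_ctc"], h["logp_att"], ctc_weight),
    )


# Hypothetical scores for two complete hypotheses.
hyps = [
    {"text": "A", "logp_ctc": -4.0, "logp_att": -1.0},
    {"text": "B", "logp_ctc": -2.0, "logp_att": -2.0},
]
print(pick_best(hyps, ctc_weight=0.0)["text"])  # attention score only
print(pick_best(hyps, ctc_weight=0.5)["text"])  # equal weighting
```

With the attention score alone, hypothesis A wins. Once the CTC score carries equal weight, its penalty on A lets B overtake it, which is the intuition behind using CTC to discourage hypotheses with poor monotonic alignment.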

  • CTC segmentation
  • Non-autoregressive model based on Mask-CTC
  • ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
  • Wav2Vec2.0 pre-trained model as Encoder, imported from FairSeq.
  • Self-supervised learning representations as features, using upstream models in S3PRL in frontend.
    • Set frontend to s3prl
    • Select any upstream model by setting the frontend_conf to the corresponding name.
  • Transfer learning:
    • easy usage and transfers from models previously trained by your group or models from [ESPnet Hugging Face repository](https://huggingface.co/espn