An Apache 2.0 ASR research library, built on PyTorch, for developing end-to-end speech recognition models.

</div>

<p align="center"> <a href="https://github.com/sooftware/kospeech#introduction">Introduction</a> • <a href="https://github.com/sooftware/kospeech#roadmap">Roadmap</a> • <a href="https://sooftware.github.io/kospeech/">Docs</a> • <a href="https://www.codefactor.io/repository/github/sooftware/kospeech">Codefactor</a> • <a href="https://github.com/sooftware/kospeech/blob/main/LICENSE">License</a> • <a href="https://gitter.im/Korean-Speech-Recognition/community">Gitter</a> • <a href="https://www.sciencedirect.com/science/article/pii/S2665963821000026">Paper</a> </p>

</div>

This repository has been archived. If you came here for one of the reasons below, I recommend a different repository for each case.

I want to train my own voice recognition model or study internal code! → OpenSpeech
I want to test the trained Korean speech recognition model right away! → Pororo ASR or Whisper

What's New

May 2021: Fix LayerNorm Error, Subword Error
February 2021: Update Documentation
February 2021: Add RNN-Transducer model
January 2021: Release v1.3
January 2021: Add Conformer model
January 2021: Add Jasper model
January 2021: Add Joint CTC-Attention Transformer model
January 2021: Add Speech Transformer model
January 2021: Apply Hydra: framework for elegantly configuring complex applications

Note

Not long ago, I modified a lot of the code, but I was busy and could not test every case. If you find an error, please feel free to give me feedback.
Subword and grapheme units are currently not tested.

KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition [Paper]

KoSpeech is a modular, extensible, open-source toolkit for end-to-end Korean automatic speech recognition (ASR), built on the deep learning library PyTorch. Several open-source ASR toolkits have been released, but all of them target non-Korean languages such as English (e.g. ESPnet, Espresso). Although AI Hub has released KsponSpeech, a 1,000-hour Korean speech corpus, there is no established preprocessing method or baseline model against which to compare model performance. We therefore propose preprocessing methods for the KsponSpeech corpus and several baseline models (Deep Speech 2, LAS, Transformer, Jasper, Conformer). We hope KoSpeech can serve as a guideline for those researching Korean speech recognition.

Supported Models

|Acoustic Model|Notes|Citation|
|--------------|------|--------:|
|Deep Speech 2|2D-invariant convolution & RNN & CTC|Dario Amodei et al., 2015|
|Listen Attend Spell (LAS)|Attention based RNN sequence to sequence|William Chan et al., 2016|
|Joint CTC-Attention LAS|Joint CTC-Attention LAS|Suyoun Kim et al., 2017|
|RNN-Transducer|RNN Transducer|Alex Graves, 2012|
|Speech Transformer|Convolutional extractor & transformer|Linhao Dong et al., 2018|
|Jasper|Fully convolutional & dense residual connection & CTC|Jason Li et al., 2019|
|Conformer|Convolution-augmented-Transformer|Anmol Gulati et al., 2020|

Note
The implementations are based on the papers above, but some parts may differ from the original papers.

Introduction

End-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits. Traditional “hybrid” ASR systems, which are comprised of an acoustic model, language model, and pronunciation model, require separate training of these components, each of which can be complex.

For example, training an acoustic model is a multi-stage process of model training and time alignment between the speech acoustic feature sequence and the output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline and models that operate at low audio frame rates. This reduces training time and decoding time, and allows joint optimization with downstream processing such as natural language understanding.

Roadmap

So far, several models are implemented: Deep Speech 2, Listen Attend and Spell (LAS), RNN-Transducer, Speech Transformer, Jasper, Conformer.

Deep Speech 2

Deep Speech 2 delivered faster and more accurate ASR using the Connectionist Temporal Classification (CTC) loss. The model drew attention for significantly improving performance over previous end-to-end models.
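
To make the role of CTC more concrete, here is a minimal sketch of how a CTC-trained model's per-frame predictions are usually turned into a label sequence: repeated labels are merged, then blank symbols are dropped. The function name `ctc_greedy_collapse` and the choice of `0` as the blank id are assumptions for illustration, not KoSpeech's own API.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame label ids: merge consecutive repeats, then drop blanks.

    `blank_id` is an assumed convention for this sketch.
    """
    # groupby yields one key per run of identical consecutive labels
    merged = [label for label, _ in groupby(frame_ids)]
    return [label for label in merged if label != blank_id]


# The blank between the two runs of 3 keeps them as separate output symbols.
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0]
print(ctc_greedy_collapse(frames))  # [3, 3, 5]
```

The blank symbol is what lets the model emit the same label twice in a row, which is why blanks are removed only after repeats have been merged.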

Listen, Attend and Spell (LAS)

We follow the architecture proposed in "Listen, Attend and Spell", with some modifications to improve performance. We provide four attention mechanisms: scaled dot-product attention, additive attention, location-aware attention, and multi-head attention. The choice of attention mechanism strongly affects model performance.
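
As an illustration of the simplest of these mechanisms, the sketch below implements scaled dot-product attention in plain PyTorch. It is a generic version under assumed tensor shapes, not the code KoSpeech ships; the function name and the optional `mask` argument are hypothetical.

```python
import math

import torch


def scaled_dot_product_attention(query, key, value, mask=None):
    """Generic scaled dot-product attention (illustrative sketch).

    query: (batch, q_len, dim); key, value: (batch, k_len, dim).
    mask:  optional bool tensor (batch, q_len, k_len), True = position to ignore.
    """
    dim = query.size(-1)
    # Similarity of every query step to every encoder frame, scaled by sqrt(dim)
    scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(dim)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)   # weights over encoder frames
    context = torch.bmm(attn, value)       # weighted sum of values
    return context, attn


torch.manual_seed(0)
q = torch.randn(2, 1, 8)       # one decoder step per utterance
k = v = torch.randn(2, 5, 8)   # five encoder frames per utterance
context, attn = scaled_dot_product_attention(q, k, v)
print(context.shape, attn.shape)
```

Each row of `attn` sums to one, so `context` is a convex combination of the encoder frames for that decoder step.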

RNN-Transducer

RNN-Transducers are a form of sequence-to-sequence model that does not employ attention mechanisms. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform in our case) to produce an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In our implementation, the output symbols are the characters of the alphabet.

Speech Transformer

Transformer is a powerful architecture in the Natural Language Processing (NLP) field, and it has also shown good performance on ASR tasks. In addition, because research on this model continues in NLP, it has high potential for further development.

Joint CTC-Attention

Joint CTC-Attention is an architecture proposed to take advantage of both CTC-based and attention-based models. It makes the model more robust by adding a CTC objective to the encoder. Joint CTC-Attention can be trained in combination with LAS and Speech Transformer.
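
One common way to combine the two objectives is a weighted sum of the CTC loss on the encoder outputs and the cross-entropy loss on the decoder outputs. The sketch below shows that interpolation with standard PyTorch losses. The function name, the `ctc_weight` default of 0.3, the id conventions, and the use of the same target tensor for both branches are assumptions made to keep the example short; KoSpeech's actual criterion may differ.

```python
import torch
import torch.nn.functional as F


def joint_ctc_attention_loss(encoder_log_probs, encoder_lengths, decoder_logits,
                             targets, target_lengths,
                             ctc_weight=0.3, blank_id=0, pad_id=0):
    """Weighted sum of CTC loss (encoder) and cross-entropy (decoder). Sketch only.

    encoder_log_probs: (time, batch, classes), already log-softmaxed.
    decoder_logits:    (batch, target_len, classes).
    targets:           (batch, target_len), padded with pad_id.
    """
    ctc = F.ctc_loss(encoder_log_probs, targets, encoder_lengths, target_lengths,
                     blank=blank_id, reduction="mean", zero_infinity=True)
    att = F.cross_entropy(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                          targets.reshape(-1), ignore_index=pad_id)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att


torch.manual_seed(0)
T, N, S, C = 20, 2, 4, 6                      # frames, batch, target length, classes
encoder_log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
decoder_logits = torch.randn(N, S, C)
targets = torch.tensor([[1, 2, 3, 4], [2, 3, 0, 0]])   # 0 is padding here
encoder_lengths = torch.tensor([20, 15])
target_lengths = torch.tensor([4, 2])

loss = joint_ctc_attention_loss(encoder_log_probs, encoder_lengths, decoder_logits,
                                targets, target_lengths, blank_id=5, pad_id=0)
print(loss.item())
```

Setting `ctc_weight` to 0 recovers a purely attention-based objective, and setting it to 1 recovers a purely CTC-based one.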

Jasper

Jasper (Just Another SPEech Recognizer) is an end-to-end convolutional neural acoustic model. Jasper showed strong performance using only CNN → BatchNorm → ReLU → Dropout blocks and residual connections.
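
The block structure described above can be sketched in PyTorch as follows. This is a simplified illustration, not KoSpeech's implementation: the class names, kernel size, dropout rate, and channel counts are assumptions, and the dense residual connections across blocks are left out.

```python
import torch
import torch.nn as nn


class JasperSubBlock(nn.Module):
    """Conv1d -> BatchNorm -> (+ residual) -> ReLU -> Dropout. Illustrative only."""

    def __init__(self, in_channels, out_channels, kernel_size=11, dropout=0.2):
        super().__init__()
        # Odd kernel with "same" padding keeps the time dimension unchanged
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, residual=None):
        out = self.bn(self.conv(x))
        if residual is not None:
            out = out + residual
        return self.dropout(self.relu(out))


class JasperBlock(nn.Module):
    """A stack of sub-blocks; the residual is added inside the last sub-block."""

    def __init__(self, in_channels, out_channels, num_sub_blocks=3,
                 kernel_size=11, dropout=0.2):
        super().__init__()
        channels = [in_channels] + [out_channels] * num_sub_blocks
        self.sub_blocks = nn.ModuleList([
            JasperSubBlock(channels[i], channels[i + 1], kernel_size, dropout)
            for i in range(num_sub_blocks)
        ])
        # 1x1 convolution projects the block input to the output channel count
        self.residual = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm1d(out_channels),
        )

    def forward(self, x):
        residual = self.residual(x)
        last = len(self.sub_blocks) - 1
        for i, sub_block in enumerate(self.sub_blocks):
            x = sub_block(x, residual if i == last else None)
        return x


block = JasperBlock(in_channels=80, out_channels=256)
features = torch.randn(4, 80, 100)   # (batch, feature_dim, time)
out = block(features)
print(out.shape)
```

Because the whole block is built from 1D convolutions over time, the output keeps the input's time length while the channel dimension changes from 80 to 256.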

Conformer

Conformer combines convolutional neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. Conformer significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy.

Installation

This project recommends Python 3.7 or higher.
We recommend creating a new virtual environment for this project (using virtual env or conda).

Prerequisites

NumPy: pip install numpy (Refer here if you have problems installing NumPy)
PyTorch: Refer to the PyTorch website to install the version that matches your environment.
Pandas: pip install pandas (Refer here if you have problems installing Pandas)
Matplotlib: pip install matplotlib (Refer here if you have problems installing Matplotlib)
librosa: conda install -c conda-forge librosa (Refer here if you have problems installing librosa)
torchaudio: pip install torchaudio==0.6.0 (Refer here if you have problems installing torchaudio)
tqdm: pip install tqdm (Refer here if you have problems installing tqdm)
sentencepiece: pip install sentencepiece (Refer here if you have problems installing sentencepiece)
warp-rnnt: pip install warp_rnnt (Refer here if you have problems installing warp-rnnt)
hydra: pip install hydra-core --upgrade (Refer here if you have problems installing hydra)

Install from source

Currently we only support installation from source code using setuptools. Check out the source code and run the following command:

pip install -e .

Get Started

We use Hydra to control all the training configurations. If you are not familiar with Hydra we recommend visiting the Hydra website. Generally, Hydra is an open-source framework that simplifies the development of research applications by providing the ability to create a hierarchical configuration dynamically.

Preparing KsponSpeech Dataset (LibriSpeech is also supported)

Download from here or refer to the following to preprocess.

KsponSpeech : Check this page
LibriSpeech : Check this page

Training KsponSpeech Dataset

You can choose from several models and training options. There are many other training options, so look carefully and execute the following command:

Deep Speech 2 Training

python ./bin/main.py model=ds2 train=ds2_train train.
