Openspeech
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
<img src="https://raw.githubusercontent.com/openspeech-team/openspeech/55e50cb9b3cc3e7a6dfddcd33e6e698cca3dae3b/docs/img/logo.png" height=20> OpenSpeech provides reference implementations of various ASR modeling papers, along with recipes in three languages for automatic speech recognition tasks. We aim to make ASR technology easier to use for everyone.
<img src="https://raw.githubusercontent.com/openspeech-team/openspeech/55e50cb9b3cc3e7a6dfddcd33e6e698cca3dae3b/docs/img/logo.png" height=20> OpenSpeech is backed by two powerful libraries, PyTorch-Lightning and Hydra, which together provide features such as multi-GPU and TPU training, mixed precision, and hierarchical configuration management.
<img src="https://raw.githubusercontent.com/openspeech-team/openspeech/55e50cb9b3cc3e7a6dfddcd33e6e698cca3dae3b/docs/img/logo.png" height=20> We appreciate any kind of feedback or contribution. Feel free to start with small contributions such as bug fixes or documentation improvements. For major contributions and new features, please discuss with the collaborators in the corresponding issues.
What's New
- May 2022 openspeech 0.4.0 released
- Aug 2021 Added Smart Batching
- Jul 2021 openspeech 0.3.0 released
- Jul 2021 Added transducer beam search logic
- Jun 2021 Added ContextNet
- Jun 2021 Added language model training pipeline
- Jun 2021 openspeech 0.1.0 released
What is OpenSpeech?
OpenSpeech is a framework for building end-to-end speech recognizers. End-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in neural network-based speech recognition that offers multiple benefits. Traditional "hybrid" ASR systems, which consist of an acoustic model, a language model, and a pronunciation model, require separate training of these components, each of which can be complex.
For example, training of an acoustic model is a multi-stage process of model training and time alignment between the speech acoustic feature sequence and output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline with models that operate at low audio frame rates. This reduces the training time, decoding time, and allows joint optimization with downstream processing such as natural language understanding.
Because of these advantages, many open-source projects for end-to-end speech recognition have emerged. However, many of them are built directly on plain PyTorch or TensorFlow, which makes it difficult to use capabilities such as mixed precision, multi-node training, and TPU training. With frameworks such as PyTorch-Lightning, these features become easy to use, so we created a speech recognition framework that adopts PyTorch-Lightning and Hydra.
Why should I use OpenSpeech?
- PyTorch-Lightning-based framework.
- Various features: mixed precision, multi-node training, TPU training, etc.
- Models become hardware-agnostic.
- Fewer mistakes, because Lightning handles the tricky engineering.
- Lightning has dozens of integrations with popular machine learning tools.
- Easy-to-experiment with the famous ASR models.
- Supports 20+ models and is continuously updated.
- Low barrier to entry for educators and practitioners.
- Save time for researchers who want to conduct various experiments.
- Provides recipes for the most widely used languages: English, Chinese, and Korean.
- LibriSpeech - 1,000 hours of English speech, the most widely used dataset in ASR research.
- AISHELL-1 - 170 hours of Chinese Mandarin speech corpus.
- KsponSpeech - 1,000 hours of Korean open-domain dialogue speech.
- Easily customize a model or a new dataset to your needs:
- The default hparams of the supported models are provided but can be easily adjusted.
- Easily create a custom model by combining modules that are already provided.
- If you want to use a new dataset, you only need to define `pl.LightningDataModule` and `Tokenizer` classes.
- Audio processing
- Representative audio features such as Spectrogram, Mel-Spectrogram, Filter-Bank, and MFCC can be used easily.
- Provides a variety of augmentation, including SpecAugment, Noise Injection, and Audio Joining.
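The custom-dataset bullets above say a new dataset only needs `pl.LightningDataModule` and `Tokenizer` classes. The tokenizer half can be sketched as a plain character-level vocabulary; the class and method names below are illustrative assumptions, not OpenSpeech's actual interface:

```python
# Hypothetical character-level tokenizer sketch. OpenSpeech's real
# Tokenizer interface may differ; names here are illustrative only.


class CharTokenizer:
    """Maps characters to integer ids, with blank/sos/eos specials."""

    def __init__(self, corpus):
        # Reserve low ids for special tokens commonly needed by ASR models.
        specials = ["<blank>", "<sos>", "<eos>"]
        chars = sorted({ch for text in corpus for ch in text})
        self.vocab = specials + chars
        self.char_to_id = {ch: i for i, ch in enumerate(self.vocab)}

    def encode(self, text):
        # Assumes every character of `text` appeared in the training corpus.
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        # Drop special tokens when mapping ids back to text.
        return "".join(
            self.vocab[i] for i in ids
            if self.vocab[i] not in ("<blank>", "<sos>", "<eos>")
        )
```

A `CharTokenizer(["hello world"])` built this way round-trips `encode`/`decode` for any text drawn from the corpus characters; the matching `pl.LightningDataModule` would then wrap dataset download, split, and `DataLoader` creation.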
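As an illustration of the augmentation bullet, SpecAugment's frequency and time masking can be sketched in a few lines of plain PyTorch. OpenSpeech ships its own implementation; the function name and parameters below are assumptions made for this sketch:

```python
# Minimal sketch of SpecAugment-style masking in plain PyTorch.
# Not OpenSpeech's API; parameter names are illustrative.
import torch


def spec_augment(spectrogram, freq_mask_width=8, time_mask_width=20):
    """Zero out one random frequency band and one random time band.

    spectrogram: tensor of shape (n_mels, n_frames).
    """
    augmented = spectrogram.clone()
    n_mels, n_frames = augmented.shape

    # Frequency mask: zero a random band of consecutive mel bins.
    f = torch.randint(0, freq_mask_width + 1, (1,)).item()
    f0 = torch.randint(0, n_mels - f + 1, (1,)).item()
    augmented[f0:f0 + f, :] = 0.0

    # Time mask: zero a random band of consecutive frames.
    t = torch.randint(0, time_mask_width + 1, (1,)).item()
    t0 = torch.randint(0, n_frames - t + 1, (1,)).item()
    augmented[:, t0:t0 + t] = 0.0
    return augmented
```

Masking operates on the feature tensor after extraction, so it composes with any of the Spectrogram, Mel-Spectrogram, Filter-Bank, or MFCC front-ends mentioned above.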
Why shouldn't I use OpenSpeech?
- This framework provides code for training ASR models, but does not provide APIs for pre-trained models.
- This framework does not provide pre-trained models.
Model architectures
We support all the models below. Note that the core concepts of each model are implemented faithfully, but implementation details may vary from the original papers.
- DeepSpeech2 (from Baidu Research) released with paper Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, by Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu.
- RNN-Transducer (from University of Toronto) released with paper Sequence Transduction with Recurrent Neural Networks, by Alex Graves.
- LSTM Language Model (from RWTH Aachen University) released with paper LSTM Neural Networks for Language Modeling, by Martin Sundermeyer, Ralf Schluter, and Hermann Ney.
- Listen Attend Spell (from Carnegie Mellon University and Google Brain) released with paper Listen, Attend and Spell, by William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals.
- Location-aware attention based Listen Attend Spell (from University of Wrocław and Jacobs University and Universite de Montreal) released with paper Attention-Based Models for Speech Recognition, by Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio.
- Joint CTC-Attention based Listen Attend Spell (from Mitsubishi Electric Research Laboratories and Carnegie Mellon University) released with paper Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning, by Suyoun Kim, Takaaki Hori, Shinji Watanabe.
- Deep CNN Encoder with Joint CTC-Attention Listen Attend Spell (from Mitsubishi Electric Research Laboratories and Massachusetts Institute of Technology and Carnegie Mellon University) released with paper [Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM](https://arxiv.org/abs/
