Athena
an open-source implementation of a sequence-to-sequence based speech processing engine
Athena is an open-source implementation of an end-to-end speech processing engine. Our vision is to empower both industrial applications and academic research on end-to-end models for speech processing. To make speech processing available to everyone, we also release example implementations and recipes on open-source datasets for various tasks (Automatic Speech Recognition, Speech Synthesis, Voice Activity Detection, Wake Word Spotting, etc.).
All of our models are implemented in TensorFlow>=2.0.1. For ease of use, we provide a Kaldi-free, pythonic feature extractor with Athena_transform.
Key Features
- Hybrid Attention/CTC based end-to-end and streaming methods (ASR)
- Text-to-Speech (FastSpeech/FastSpeech2/Transformer)
- Voice Activity Detection (VAD)
- Key Word Spotting with end-to-end and streaming methods (KWS)
- Unsupervised pre-training for ASR (MPC)
- Multi-GPU training on one machine or across multiple machines with Horovod
- WFST creation and WFST-based decoding with C++
- Deployment with TensorFlow C++ (local server)
Versions
- Athena v2.0 (latest, current master version)
- Athena v1.0
What's new
- 2022/06/01 The Athena-model-zoo is built.
- 2022/05/13 The runtime supports C++ decoding (E2E, streaming, WFST, prefix beam search, etc.) and deployment
- 2022/05/10 Functions for adding noise and RIR (room impulse response) to clean audio are added to the transform
- 2022/04/25 KWS functions are added to Athena-v2.0
- 2022/04/07 The AV-Transformer for ASR and the MISP2021 task2 example are added to Athena-v2.0
- 2022/04/05 The Transformer-U2 and examples are added to Athena-v2.0
- 2022/03/20 The CTC alignment function is added to Athena-v2.0
- 2022/03/10 VAD functions and examples are added to Athena-v2.0
- 2022/03/09 FastSpeech2 is added to Athena-v2.0
- 2022/03/02 Speaker embedding is added to Athena-v2.0
- 2021/11/27 The Conformer-CTC is added to Athena-v2.0
- 2021/10/31 ASR performance updates for the E2E ASR examples (AISHELL-1, HKUST, GigaSpeech, LibriSpeech)
- 2021/09/01 Beam search and attention rescoring are updated
- 2021/08/18 The Batchbins function is added to Athena-v2.0
- 2021/06/01 Language model updates; the Transformer model structure is added
- 2021/05/19 Horovod parameters are adjusted to speed up training
- 2021/05/07 Training parameters of the E2E ASR models are adjusted
- 2021/04/09 A SpecAugment bug is fixed
Discussion & Communication
We have set up a WeChat group for discussion. If you want to join, please scan the QR code below and the administrator will invite you to the group.
<img src="https://github.com/JianweiSun007/athena/blob/athena-v0.2/docs/image/img1.png" width="250px">

1) Table of Contents
2) Installation
Athena can be installed with either TensorFlow 2.3 or TensorFlow 2.8.
- Athena-v2.0 with TensorFlow 2.3:

```bash
pip install tensorflow-gpu==2.3.0
pip install -r requirements.txt
python setup.py bdist_wheel sdist
python -m pip install --ignore-installed dist/athena-2.0*.whl
```
- Athena-v2.0 with TensorFlow 2.8:

```bash
pip install tensorflow-gpu==2.8.0
pip install -r requirements.txt
python setup.tf2.8.py bdist_wheel sdist
python -m pip install --ignore-installed dist/athena-2.0*.whl
```
3) Results
3.1) ASR
The performance of a selection of models is shown below:
<details><summary>expand</summary><div>

| Model | LM | HKUST | AISHELL1 Dataset | | LibriSpeech Dataset | | | | Giga | | MISP | Model link |
|:-----------------:|:---:|:-----:|:--------:|:----:|:-----------:|:----------:|:-----------:|:-----------:|:----:|:-----:|:-----:|------------|
| | | CER% | CER% | | WER% | | | | WER% | | CER% | |
| | | dev | dev | test | dev clean | dev other | test clean | test other | dev | test | - | |
| transformer | w | 21.64 | - | 5.13 | - | - | - | - | - | 11.70 | - | |
| | w/o | 21.87 | - | 5.22 | 3.84 | - | 3.96 | 9.70 | - | - | - | |
| transformer-u2 | w | - | - | - | - | - | - | - | - | - | - | |
| | w/o | - | - | 6.38 | - | - | - | - | - | - | - | |
| conformer | w | 21.33 | - | 4.95 | - | - | - | - | - | - | 50.50 | |
| | w/o | 21.59 | - | 5.04 | - | - | - | - | - | - | - | |
| conformer-u2 | w | - | - | - | - | - | - | - | - | - | - | |
| | w/o | - | - | 6.29 | - | - | - | - | - | - | - | |
| conformer-CTC | w | - | - | - | - | - | - | - | - | - | - | |
| | w/o | - | - | 6.60 | - | - | - | - | - | - | - | |
</div></details>

To compare with other published results, see wer_are_we.md.
For more details on U2, see the ASR readme.
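The WER/CER figures above are standard edit-distance metrics: the minimum number of word (or character) substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. As an illustration only (this is a minimal sketch, not Athena's scoring script):

```python
# Word Error Rate via Levenshtein distance over word sequences.
# Minimal illustrative sketch; real scoring tools also report the
# substitution/insertion/deletion breakdown.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion against a 6-word reference:
print(f"{wer('the cat sat on the mat', 'the cat sat on mat') * 100:.2f}%")  # 16.67%
```

CER is the same computation over characters instead of words.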
3.2) TTS
Currently supported TTS tasks are LJSpeech and the Chinese Standard Mandarin Speech Corpus (data_baker). Supported models are shown in the table below. (Note: HiFiGAN is trained based on TensorflowTTS.)
The performance of Athena-TTS is shown below:
<details><summary>expand</summary><div>

Training Data | Acoustic Model | Vocoder | Audio Demo
:---------: | :-------------: | :-------------: | :------------:
data_baker | Tacotron2 | GL | audio_demo
data_baker | Transformer_tts | GL | audio_demo
data_baker | Fastspeech | GL | audio_demo
data_baker | Fastspeech2 | GL | audio_demo
data_baker | Fastspeech2 | HiFiGAN | audio_demo
ljspeech | Tacotron2 | GL | audio_demo
</div></details>

For more details, see the TTS readme.
3.3) VAD
<details><summary>expand</summary><div>

Task | Model Name | Training Data | Input Segment | Frame Error Rate
:-----------: | :------: | :------------: | :-----: | :----------:
VAD | DNN | Google Speech Commands Dataset V2 | 0.21s | 8.49%
VAD | MarbleNet | Google Speech Commands Dataset V2 | 0.63s | 2.50%
</div></details>

For more details, see the VAD readme.
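The Frame Error Rate reported above is the fraction of frames whose predicted speech/non-speech label disagrees with the reference labeling. A minimal sketch of the metric (illustrative only, not Athena's evaluation code):

```python
# Frame Error Rate for VAD: per-frame 0/1 labels (1 = speech) are
# compared against a reference labeling of the same length.

def frame_error_rate(reference, predicted):
    assert len(reference) == len(predicted)
    errors = sum(r != p for r, p in zip(reference, predicted))
    return errors / len(reference)

ref  = [0, 0, 1, 1, 1, 1, 0, 0]  # reference: speech in frames 2-5
pred = [0, 1, 1, 1, 1, 0, 0, 0]  # prediction starts and ends one frame early
print(f"{frame_error_rate(ref, pred) * 100:.2f}%")  # 25.00% (2 of 8 frames wrong)
```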
3.4) KWS
Performance on the MISP2021 task1 dataset is shown below:
<details><summary>expand</summary><div>

| KWS Type | Model | Model Detail | Data | Loss | Dev | Eval |
|:---------:|:--------------:|:---------------------------:|:--------------------:|:--------:|:-----:|:-----:|
| Streaming | CNN-DNN | 2 Conv+3 Dense | 60h pos+200h neg | CE | 0.314 | / |
| E2E | CRNN | 2 Conv+2 biGRU | 60h pos+200h neg | CE | 0.209 | / |
| E2E | CRNN | Conv+5 biLSTM | 60h pos+200h neg | CE | 0.186 | / |
| E2E | CRNN | Conv+5 biLSTM | 170h pos+530h neg | CE | 0.178 | / |
| E2E | A-Transformer | Conv+4 encoders+1 Dense | 170h pos+530h neg | CE&Focal | 0.109 | 0.106 |
| E2E | A-Conformer | Conv+4 encoders+1 Dense | 170h pos+530h neg | CE&Focal | 0.105 | 0.116 |
| E2E | AV-Transformer | 2 Conv+4 AV-encoders+1 Dense | A(170h pos+530h neg)+V(Far 124h) | CE | 0.132 | / |
</div></details>

For more details, see the KWS readme.
3.5) CTC-Alignment
The CTC alignment result of one utterance is shown below:
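The idea behind CTC alignment can be illustrated with the simplest variant, best-path (greedy) decoding: take the argmax label in each frame, then collapse consecutive repeats and drop blanks. This is a minimal sketch of the concept, not Athena's alignment implementation:

```python
# CTC best-path alignment: per-frame argmax gives a frame-level
# alignment; collapsing repeats and removing blanks yields the
# label sequence. Blank is assumed to be index 0 here.

BLANK = 0

def ctc_collapse(frame_labels, blank=BLANK):
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

def best_path(logits):
    # logits: frames x vocab scores (lists); greedy argmax per frame
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    return path, ctc_collapse(path)

logits = [
    [0.6, 0.3, 0.1],  # blank
    [0.1, 0.8, 0.1],  # label 1
    [0.1, 0.8, 0.1],  # label 1 (repeat, collapsed)
    [0.7, 0.2, 0.1],  # blank
    [0.1, 0.2, 0.7],  # label 2
]
path, labels = best_path(logits)
print(path)    # frame-level alignment: [0, 1, 1, 0, 2]
print(labels)  # collapsed label sequence: [1, 2]
```

The frame-level path is what an alignment plot visualizes: which label each acoustic frame is assigned to.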
