MooER
MooER: Moore-threads Open Omni model for speech-to-speech intERaction. MooER-omni is a series of end-to-end speech interaction models, together with training and inference code, covering (but not limited to) end-to-end speech interaction, end-to-end speech translation, and speech recognition.
🔥 Updates
- 2024/10/24: We released the 1.5B speech translation model MooER-MTL-5K-1.5B.
- 2024/10/24: 🎉🎉🎉 We released the new Omni (MooER-omni-v1) and Speech-to-Speech Translation (MooER-S2ST-v1) models, which support Mandarin input. The Omni model can hear, think and talk to you! See our demo here.
- 2024/09/03: We open-sourced the training and inference code for MooER! You can follow this tutorial to train your own audio understanding model and tasks, or fine-tune based on our 80k-hour model.
- 2024/08/27: We released MooER-80K-v2, trained on 80K hours of data. The performance of the new model can be found below. Currently, it only supports the speech recognition task; the speech translation and multi-task models will be released soon.
- 2024/08/09: We released a Gradio demo running on the Moore Threads S4000.
- 2024/08/09: We released the inference code and the pretrained speech recognition and speech translation (zh->en) models trained on 5,000 hours of data.
- 2024/08/09: We released the MooER v0.1 technical report on arXiv.
📝 Roadmap
- [x] Technical report
- [x] Inference code and pretrained ASR/AST models using 5k hours of data
- [x] Pretrained ASR model using 80k hours of data
- [x] Training code for MooER
- [x] LLM-based speech-to-speech translation (S2ST, Mandarin Chinese to English)
- [x] GPT-4o-like audio-LLM supporting chat using speech
- [x] 1.5B AST models
- [ ] Training code and technical report about our new Omni model
- [ ] Omni audio-LLM that supports multi-turn conversation
- [ ] Pretrained AST and multi-task models using 80k hours of data
- [ ] LLM-based timbre-preserving Speech-to-speech translation
📖 Introduction
🔥 Hi there! We have updated our Omni model with the ability to listen, think and talk! Check our examples here. Please refer to the Download, Inference and Gradio Demo sections for the model usage.
The training procedure of our model is demonstrated in the following figure. We will release the training code and the technical report soon!
<br> <p align="center"> <img src="assets/framework_omni.png"/> <p> <br>We introduce MooER (摩耳): an LLM-based speech recognition and translation model developed by Moore Threads. With the MooER framework, you can transcribe speech into text (automatic speech recognition, ASR) and translate speech into other languages (automatic speech translation, AST) in an LLM-based end-to-end manner. Some evaluation results of MooER are presented in the following section. More detailed experiments, along with our insights into model configurations, training strategies, etc., are provided in our technical report.
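Conceptually, the LLM-based end-to-end pattern described above couples a speech encoder to an LLM through a small adapter that downsamples the frame-level features before they enter the LLM embedding space. The sketch below illustrates only the frame-stacking idea behind such an adapter; the function name, stride, and dimensions are illustrative assumptions and not MooER's actual code.

```python
def stack_frames(feats, stride=4):
    """Concatenate every `stride` consecutive encoder frames into one vector.

    This shortens the audio sequence by `stride`x, reducing the number of
    tokens the LLM must attend over; a linear projection (omitted here)
    would then map each stacked vector into the LLM embedding dimension.
    """
    usable = len(feats) - len(feats) % stride  # drop tail frames that don't fill a group
    return [
        sum((feats[t + k] for k in range(stride)), [])  # concatenate `stride` frames
        for t in range(0, usable, stride)
    ]

# Dummy encoder output: 100 frames of 512-dim features
feats = [[float(t)] * 512 for t in range(100)]
stacked = stack_frames(feats)
print(len(stacked), len(stacked[0]))  # 25 2048
```

With a stride of 4, a 100-frame utterance becomes 25 stacked vectors of dimension 2048, which a trainable projection layer would map to the LLM's hidden size.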
We proudly highlight that MooER was developed using Moore Threads S4000 GPUs. To the best of our knowledge, this is the first LLM-based speech model both trained and deployed entirely on domestic GPUs.
<br> <p align="center"> <img src="assets/framework.png" width="600"/> <p> <br>

🥊 Evaluation Results
We present the training data and the evaluation results below. For more comprehensive information, please refer to our report.
Training data
We utilize 5,000 hours of speech data (MT5K) to train our basic MooER-5K model. The data sources include:
| Dataset | Duration |
|---------------|----------|
| aishell2 | 137h |
| librispeech | 131h |
| multi_cn | 100h |
| wenetspeech | 1361h |
| in-house data | 3274h |
Note that the data from the open-source datasets were randomly selected from their full training sets. The in-house speech data, collected internally without transcription, were transcribed using a third-party ASR service.
Since all the above datasets were originally collected only for the speech recognition task, no translation labels are available. We leveraged a third-party machine translation service to generate pseudo-labels for translation. No data filtering techniques were applied.
At this moment, we are also developing a new model trained with 80,000 hours of speech data.
Speech Recognition
The performance of speech recognition is evaluated using word error rate (WER) and character error rate (CER).
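Both WER and CER are edit-distance metrics: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, over words for WER and over characters for CER (the latter is standard for Chinese). A minimal, self-contained sketch of the computation (not the scoring script used in this repo):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                     # delete all i tokens
    for j in range(n + 1):
        dp[0][j] = j                     # insert all j tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[m][n]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    ref = reference.replace(" ", "")
    return edit_distance(ref, hypothesis.replace(" ", "")) / len(ref)

print(wer("the cat sat", "the cat sit"))  # 0.333... (1 substitution / 3 words)
print(cer("你好世界", "你好时界"))          # 0.25 (1 substitution / 4 characters)
```

The error rates in the tables below are these quantities expressed as percentages.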
<table> <tr> <th>Language</th> <th>Testset</th> <th>Paraformer-large</th> <th>SenseVoice-small</th> <th>Qwen-audio</th> <th>Whisper-large-v3</th> <th>SeamlessM4T-v2</th> <th>MooER-5K</th> <th>MooER-80K</th> <th>MooER-80K-v2</th> </tr> <tr> <td rowspan="7">Chinese</td> <td>aishell1</td> <td>1.93</td> <td>3.03</td> <td>1.43</td> <td>7.86</td> <td>4.09</td> <td>1.93</td> <td>1.25</td> <td>1.00</td> </tr> <tr> <td>aishell2_ios</td> <td>2.85</td> <td>3.79</td> <td>3.57</td> <td>5.38</td> <td>4.81</td> <td>3.17</td> <td>2.67</td> <td>2.62</td> </tr> <tr> <td>test_magicdata</td> <td>3.66</td> <td>3.81</td> <td>5.31</td> <td>8.36</td> <td>9.69</td> <td>3.48</td> <td>2.52</td> <td>2.17</td> </tr> <tr> <td>test_thchs</td> <td>3.99</td> <td>5.17</td> <td>4.86</td> <td>9.06</td> <td>7.14</td> <td>4.11</td> <td>3.14</td> <td>3.00</td> </tr> <tr> <td>fleurs cmn_dev</td> <td>5.56</td> <td>6.39</td> <td>10.54</td> <td>4.54</td> <td>7.12</td> <td>5.81</td> <td>5.23</td> <td>5.15</td> </tr> <tr> <td>fleurs cmn_test</td> <td>6.92</td> <td>7.36</td> <td>11.07</td> <td>5.24</td> <td>7.66</td> <td>6.77</td> <td>6.18</td> <td>6.14</td> </tr> <tr> <td>average</td> <td><strong>4.15</strong></td> <td><strong>4.93</strong></td> <td><strong>6.13</strong></td> <td><strong>6.74</strong></td> <td><strong>6.75</strong></td> <td><strong>4.21</strong></td> <td><strong>3.50</strong></td> <td><strong>3.35</strong></td> </tr> <tr> <td rowspan="7">English</td> <td>librispeech test_clean</td> <td>14.15</td> <td>4.07</td> <td>2.15</td> <td>3.42</td> <td>2.77</td> <td>7.78</td> <td>4.11</td> <td>3.57</td> </tr> <tr> <td>librispeech test_other</td> <td>22.99</td> <td>8.26</td> <td>4.68</td> <td>5.62</td> <td>5.25</td> <td>15.25</td> <td>9.99</td> <td>9.09</td> </tr> <tr> <td>fleurs eng_dev</td> <td>24.93</td> <td>12.92</td> <td>22.53</td> <td>11.63</td> <td>11.36</td> <td>18.89</td> <td>13.32</td> <td>13.12</td> </tr> <tr> <td>fleurs eng_test</td> <td>26.81</td> <td>13.41</td> <td>22.51</td> <td>12.57</td> 
<td>11.82</td> <td>20.41</td> <td>14.97</td> <td>14.74</td> </tr> <tr> <td>gigaspeech dev</td> <td>24.23</td> <td>19.44</td> <td>12.96</td> <td>19.18</td> <td>28.01</td> <td>23.46</td> <td>16.92</td> <td>17.34</td> </tr> <tr> <td>gigaspeech test</td> <td>23.07</td> <td>16.65</td> <td>13.