IndicTrans
indicTranslate v1 - Machine Translation for 11 Indic languages. For latest v2, check: https://github.com/AI4Bharat/IndicTrans2
🚩 NOTE 🚩 IndicTrans2 is now available. It supports 22 Indian languages and has better translation quality than IndicTrans1. We recommend using IndicTrans2.
IndicTrans is a Transformer-4x (~434M parameters) multilingual NMT model trained on the Samanantar dataset, which was the largest publicly available parallel corpora collection for Indic languages at the time of writing (14 April 2021). It is a single-script model, i.e., we convert all the Indic data to the Devanagari script. This allows for better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary between Indic languages, and allows using a smaller subword vocabulary. We currently release two models, Indic to English and English to Indic, and support the following 11 Indic languages:
| <!-- --> | <!-- --> | <!-- --> | <!-- --> |
| ------------- | -------------- | ------------ | ----------- |
| Assamese (as) | Hindi (hi) | Marathi (mr) | Tamil (ta) |
| Bengali (bn) | Kannada (kn) | Odia (or) | Telugu (te) |
| Gujarati (gu) | Malayalam (ml) | Punjabi (pa) | |
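To make the single-script idea concrete, here is a minimal sketch of mapping text from another Indic script to Devanagari. It relies on the fact that the major Indic scripts occupy parallel 128-code-point blocks in Unicode, so a character can be shifted by a fixed offset. The function name `to_devanagari` and the table `SCRIPT_BLOCK_START` are hypothetical names chosen for illustration. This is not necessarily how the IndicTrans preprocessing is implemented, and it ignores script-specific exceptions (for example, characters that have no Devanagari counterpart at the same offset).

```python
# Start of the Unicode block for the script used by each supported language.
# Each block is 128 code points wide and follows a broadly parallel layout.
SCRIPT_BLOCK_START = {
    "hi": 0x0900,  # Devanagari
    "mr": 0x0900,  # Devanagari
    "bn": 0x0980,  # Bengali
    "as": 0x0980,  # Assamese is written in the Bengali block
    "pa": 0x0A00,  # Gurmukhi
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Odia
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80


def to_devanagari(text: str, lang: str) -> str:
    """Shift characters of `lang`'s script into the Devanagari block.

    Characters outside the source script's block (digits, Latin letters,
    punctuation, spaces) are passed through unchanged.
    """
    start = SCRIPT_BLOCK_START[lang]
    offset = DEVANAGARI_START - start
    converted = []
    for ch in text:
        code_point = ord(ch)
        if start <= code_point < start + BLOCK_SIZE:
            converted.append(chr(code_point + offset))
        else:
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    # Bengali "কলকাতা" becomes Devanagari "कलकाता".
    print(to_devanagari("কলকাতা", "bn"))
```

After such a conversion, cognates in different languages tend to share the same character sequences, which is what lets a single subword vocabulary cover all 11 languages.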
Benchmarks
We evaluate the IndicTrans model on the WAT2021, WAT2020, WMT (2014, 2019, 2020), UFAL, PMI (a subset of the PMIndia dataset created by us for Assamese) and FLORES benchmarks. It outperforms all publicly available open-source models. It also outperforms commercial systems such as Google Translate and Bing Translate on most datasets and performs competitively on FLORES. Here are the results that we obtain:
<!-- <style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-9wq8{border-color:inherit;text-align:center;vertical-align:middle} </style> --> <table class="tg"> <thead> <tr> <th class="tg-9wq8"></th> <th class="tg-9wq8" colspan="10">WAT2021</th> <th class="tg-9wq8" colspan="7">WAT2020</th> <th class="tg-9wq8" colspan="3">WMT</th> <th class="tg-9wq8">UFAL</th> <th class="tg-9wq8">PMI</th> <th class="tg-9wq8" colspan="11">FLORES-101</th> </tr> </thead> <tbody> <tr> <td class="tg-9wq8"></td> <td class="tg-9wq8">bn</td> <td class="tg-9wq8">gu</td> <td class="tg-9wq8">hi</td> <td class="tg-9wq8">kn</td> <td class="tg-9wq8">ml</td> <td class="tg-9wq8">mr</td> <td class="tg-9wq8">or</td> <td class="tg-9wq8">pa</td> <td class="tg-9wq8">ta</td> <td class="tg-9wq8">te</td> <td class="tg-9wq8">bn</td> <td class="tg-9wq8">gu</td> <td class="tg-9wq8">hi</td> <td class="tg-9wq8">ml</td> <td class="tg-9wq8">mr</td> <td class="tg-9wq8">ta</td> <td class="tg-9wq8">te</td> <td class="tg-9wq8">hi</td> <td class="tg-9wq8">gu</td> <td class="tg-9wq8">ta</td> <td class="tg-9wq8">ta</td> <td class="tg-9wq8">as</td> <td class="tg-9wq8">as</td> <td class="tg-9wq8">bn</td> <td class="tg-9wq8">gu</td> <td class="tg-9wq8">hi</td> <td class="tg-9wq8">kn</td> <td class="tg-9wq8">ml</td> <td class="tg-9wq8">mr</td> <td class="tg-9wq8">or</td> <td class="tg-9wq8">pa</td> <td class="tg-9wq8">ta</td> <td class="tg-9wq8">te</td> </tr> <tr> <td class="tg-9wq8">IN-EN</td> <td class="tg-9wq8">29.6</td> <td class="tg-9wq8">40.3</td> <td class="tg-9wq8">43.9</td> <td class="tg-9wq8">36.4</td> <td class="tg-9wq8">34.6</td> <td 
class="tg-9wq8">33.5</td> <td class="tg-9wq8">34.4</td> <td class="tg-9wq8">43.2</td> <td class="tg-9wq8">33.2</td> <td class="tg-9wq8">36.2</td> <td class="tg-9wq8">20.0</td> <td class="tg-9wq8">24.1</td> <td class="tg-9wq8">23.6</td> <td class="tg-9wq8">20.4</td> <td class="tg-9wq8">20.4</td> <td class="tg-9wq8">18.3</td> <td class="tg-9wq8">18.5</td> <td class="tg-9wq8">29.7</td> <td class="tg-9wq8">25.1</td> <td class="tg-9wq8">24.1</td> <td class="tg-9wq8">30.2</td> <td class="tg-9wq8">29.9</td> <td class="tg-9wq8">23.3</td> <td class="tg-9wq8">32.2</td> <td class="tg-9wq8">34.3</td> <td class="tg-9wq8">37.9</td> <td class="tg-9wq8">28.8</td> <td class="tg-9wq8">31.7</td> <td class="tg-9wq8">30.8</td> <td class="tg-9wq8">30.1</td> <td class="tg-9wq8">35.8</td> <td class="tg-9wq8">28.6</td> <td class="tg-9wq8">33.5</td> </tr> <tr> <td class="tg-9wq8">EN-IN</td> <td class="tg-9wq8">15.3</td> <td class="tg-9wq8">25.6</td> <td class="tg-9wq8">38.6</td> <td class="tg-9wq8">19.1</td> <td class="tg-9wq8">14.7</td> <td class="tg-9wq8">20.1</td> <td class="tg-9wq8">18.9</td> <td class="tg-9wq8">33.1</td> <td class="tg-9wq8">13.5</td> <td class="tg-9wq8">14.1</td> <td class="tg-9wq8">11.4</td> <td class="tg-9wq8">15.3</td> <td class="tg-9wq8">20.0</td> <td class="tg-9wq8">7.2</td> <td class="tg-9wq8">12.7</td> <td class="tg-9wq8">6.2</td> <td class="tg-9wq8">7.6</td> <td class="tg-9wq8">25.5</td> <td class="tg-9wq8">17.2</td> <td class="tg-9wq8">9.9</td> <td class="tg-9wq8">10.9</td> <td class="tg-9wq8">11.6</td> <td class="tg-9wq8">6.9</td> <td class="tg-9wq8">20.3</td> <td class="tg-9wq8">22.6</td> <td class="tg-9wq8">34.5</td> <td class="tg-9wq8">18.9</td> <td class="tg-9wq8">16.3</td> <td class="tg-9wq8">16.1</td> <td class="tg-9wq8">13.9</td> <td class="tg-9wq8">26.9</td> <td class="tg-9wq8">16.3</td> <td class="tg-9wq8">22.0</td> </tr> </tbody> </table>
Updates
<details><summary>Click to expand </summary>
21 June 2022
- Add more documentation on hosted API usage
18 December 2021
- Tutorials updated with latest model links
26 November 2021
- v0.3 models are now available for download
27 June 2021
- Updated links for indic to indic model
- Add more comments to training scripts
- Add link to [Samanantar Video](https://youtu.be/QwYPOd1eBtQ?t=383)
- Add folder structure in readme
- Add python wrapper for model inference
09 June 2021
- Updated links for models
- Added Indic to Indic model
09 May 2021
- Added fix for finetuning on datasets where some lang pairs are not present. Previously, the script assumed the finetuning dataset would have data for all 11 Indic lang pairs.
- Added colab notebook for finetuning instructions
</details>
Table of contents
- Updates
- Table of contents
- Resources
- Running Inference
- Training model
- Finetuning model on your data
- License
- Contributors
- Contact
- Acknowledgements
Resources
Try out model online (Huggingface spaces)
Download model
Indic to English: v0.3
English to Indic: v0.3
Indic to Indic: v0.3
Mirror links for the IndicTrans models
STS Benchmark
Download the human annotations for STS benchmark here
Using hosted APIs
Try out our models at IndicTrans Demos
<!-- <details><summary>Click to expand </summary> Please visit [API documentation](http://216.48.181.177:5050/docs#) to read more about the available API endpoints/methods you can use. #### Sample screenshot of translate_sentence POST request Go to [API documentation](http://216.48.181.177:5050/docs#), scroll to translate_sentence POST request endpoint and click "Try it out" button. <br> <p align="left"> <img src="./sample_images/translate_try_it_out.png" width=50% height=50% /> </p> <br> To try english to tamil translation, set the source language to "en