IndicXlit
Transliteration models for 21 Indic languages
IndicXlit is a transformer-based multilingual transliteration model (~11M parameters) that supports 21 Indic languages for Roman-to-native-script and native-to-Roman-script conversion. It is trained on the Aksharantar dataset, which at the time of writing (5 May 2022) is the largest publicly available parallel corpus, containing 26 million word pairs spanning 20 Indic languages. IndicXlit supports the following 21 Indic languages:
<!-- list of languages IndicXlit supports -->
| <!-- --> | <!-- --> | <!-- --> | <!-- --> | <!-- --> | <!-- --> |
| -------------- | -------------- | -------------- | --------------- | -------------- | ------------- |
| Assamese (asm) | Bengali (ben) | Bodo (brx) | Gujarati (guj) | Hindi (hin) | Kannada (kan) |
| Kashmiri (kas) | Konkani (gom) | Maithili (mai) | Malayalam (mal) | Manipuri (mni) | Marathi (mar) |
| Nepali (nep) | Oriya (ori) | Panjabi (pan) | Sanskrit (san) | Sindhi (snd) | Sinhala (sin) |
| Tamil (tam) | Telugu (tel) | Urdu (urd) | | | |
Evaluation Results
IndicXlit is evaluated on the Dakshina and Aksharantar benchmarks. It achieves state-of-the-art results on the Dakshina test set and provides baseline results on the new Aksharantar test set. The Top-1 results are summarized below. For more details, refer to our paper.
En-Indic Results
| Languages | asm | ben | brx | guj | hin | kan | kas | kok | mai | mal | mni | mar | nep | ori | pan | san | tam | tel | urd |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dakshina | - | 55.49 | - | 62.02 | 60.56 | 77.18 | - | - | - | 63.56 | - | 64.85 | - | - | 47.24 | - | 68.10 | 73.38 | 42.12 |
| Aksharantar (native words) | 60.27 | 61.70 | 70.79 | 61.89 | 55.59 | 76.18 | 28.76 | 63.06 | 72.06 | 64.73 | 83.19 | 63.72 | 80.25 | 58.90 | 40.27 | 78.63 | 69.78 | 84.69 | 48.37 |
| Aksharantar (named entities) | 38.62 | 37.12 | 30.32 | 48.89 | 58.87 | 49.92 | 20.23 | 34.36 | 42.82 | 33.93 | 44.12 | 53.57 | 52.67 | 30.63 | 36.08 | 24.06 | 42.12 | 51.82 | 47.77 |

The Dakshina average over the 10 languages reported above is 61.45.
Indic-En Results
| Languages | asm | ben | brx | guj | hin | kan | kas | kok | mai | mal | mni | mar | nep | ori | pan | san | tam | tel | urd |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aksharantar (native words) | 75.55 | 11.76 | 68.58 | 34.35 | 52.30 | 76.25 | 55.94 | 38.96 | 65.04 | 65.55 | 84.64 | 36.15 | 82.38 | 53.65 | 29.05 | 67.4 | 45.08 | 54.55 | 29.95 |
| Aksharantar (named entities) | 37.86 | 52.54 | 35.49 | 50.56 | 59.22 | 60.77 | 12.87 | 35.09 | 38.18 | 45.23 | 34.87 | 56.76 | 54.05 | 47.68 | 48.00 | 42.71 | 35.46 | 57.57 | 23.14 |
Zero-shot results on Dogri (using hi prefix token)
Aksharantar (native words): 37.72
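For readers unfamiliar with the metric, the sketch below shows one common way to compute a Top-1 score: a source word counts as correct when the highest-ranked candidate matches any accepted reference. The function name, the data layout, and the toy words are assumptions made for illustration; the exact evaluation protocol behind the tables above is described in the paper and may differ.

```python
from typing import Dict, List, Set


def top1_accuracy(predictions: Dict[str, List[str]],
                  references: Dict[str, Set[str]]) -> float:
    """Percentage of source words whose best candidate is an accepted reference."""
    if not references:
        raise ValueError("references must not be empty")
    hits = 0
    for source, accepted in references.items():
        candidates = predictions.get(source, [])
        # Only the first (highest-ranked) candidate counts for Top-1.
        if candidates and candidates[0] in accepted:
            hits += 1
    return 100.0 * hits / len(references)


# Toy data, invented purely for illustration (not taken from any benchmark).
preds = {"namaste": ["नमस्ते", "नमस्टे"], "bharat": ["भरत", "भारत"]}
refs = {"namaste": {"नमस्ते"}, "bharat": {"भारत"}}
score = top1_accuracy(preds, refs)
print(f"Top-1 accuracy: {score:.2f}")
```

In this toy case the correct form of the second word is only ranked second, so it does not count towards Top-1.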
<!-- index with hyperlinks (Table of contents) -->
Table of contents
- Table of contents
- Resources
- Running inference
- Training model
- Finetuning model on your data
- Mining details
- Directory structure
- Citing
- Acknowledgements
Resources
Download IndicXlit model
<!-- hyperlinks for downloading the models -->
Roman to Indic model v1.0
Indic to Roman model v1.0
<!-- mirror links set up the public drive -->
Download Aksharantar Dataset
Aksharantar-Dataset: Huggingface
Download Aksharantar test set
Transliteration-sentence-pairs
Using hosted APIs
Roman to Indic Interface
Indic to Roman Interface
<details><summary> Click to expand </summary>

Sample screenshot of sentence transliteration
<br>
<p align="left"> <img src="./sample_images/main_page.png" width=50% height=50% /> </p>

Select the language from the drop-down list at the top-left corner:
<br>
<p align="left"> <img src="./sample_images/select_language.png" width=50% height=50% /> </p>

To transliterate into Hindi, select Hindi from the list and enter your sentence in the "text" field:
<br>
<p align="left"> <img src="./sample_images/transliterate_sentence.png" width=50% height=50% /> </p>
<br>
</details>

Accessing on ULCA
You can try out our model at ULCA under Transliteration Models, and the Aksharantar dataset under Transliteration Benchmark Datasets.
Running Inference
Command line interface
<!-- ## Using the model to transliterate the inputs -->
The model is trained on individual words as inputs. Users therefore need to split sentences into words before running the transliteration model through our command line interface.
Follow the Colab notebook to set up the environment, download the trained IndicXlit model, and transliterate your own text. The command line interface supports running on a GPU.
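Because the model works on single words, sentence-level input has to be split, transliterated word by word, and rejoined. The following is a minimal sketch of that wrapper, assuming whitespace tokenization is good enough for your text. `transliterate_word` is a hypothetical callable standing in for whatever invokes the model; it is not part of IndicXlit.

```python
from typing import Callable


def transliterate_sentence(sentence: str,
                           transliterate_word: Callable[[str], str]) -> str:
    """Split on whitespace, transliterate each word, and rejoin with spaces."""
    words = sentence.split()  # collapses repeated spaces, tabs, and newlines
    return " ".join(transliterate_word(word) for word in words)


def fake_transliterate(word: str) -> str:
    """Hypothetical stand-in for a real model call; it only marks the word."""
    return f"<{word}>"


result = transliterate_sentence("mera naam  kya hai", fake_transliterate)
print(result)
```

Punctuation stays attached to its word in this sketch, so real text may need it stripped or handled separately before the model sees it.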
<!-- colab integration on running the model on custom input cli script -->
Python interface
<!-- colab integration on running the model on custom input python script -->
The Python interface is useful when you want to reuse the model for multiple transliterations without reinitializing it each time.
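The benefit of reusing a model can be sketched with a cached factory: the expensive initialization runs once per language and later calls get the same object back. Everything below is hypothetical; `PlaceholderEngine` and `get_engine` only illustrate the pattern and are not the IndicXlit Python API.

```python
from functools import lru_cache
from typing import List


class PlaceholderEngine:
    """Hypothetical stand-in for a loaded model; NOT the IndicXlit API."""

    instances_created = 0

    def __init__(self, lang_code: str) -> None:
        # A real engine would load model weights here, which is the slow part.
        PlaceholderEngine.instances_created += 1
        self.lang_code = lang_code

    def transliterate(self, word: str) -> str:
        # Dummy behaviour so the sketch runs without any model files.
        return f"{self.lang_code}:{word}"


@lru_cache(maxsize=None)
def get_engine(lang_code: str) -> PlaceholderEngine:
    """Create the engine on first use and return the cached one afterwards."""
    return PlaceholderEngine(lang_code)


outputs: List[str] = [
    get_engine("hi").transliterate(word)
    for word in ["namaste", "bharat", "dilli"]
]
print(outputs, PlaceholderEngine.instances_created)
```

Three words are processed here, but the constructor runs only once because `get_engine("hi")` is cached.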
Details of models and hyperparameters
<!-- network and training details and link to the paper -->
- Architecture: IndicXlit uses 6 encoder and 6 decoder layers, input embeddings of size 256, 4 attention heads, and a feedforward dimension of 1024, for a total of 11M parameters.
- Loss: Cross-Entropy loss
- Optimizer: Adam
- Adam-betas: (0.9, 0.98)
- Peak-learning-rate: 0.001
- Learning-rate-scheduler: inverse-sqrt
- Temperature-sampling (T): 1.5
- Warmup-steps: 4000
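As a rough sanity check on the architecture line, the weight matrices of a standard transformer with these dimensions can be counted by hand. The sketch below assumes the usual layout (four projection matrices per attention block, two matrices per feedforward block, self-attention plus cross-attention in each decoder layer) and ignores biases, layer normalization, embeddings, and the output projection. Whether the reported 11M includes those terms is not stated here, so treat this as an estimate rather than the exact accounting.

```python
D_MODEL = 256      # input embedding size from the list above
FFN_DIM = 1024     # feedforward dimension from the list above
NUM_LAYERS = 6     # layers in the encoder and in the decoder


def attention_weights(d_model: int) -> int:
    """Query, key, value, and output projections: four d_model x d_model matrices."""
    return 4 * d_model * d_model


def feed_forward_weights(d_model: int, ffn_dim: int) -> int:
    """Two matrices: d_model x ffn_dim and ffn_dim x d_model."""
    return 2 * d_model * ffn_dim


# Encoder layer: self-attention + feedforward.
encoder_layer = attention_weights(D_MODEL) + feed_forward_weights(D_MODEL, FFN_DIM)
# Decoder layer: self-attention + cross-attention + feedforward.
decoder_layer = 2 * attention_weights(D_MODEL) + feed_forward_weights(D_MODEL, FFN_DIM)

total_weights = NUM_LAYERS * (encoder_layer + decoder_layer)
print(f"{total_weights:,} weights in attention and feedforward matrices")
```

The number of attention heads does not appear in the count because splitting the projections into 4 heads does not change their total size.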
Please refer to section 6 of our paper for more details on training setup.
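To make two of the listed hyperparameters concrete, the sketch below implements a common form of the inverse-sqrt schedule (linear warmup to the peak rate, then decay proportional to the inverse square root of the step) and the usual temperature-based sampling rule (each language's data share raised to the power 1/T and renormalized). The warmup starting at zero, the exact sampling formula, and the dataset sizes are assumptions made for illustration; the paper is the authority on the actual setup.

```python
import math
from typing import Dict

PEAK_LR = 1e-3        # Peak-learning-rate from the list above
WARMUP_STEPS = 4000   # Warmup-steps from the list above
TEMPERATURE = 1.5     # Temperature-sampling (T) from the list above


def inverse_sqrt_lr(step: int, peak_lr: float = PEAK_LR,
                    warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    if step < warmup_steps:
        # Assumption: warmup starts from a learning rate of zero.
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


def sampling_probs(sizes: Dict[str, int],
                   temperature: float = TEMPERATURE) -> Dict[str, float]:
    """Raise each language's data share to 1/T and renormalize."""
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}


# Hypothetical corpus sizes, chosen only to show the smoothing effect.
probs = sampling_probs({"hi": 900, "brx": 100})
for step in (2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
print(probs)
```

With T greater than 1, the smaller language is sampled more often than its raw 10% share, which is the point of temperature sampling in multilingual training.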
Training model
Setting up your environment
<details><summary> Click to expand </summary>

# Clone IndicXlit repository
git clone https://github.com/AI4Bharat/IndicXlit.git
# Install required libraries
pip install indic-nlp-library
# Install Fairseq from source
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable ./
</details>
Training procedure and code
The high-level steps we follow for training are as follows:
- Organize the train/valid/test data in the corpus directory so that it contains the parallel data files for all en-X (English-X) language pairs, named as follows:
  - train_x.en: the training file for the en-X language pair, containing space-separated Roman characters on each line.
  - train_x.x: the training file for the en-X language pair, containing space-separated Indic-language characters on each line.
# corpus/
# ├── train_bn.bn
# ├── train_bn.en
# ├── train_gu.gu
# ├── train_gu.en
# ├── ....
# ├── valid_bn.bn
# ├── valid_bn.en
# ├── valid_gu.gu
# ├── valid_gu.en
# ├── ....
# ├── test_bn.bn
# ├── test_bn.en
# ├── test_gu.gu
# ├── test_gu.en
# └── ....
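The character-level file format described above can be produced with a one-line transformation per word. The sketch below assumes that "characters" means Unicode code points, so vowel signs and other combining marks in Indic scripts become separate tokens; if the actual preprocessing groups them differently, the helper would need to change. Both function names are made up for this example.

```python
def to_char_line(word: str) -> str:
    """Turn a word into one line of space-separated characters (code points)."""
    return " ".join(word.strip())


def from_char_line(line: str) -> str:
    """Inverse operation: drop the separating spaces to recover the word."""
    return "".join(line.split())


roman_line = to_char_line("namaste")
indic_line = to_char_line("नमस्ते")
print(roman_line)
print(indic_line)
```

Writing `roman_line` to `train_x.en` and `indic_line` to the same line number of `train_x.x` yields one parallel training pair.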
- Combine the training files (joint training) across all languages.
# corpus/
# ├── train_combine.cmb
# └── train_combine.en
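The combining step can be sketched as a plain concatenation of the per-language files in matching order. The optional `lang_tag_format` argument prepends a language token to each source line; the zero-shot note above indicates that some prefix token is used, but its exact spelling and the stage at which it is added are assumptions here, as is the function name.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def combine_training_files(corpus_dir: Path, langs: Iterable[str],
                           lang_tag_format: Optional[str] = None) -> int:
    """Concatenate train_<lang>.en / train_<lang>.<lang> into train_combine.*."""
    pairs_written = 0
    src_out = corpus_dir / "train_combine.en"
    tgt_out = corpus_dir / "train_combine.cmb"
    with src_out.open("w", encoding="utf-8") as src_f, \
            tgt_out.open("w", encoding="utf-8") as tgt_f:
        for lang in langs:
            src_path = corpus_dir / f"train_{lang}.en"
            tgt_path = corpus_dir / f"train_{lang}.{lang}"
            src_lines = src_path.read_text(encoding="utf-8").splitlines()
            tgt_lines = tgt_path.read_text(encoding="utf-8").splitlines()
            if len(src_lines) != len(tgt_lines):
                raise ValueError(f"line count mismatch for language '{lang}'")
            # Hypothetical tag format; only applied when the caller asks for it.
            tag = lang_tag_format.format(lang=lang) + " " if lang_tag_format else ""
            for src, tgt in zip(src_lines, tgt_lines):
                src_f.write(f"{tag}{src}\n")
                tgt_f.write(f"{tgt}\n")
                pairs_written += 1
    return pairs_written


# Demo on a throwaway directory with two tiny, invented training pairs.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    (corpus / "train_bn.en").write_text("n a m\n", encoding="utf-8")
    (corpus / "train_bn.bn").write_text("ন ম\n", encoding="utf-8")
    (corpus / "train_gu.en").write_text("n a m\n", encoding="utf-8")
    (corpus / "train_gu.gu").write_text("ન મ\n", encoding="utf-8")
    n_pairs = combine_training_files(corpus, ["bn", "gu"],
                                     lang_tag_format="__{lang}__")
    combined_source = (corpus / "train_combine.en").read_text(encoding="utf-8")
    combined_target = (corpus / "train_combine.cmb").read_text(encoding="utf-8")

print(n_pairs)
print(combined_source)
```

The line-count check guards against silently misaligning source and target lines, which would corrupt every pair that follows.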
- Create the joint vocabulary using all the combined training data.
fairseq-preprocess \
--trainpref corpus/train_combine \
--source-lang en --target-lang cmb \
--workers 256 \
--destdir corpus-bin
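To illustrate what a joint vocabulary over the combined data looks like, the sketch below counts every space-separated symbol across lines from several languages and lists them by frequency, one `symbol count` entry per line. This only illustrates the idea and is not a replacement for `fairseq-preprocess`, which builds and stores its own dictionaries; the function name and the sample lines are assumptions.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab(lines: Iterable[str]) -> List[str]:
    """Count whitespace-separated symbols and list them, most frequent first."""
    counts = Counter(token for line in lines for token in line.split())
    # Ties are broken by the symbol itself so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{symbol} {count}" for symbol, count in ordered]


# Invented target-side lines: one Bengali and one Gujarati character sequence.
target_lines = ["ন ম", "ન મ ન"]
vocab = build_vocab(target_lines)
for entry in vocab:
    print(entry)
```

Because characters from all scripts land in one list, every language shares the same symbol-to-index mapping during joint training.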
- Create the binarized data required by Fairseq for each language separately, using the joint vocabulary.
for lang_abr in bn gu hi kn ml mr pa sd si ta te ur
do
fairseq-preprocess \
--trainpref corpus/train_$lang_
