<div align="center"> <h1><b><i>IndicXlit</i></b></h1> <a href="https://ai4bharat.iitm.ac.in/areas/transliteration/">Website</a> | <a href="#download-indicxlit-model">Downloads</a> | <a href="https://arxiv.org/abs/2205.03018">Paper</a> | <a href="https://xlit.ai4bharat.org/">Demo</a> | <a href="https://pypi.org/project/ai4bharat-transliteration">Python Library</a> | <a href="https://www.npmjs.com/package/@ai4bharat/indic-transliterate">JavaScript Library</a> <br><br> </div>

IndicXlit is a transformer-based multilingual transliteration model (~11M parameters) that supports 21 Indic languages for Roman-to-native-script and native-to-Roman-script conversion. It is trained on the Aksharantar dataset, which was the largest publicly available parallel transliteration corpus at the time of writing (5 May 2022), containing 26 million word pairs spanning 20 Indic languages. The supported languages are:

|  |  |  |  |  |  |
| -------------- | -------------- | -------------- | --------------- | -------------- | ------------- |
| Assamese (asm) | Bengali (ben) | Bodo (brx) | Gujarati (guj) | Hindi (hin) | Kannada (kan) |
| Kashmiri (kas) | Konkani (gom) | Maithili (mai) | Malayalam (mal) | Manipuri (mni) | Marathi (mar) |
| Nepali (nep) | Oriya (ori) | Panjabi (pan) | Sanskrit (san) | Sindhi (snd) | Sinhala (sin) |
| Tamil (tam) | Telugu (tel) | Urdu (urd) |  |  |  |

Evaluation Results

IndicXlit is evaluated on the Dakshina and Aksharantar benchmarks. It achieves state-of-the-art results on the Dakshina test set and provides baseline results on the new Aksharantar test set. The Top-1 results are summarized below. For more details, refer to our paper.

En-Indic Results

| Languages | asm | ben | brx | guj | hin | kan | kas | kok | mai | mal | mni | mar | nep | ori | pan | san | tam | tel | urd | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dakshina | - | 55.49 | - | 62.02 | 60.56 | 77.18 | - | - | - | 63.56 | - | 64.85 | - | - | 47.24 | - | 68.10 | 73.38 | 42.12 | 61.45 |
| Aksharantar (native words) | 60.27 | 61.70 | 70.79 | 61.89 | 55.59 | 76.18 | 28.76 | 63.06 | 72.06 | 64.73 | 83.19 | 63.72 | 80.25 | 58.90 | 40.27 | 78.63 | 69.78 | 84.69 | 48.37 |  |
| Aksharantar (named entities) | 38.62 | 37.12 | 30.32 | 48.89 | 58.87 | 49.92 | 20.23 | 34.36 | 42.82 | 33.93 | 44.12 | 53.57 | 52.67 | 30.63 | 36.08 | 24.06 | 42.12 | 51.82 | 47.77 |  |

Indic-En Results

| Languages | asm | ben | brx | guj | hin | kan | kas | kok | mai | mal | mni | mar | nep | ori | pan | san | tam | tel | urd |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aksharantar (native words) | 75.55 | 11.76 | 68.58 | 34.35 | 52.30 | 76.25 | 55.94 | 38.96 | 65.04 | 65.55 | 84.64 | 36.15 | 82.38 | 53.65 | 29.05 | 67.4 | 45.08 | 54.55 | 29.95 |
| Aksharantar (named entities) | 37.86 | 52.54 | 35.49 | 50.56 | 59.22 | 60.77 | 12.87 | 35.09 | 38.18 | 45.23 | 34.87 | 56.76 | 54.05 | 47.68 | 48.00 | 42.71 | 35.46 | 57.57 | 23.14 |
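For readers unfamiliar with the metric, the sketch below shows one common way to compute a Top-1 score as a percentage. It assumes a prediction counts as correct when the highest-ranked candidate exactly matches any accepted reference; the exact evaluation protocol is described in the paper and may differ. The function name and the sample data are hypothetical.

```python
from typing import Sequence, Set


def top1_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Set[str]],
) -> float:
    """Return the percentage of inputs whose top-ranked candidate is an accepted reference.

    predictions[i] is the ranked candidate list for input i (best first).
    references[i] is the set of accepted transliterations for input i.
    Assumption: exact string match against any reference counts as a hit.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        1
        for candidates, accepted in zip(predictions, references)
        if candidates and candidates[0] in accepted
    )
    return 100.0 * hits / len(predictions)


# Hypothetical sample: the third input has the right answer only at rank 2.
sample_predictions = [["नमस्ते", "नमसते"], ["भारत"], ["दिली", "दिल्ली"]]
sample_references = [{"नमस्ते"}, {"भारत"}, {"दिल्ली"}]
score = top1_accuracy(sample_predictions, sample_references)
```

In this sample, two of the three top candidates match, so a correct answer at rank 2 does not contribute to the Top-1 score.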

Zero-shot results on Dogri (using hi prefix token)

Aksharantar (native words): 37.72
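To illustrate what a prefix token means for a character-level multilingual model, the sketch below builds a model input by space-separating the characters of a word and prepending a language tag. The `__hi__` tag format and the helper name `format_source` are assumptions made for illustration, not a confirmed description of the IndicXlit preprocessing.

```python
def format_source(word: str, lang_token: str) -> str:
    """Build a character-level input line with a language prefix token.

    Assumption: the tag is written as __<code>__ and characters are
    separated by single spaces; the real preprocessing may differ.
    """
    chars = " ".join(word.strip())
    return f"__{lang_token}__ {chars}"


# Zero-shot use: a language without its own tag borrows the Hindi tag.
tagged = format_source("namaste", "hi")
```

Under these assumptions, a language that was not seen in training is handled by reusing the tag of a related language, which is what the zero-shot Dogri result above relies on.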

Table of contents
- Resources
- Running inference
  - Command line interface
  - Python interface
- Training model
- Finetuning model on your data
- Mining details
- Directory structure
- Citing
- Acknowledgements

Resources

Download IndicXlit model

Roman to Indic model v1.0

Indic to Roman model v1.0

Download Aksharantar Dataset

Aksharantar-Dataset: Huggingface

Download Aksharantar test set

Transliteration-word-pairs

Transliteration-sentence-pairs

Using hosted APIs

Roman to Indic Interface

Indic to Roman Interface

<details><summary> Click to expand </summary>

Sample screenshot of sentence transliteration

Select the language from the drop-down list at the top-left corner: <br>

To transliterate into Hindi, select Hindi from the list and enter your sentence in the "text" field: <br>

Accessing on ULCA

You can try out our model at ULCA under Transliteration Models, and the Aksharantar dataset under Transliteration Benchmark Datasets.

Running Inference

Command line interface

The model is trained on words as inputs. Hence, users need to split sentences into words before running the transliteration model when using our command line interface.

Follow the Colab notebook to set up the environment, download the trained IndicXlit model, and transliterate your own text. The command line interface supports running on a GPU.
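Because the model expects single words, a sentence has to be split before transliteration and reassembled afterwards. The sketch below shows one way to do this while preserving whitespace and surrounding punctuation. The `translit_word` callable is a hypothetical stand-in for whatever word-level model call you use, and the punctuation set is an assumption that you may need to extend.

```python
import re
import string
from typing import Callable

# Assumption: ASCII punctuation plus the danda marks is enough for edge stripping.
PUNCT = string.punctuation + "।॥"


def transliterate_sentence(sentence: str, translit_word: Callable[[str], str]) -> str:
    """Transliterate a sentence word by word, keeping spacing and punctuation."""
    pieces = []
    # Alternate between whitespace runs and non-whitespace chunks.
    for chunk in re.findall(r"\s+|\S+", sentence):
        if chunk.isspace():
            pieces.append(chunk)
            continue
        lead_len = len(chunk) - len(chunk.lstrip(PUNCT))
        core = chunk.strip(PUNCT)
        if not core:
            # The chunk is punctuation only; pass it through unchanged.
            pieces.append(chunk)
            continue
        lead = chunk[:lead_len]
        trail = chunk[lead_len + len(core):]
        pieces.append(lead + translit_word(core) + trail)
    return "".join(pieces)


# Hypothetical lookup table standing in for the model.
demo_table = {"namaste": "नमस्ते", "duniya": "दुनिया"}
result = transliterate_sentence(
    "namaste, duniya!", lambda w: demo_table.get(w.lower(), w)
)
```

Splitting on whitespace rather than on a `\w` pattern is deliberate: Python's `\w` does not match Indic vowel signs, so a `\w`-based tokenizer would break native-script words apart.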


Python interface


The Python interface is useful when you want to reuse the model for multiple transliterations without reinitializing it each time.
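The load-once pattern this describes can be sketched as follows. Everything here is hypothetical: `_load_model` stands in for an expensive model initialization and returns a dummy callable, so the example only demonstrates the caching behaviour and not the actual IndicXlit API.

```python
from functools import lru_cache
from typing import Callable, List

# Records each simulated load so the caching effect can be observed.
load_calls: List[str] = []


def _load_model(lang_code: str) -> Callable[[str], str]:
    """Hypothetical stand-in for an expensive model load."""
    load_calls.append(lang_code)
    return lambda word: f"<{lang_code}:{word}>"


@lru_cache(maxsize=None)
def get_model(lang_code: str) -> Callable[[str], str]:
    """Load the model for a language once and reuse it on later calls."""
    return _load_model(lang_code)


def transliterate_many(words: List[str], lang_code: str) -> List[str]:
    """Transliterate several words with a single shared model instance."""
    model = get_model(lang_code)
    return [model(word) for word in words]
```

Calling `transliterate_many` repeatedly for the same language triggers only one load, which is the saving the Python interface offers over launching a fresh process per request.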

Details of models and hyperparameters

Architecture: IndicXlit uses 6 encoder layers and 6 decoder layers, input embeddings of size 256, 4 attention heads, and a feedforward dimension of 1024, for a total of about 11M parameters.
Loss: Cross-Entropy loss
Optimizer: Adam
Adam-betas: (0.9, 0.98)
Peak-learning-rate: 0.001
Learning-rate-scheduler: inverse-sqrt
Temperature-sampling (T): 1.5
Warmup-steps: 4000
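
As a rough illustration of two of the settings above, the sketch below computes an inverse-square-root learning rate with linear warmup and temperature-based sampling probabilities over languages. It assumes warmup starts from a learning rate of zero and that sampling follows the commonly used form p_i ∝ (n_i / N)^(1/T); the training code may implement either differently. The function names and corpus sizes are hypothetical.

```python
import math
from typing import Dict

PEAK_LR = 1e-3       # Peak-learning-rate from the list above
WARMUP_STEPS = 4000  # Warmup-steps from the list above
TEMPERATURE = 1.5    # Temperature-sampling (T) from the list above


def inverse_sqrt_lr(step: int, peak_lr: float = PEAK_LR,
                    warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        # Assumption: warmup starts at zero and rises linearly.
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


def temperature_sampling_probs(sizes: Dict[str, int],
                               temperature: float = TEMPERATURE) -> Dict[str, float]:
    """Flatten the data distribution so that small languages are sampled more often."""
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}


# Hypothetical corpus sizes, chosen only to show the flattening effect.
probs = temperature_sampling_probs({"big": 1_000_000, "small": 10_000})
```

With T greater than 1, the smaller corpus receives a larger share of samples than its raw proportion, which helps low-resource languages during multilingual training.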

Please refer to Section 6 of our paper for more details on the training setup.
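
The stated model size can be sanity-checked with a back-of-the-envelope parameter count. The sketch below assumes a standard Transformer layer layout (biased linear projections, two layer norms per encoder layer, three per decoder layer) and ignores embeddings and the output projection, which are comparatively small for a character-level vocabulary. It is an estimate under these assumptions, not the authors' own accounting.

```python
D_MODEL = 256    # embedding size
FFN_DIM = 1024   # feedforward dimension
NUM_LAYERS = 6   # encoder layers and, separately, decoder layers


def attention_params(d: int) -> int:
    """Query, key, value, and output projections, each d x d with a bias."""
    return 4 * (d * d + d)


def ffn_params(d: int, ffn: int) -> int:
    """Two linear layers (d -> ffn -> d) with biases."""
    return (d * ffn + ffn) + (ffn * d + d)


def layer_norm_params(d: int) -> int:
    """Scale and shift vectors."""
    return 2 * d


# Encoder layer: self-attention + feedforward + 2 layer norms.
encoder_layer = (attention_params(D_MODEL) + ffn_params(D_MODEL, FFN_DIM)
                 + 2 * layer_norm_params(D_MODEL))
# Decoder layer: self-attention + cross-attention + feedforward + 3 layer norms.
decoder_layer = (2 * attention_params(D_MODEL) + ffn_params(D_MODEL, FFN_DIM)
                 + 3 * layer_norm_params(D_MODEL))

total_params = NUM_LAYERS * (encoder_layer + decoder_layer)
```

Under these assumptions the layers alone account for roughly 11.06M parameters, which is consistent with the 11M figure quoted for the architecture.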

Training model

Setting up your environment