TrOCR Marathi Printed

This project uses TrOCR, a Transformer-based Optical Character Recognition model, to recognize text in scanned Marathi documents. TrOCR pairs an image Transformer encoder with a text Transformer decoder, so it draws on both computer vision and natural language processing for text recognition.

Dataset

This repository includes two datasets, both derived from scanned Marathi documents. One is organized at the line level and the other at the word level.

Line-Level Dataset

The line-level dataset consists of 2671 PNG images of text lines and a CSV file, output.csv, with two columns: the image file name and the corresponding text. For convenience, the images and the CSV file are compressed into output-zip.zip; the unprocessed images are in raw_dataset.
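As a rough illustration of the two-column layout described above, the labels can be read straight from the archive with the standard library. This is only a sketch: the function name `read_labels`, the `has_header` switch, and the assumption that the CSV is UTF-8 and sits somewhere inside output-zip.zip are not confirmed by the repository.

```python
import csv
import io
import zipfile


def read_labels(zip_path, csv_name="output.csv", has_header=True):
    """Return a list of (image_file_name, text) pairs from the CSV inside the zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # Find the CSV by base name, since its folder inside the archive is unknown.
        member = next(n for n in archive.namelist() if n.split("/")[-1] == csv_name)
        with archive.open(member) as raw:
            # Assumption: the file is UTF-8; utf-8-sig also drops a byte-order mark.
            text_stream = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            rows = [row for row in csv.reader(text_stream) if len(row) >= 2]
    if has_header and rows:
        rows = rows[1:]  # assumption: the first row holds column names
    return [(row[0], row[1]) for row in rows]
```

If the CSV turns out to have no header row, pass `has_header=False` so the first image is not skipped.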

Word-Level Dataset

A word-level dataset is also available in words-dataset. It contains 8077 PNG images in both preprocessed and unprocessed forms, compressed into words-preprocessed.zip and words-raw.zip respectively. The accompanying CSV file, words.csv, follows the same format as the line-level CSV: the image file name and the corresponding text.
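Because the labels and the images ship as separate files, a quick consistency check after extraction can catch missing or unlabelled images before training. The sketch below is hypothetical: `compare_labels_to_images` is not part of the repository, and it assumes the images have been unpacked into a single flat folder.

```python
from pathlib import Path


def compare_labels_to_images(labelled_names, image_dir, suffix=".png"):
    """Return (missing, unlabelled).

    missing:    names listed in the CSV that have no image file on disk
    unlabelled: image files on disk that have no row in the CSV
    """
    on_disk = {p.name for p in Path(image_dir).iterdir() if p.suffix.lower() == suffix}
    labelled = set(labelled_names)
    return sorted(labelled - on_disk), sorted(on_disk - labelled)
```

For the word-level data, both returned lists would be expected to be empty when the names come from the first column of words.csv and `image_dir` holds the 8077 extracted images.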

Notebooks

The project contains two Jupyter notebooks: train.ipynb and test.ipynb. The train notebook shows how to fine-tune a pre-trained TrOCR model on the Marathi dataset using the Hugging Face Transformers library. The test notebook shows how to use the fine-tuned model to perform text recognition on new images and evaluate its performance.
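The README does not say which metric the test notebook reports. A common choice for OCR is the character error rate (CER), so the following is a minimal pure-Python sketch of that metric, offered as an assumption rather than as the notebook's actual evaluation code. The function names are illustrative.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two sequences, keeping only two rows in memory."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(predictions, references):
    """Total edit distance divided by total reference length, counted in code points."""
    total_edits = 0
    total_chars = 0
    for pred, ref in zip(predictions, references):
        # Normalize so that equivalent Unicode encodings compare as equal.
        pred = unicodedata.normalize("NFC", pred)
        ref = unicodedata.normalize("NFC", ref)
        total_edits += edit_distance(pred, ref)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars
```

Note that this counts Unicode code points, so a Devanagari vowel sign (matra) is scored as its own character. Counting by grapheme cluster instead would give somewhat different numbers.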

Usage

To use this project, you need Python 3.6 or higher and the packages listed in requirements.txt. You also need to download the pre-trained TrOCR model from the Hugging Face model hub and save it in the models folder. You can then run the notebooks in your preferred environment, such as Google Colab or your local machine.
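One possible way to fetch a model into the models folder is the `snapshot_download` function from the `huggingface_hub` package. This is a sketch under assumptions: the helper names and the one-subfolder-per-model layout are hypothetical, and the README does not state which model ID to download.

```python
from pathlib import Path


def local_model_dir(model_id, models_dir="models"):
    """Map a Hub ID such as 'org/name' to a folder such as models/name (assumed layout)."""
    return Path(models_dir) / model_id.split("/")[-1]


def download_model(model_id, models_dir="models"):
    """Download every file of a Hub model repository into the local models folder."""
    # Imported here so the path helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(model_id, models_dir)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=model_id, local_dir=str(target))
    return target
```

For example, `download_model("microsoft/trocr-base-printed")` would fetch a public TrOCR checkpoint. That ID is used here purely for illustration and is not necessarily the one the notebooks expect.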

Pre-trained Models

The project uses two pre-trained models for the vision encoder and the text decoder:

Google ViT: A Vision Transformer model that encodes an input image as a sequence of patches and applies self-attention to learn global features.
Marathi-BERT-v2 (MahaBERT): A Marathi BERT model, obtained by fine-tuning the multilingual BERT model google/muril-base-cased on L3Cube-MahaCorpus and other publicly available Marathi monolingual datasets.
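With the Hugging Face Transformers library, an encoder and a decoder like the two listed above can be joined through `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`. The sketch below shows one way this might look. The Hub IDs are assumptions inferred from the model names in this README, `build_model` is a hypothetical helper, and the training notebook may configure things differently. It also assumes a Transformers version recent enough to provide `AutoImageProcessor`.

```python
ENCODER_ID = "google/vit-base-patch16-224"  # assumed Hub ID for the ViT encoder
DECODER_ID = "l3cube-pune/marathi-bert-v2"  # assumed Hub ID for the BERT decoder


def build_model(encoder_id=ENCODER_ID, decoder_id=DECODER_ID):
    """Compose an image-to-text model from a vision encoder and a BERT-style decoder."""
    # Imported here so this file can be inspected without transformers installed.
    from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

    image_processor = AutoImageProcessor.from_pretrained(encoder_id)
    tokenizer = AutoTokenizer.from_pretrained(decoder_id)

    # Loads both checkpoints. The decoder's cross-attention layers are newly
    # initialized, which is why the combined model must be fine-tuned before use.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_id, decoder_id)

    # BERT has no dedicated start/end tokens, so [CLS] and [SEP] stand in for them.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.eos_token_id = tokenizer.sep_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, image_processor, tokenizer
```

Calling `build_model()` downloads both checkpoints on first use, so it needs network access and enough disk space for the two models.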

License

I created the dataset and own the rights to it. If you want to use it for your own research or projects, you must contact me first and obtain my permission. The code and the notebooks are licensed so that you can use, modify, and distribute them freely, as long as you give credit to me and the original sources.

References

Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., & Wei, F. (2021). TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. arXiv preprint arXiv:2109.10282.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... & Brew, J. (2019). HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.

Important Links

Marathi-BERT-V2 by l3cube-pune
Hugging Face Model: marathi-bert-v2
Marathi-BERT by l3cube-pune
Hugging Face Model: marathi-bert
Marathi_BERT_V2 BertEmbeddings from l3cube-pune
More Information: Marathi_BERT_V2 BertEmbeddings by l3cube-pune
Vision Transformer by Google Research
GitHub Repository: vision_transformer
VIT-Base-Patch16-224 by Google
Hugging Face Model: vit-base-patch16-224
L3Cube-MahaCorpus and other publicly available Marathi monolingual datasets
GitHub Repository: MarathiNLP
