MarathiNLP
MarathiNLP is a repository dedicated to the development of tools and resources for the Marathi language.
L3Cube-MahaNLP
Despite being the third most popular language in India, the Marathi language lacks useful NLP resources. With <a href='https://arxiv.org/abs/2205.14728'> L3Cube-MahaNLP</a>, we aim to build resources and a library for Marathi natural language processing. We have contributed unsupervised and supervised datasets as well as Transformer models for Marathi. The supervised datasets cover Marathi sentiment analysis, named entity recognition, and hate speech detection. With this, we at L3Cube-Pune aim to bring Marathi to the forefront of IndicNLP. Our vision is to make Marathi a resource-rich language and promote AI for Maharashtra!
[Update] The library is now available as a Python package:
pip install mahaNLP
Usage examples are provided in this demo <a href='https://colab.research.google.com/drive/1POx3Bi1cML6-s3Z3u8g8VpqzpoYCyv2q'> Colab </a>.
[Update] We have released a new code-mixed Marathi-English unsupervised dataset, MeCorpus, and the supervised datasets MeSent, MeHate, and MeLID. <br>
[Update] We have released a new multi-domain sentiment analysis dataset, MahaSent-MD, with 60k samples across four diverse domains. A new sentiment analysis <a href='https://huggingface.co/l3cube-pune/marathi-sentiment-md'>model</a> is also available on Hugging Face.
L3Cube-MahaCorpus and Marathi BERT
L3Cube-MahaCorpus is a Marathi monolingual dataset scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We also present MahaBERT, MahaAlBERT, and MahaRoBERTa, all BERT-based masked language models, and MahaFT, the Fast Text word embeddings. Both the language models and the embeddings are trained on the full Marathi corpus with 752M tokens. The evaluation details are given in our <a href='https://arxiv.org/abs/2202.01159'> paper </a>.
Dataset Statistics
L3Cube-MahaCorpus(full) = L3Cube-MahaCorpus(news) + L3Cube-MahaCorpus(non-news)
Full Marathi Corpus incorporates all existing <a href='https://github.com/AI4Bharat/indicnlp_corpus'> sources </a>.

|Dataset|#tokens(M)|#sentences(M)|Link|
|:--------:|:----:|:----:|:----:|
|L3Cube-MahaCorpus (news)|212|17.6|<a href='https://drive.google.com/file/d/1gLI38-YdvapattwxC3z46Fgzif7j8_Ji/view?usp=sharing'> link </a>|
|L3Cube-MahaCorpus (non-news)|76.4|7.2|<a href='https://drive.google.com/file/d/1KHHJByCFwJxMJaGkO3FjIQbkLbc7rHAQ/view?usp=sharing'> link </a>|
|L3Cube-MahaCorpus (full)|289|24.8|<a href='https://drive.google.com/file/d/1sHIIq7C-WA6nSQaoVr4uL6pas8MVNmAr/view?usp=sharing'> link </a>|
|Full Marathi Corpus (all sources)|752|57.2|<a href='https://drive.google.com/file/d/1UjZ-X2S77AQyCkHqw2mFXRWYf9WOZS0m/view?usp=sharing'> link </a>|
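The relation L3Cube-MahaCorpus(full) = news + non-news can be checked against the figures in the table. The sketch below is only an illustration: the dictionary layout and the helper name `combined` are assumptions made for this example, and the numbers are copied from the table (in millions). The sentence counts add up exactly, while the token counts sum to 288.4M against the reported 289M, a gap that is presumably due to rounding in the table.

```python
# Figures copied from the Dataset Statistics table, in millions.
# The dictionary layout and helper name are illustrative, not part of any library.
CORPUS_STATS = {
    "news": {"tokens_m": 212.0, "sentences_m": 17.6},
    "non-news": {"tokens_m": 76.4, "sentences_m": 7.2},
    "full": {"tokens_m": 289.0, "sentences_m": 24.8},
    "all-sources": {"tokens_m": 752.0, "sentences_m": 57.2},
}


def combined(parts, field):
    """Sum one statistic over the named corpus parts."""
    return sum(CORPUS_STATS[part][field] for part in parts)


tokens_m = combined(["news", "non-news"], "tokens_m")
sentences_m = combined(["news", "non-news"], "sentences_m")

# Share of the full Marathi corpus (all sources) contributed by L3Cube-MahaCorpus.
token_share = CORPUS_STATS["full"]["tokens_m"] / CORPUS_STATS["all-sources"]["tokens_m"]

print(f"news + non-news tokens:    {tokens_m:.1f}M (reported full: 289M)")
print(f"news + non-news sentences: {sentences_m:.1f}M (reported full: 24.8M)")
print(f"MahaCorpus share of all-source tokens: {token_share:.1%}")
```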
L3Cube-MeCorpus and code-mixed MeBERT
L3Cube-MeCorpus is a first-of-its-kind large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences, introduced in our <a href='https://arxiv.org/abs/2306.14030'> paper </a>.
|Dataset|#tokens(M)|#sentences(M)|Link|
|:--------:|:----:|:----:|:----:|
|L3Cube-MeCorpus (Roman)|70.9|5|<a href='https://drive.google.com/file/d/1yyoANEBbj6keb0zfcXTpKmw0bueKy_fa/view?usp=sharing'> link </a>|
|L3Cube-MeCorpus (Devanagari)|68.6|5|<a href='https://drive.google.com/file/d/1WNVXj6KZ0_kgr-wYb5CBUbaL4dlM3QVa/view?usp=sharing'> link </a>|
|L3Cube-MeCorpus (Roman + Devanagari)|139.5|10|<a href='https://drive.google.com/file/d/1fvDEVlb1SCaxqUl3Xu5LYzAJWLzHeLy0/view?usp=sharing'> link </a>|
Marathi BERT models and Marathi Fast Text model
The full Marathi Corpus is used to train BERT language models, which are made available on the Hugging Face model hub.

|Model|Description|Link|
|:--------:|:----:|:----:|
|MahaGemma-7B|Gemma-7B|<a href='https://huggingface.co/l3cube-pune/marathi-gpt-gemma-7b'> v1 </a>|
|MahaGemma-2B|Gemma-2B|<a href='https://huggingface.co/l3cube-pune/marathi-gpt-gemma-2b'> v1 </a>|
|MahaBERT|Base-BERT|<a href='https://huggingface.co/l3cube-pune/marathi-bert'> v1 </a>, <a href='https://huggingface.co/l3cube-pune/marathi-bert-v2'> v2 </a>, <a href='https://arxiv.org/abs/2202.01159'> paper </a>|
|MahaRoBERTa|RoBERTa|<a href='https://huggingface.co/l3cube-pune/marathi-roberta'> link </a>|
|MahaAlBERT|AlBERT|<a href='https://huggingface.co/l3cube-pune/marathi-albert'> v1 </a>, <a href='https://huggingface.co/l3cube-pune/marathi-albert-v2'> v2 </a>|
|MahaGPT|GPT2|<a href='https://huggingface.co/l3cube-pune/marathi-gpt'> link </a>|
|MahaFT|Fast Text|<a href='https://drive.google.com/file/d/1xuQPMUIFvjgQranChgJ3alHXMJVeCVz0/view?usp=sharing'> bin </a>, <a href='https://drive.google.com/file/d/1-2rCOsgxKgTigonta4FvA4WBWIaXVX73/view?usp=sharing'> vec </a>|
|MahaTweetBERT|MahaBERT + Tweets|<a href='https://huggingface.co/l3cube-pune/marathi-tweets-bert'> model </a>, <a href='https://arxiv.org/abs/2210.04267'> paper </a>|
|MahaSBERT|Sentence-BERT|<a href='https://huggingface.co/l3cube-pune/marathi-sentence-similarity-sbert'> MahaSBERT-STS </a>, <a href='https://huggingface.co/l3cube-pune/marathi-sentence-bert-nli'> MahaSBERT </a>, <a href='https://arxiv.org/abs/2211.11187'> paper </a>|
|IndicSBERT|Sentence-BERT (for cross-language)|<a href='https://huggingface.co/l3cube-pune/indic-sentence-similarity-sbert'> IndicSBERT-STS </a>, <a href='https://huggingface.co/l3cube-pune/indic-sentence-bert-nli'> IndicSBERT </a>, <a href='https://arxiv.org/abs/2304.11434'> paper </a>|
|MeBERT|Codemixed Marathi-English BERT (Roman)|<a href='https://huggingface.co/l3cube-pune/me-bert'> me-bert </a>, <a href='https://arxiv.org/abs/2306.14030'> paper </a>|
|MeRoBERTa|Codemixed Marathi-English RoBERTa (Roman)|<a href='https://huggingface.co/l3cube-pune/me-roberta'> me-roberta </a>, <a href='https://arxiv.org/abs/2306.14030'> paper </a>|
|MeBERT-Mixed|Codemixed Marathi-English BERT (Roman + Devanagari)|<a href='https://huggingface.co/l3cube-pune/me-bert-mixed'> me-bert-mixed </a>, <a href='https://huggingface.co/l3cube-pune/me-bert-mixed-v2'> me-bert-mixed-v2 </a>, <a href='https://arxiv.org/abs/2306.14030'> paper </a>|
|MeRoBERTa-Mixed|Codemixed Marathi-English RoBERTa (Roman + Devanagari)|<a href='https://huggingface.co/l3cube-pune/me-roberta-mixed'> me-roberta-mixed </a>, <a href='https://arxiv.org/abs/2306.14030'> paper </a>|
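As a rough sketch of how the masked language models listed above might be loaded, the snippet below uses the Hugging Face `transformers` fill-mask pipeline. The model IDs are taken from the links in the table; the dictionary `MODEL_IDS` and the helpers `resolve_model_id` and `load_fill_mask` are hypothetical names introduced for this example and are not part of this repository or of the mahaNLP package. It assumes `transformers` and a backend such as PyTorch are installed.

```python
# Hugging Face model IDs copied from the links in the model table.
# The short names on the left are labels chosen for this example only.
MODEL_IDS = {
    "MahaBERT": "l3cube-pune/marathi-bert-v2",
    "MahaRoBERTa": "l3cube-pune/marathi-roberta",
    "MahaAlBERT": "l3cube-pune/marathi-albert-v2",
    "MeBERT": "l3cube-pune/me-bert",
    "MeRoBERTa": "l3cube-pune/me-roberta",
}


def resolve_model_id(name):
    """Map a short model name to its Hugging Face model ID."""
    try:
        return MODEL_IDS[name]
    except KeyError:
        known = ", ".join(sorted(MODEL_IDS))
        raise ValueError(f"Unknown model {name!r}; expected one of: {known}") from None


def load_fill_mask(name="MahaBERT"):
    """Build a fill-mask pipeline for one of the masked language models.

    transformers is imported lazily so the lookup helpers above work
    even when the library is not installed. Calling this function
    downloads the model weights on first use.
    """
    from transformers import pipeline

    return pipeline("fill-mask", model=resolve_model_id(name))
```

A call such as `fill = load_fill_mask("MahaBERT")` followed by `fill(f"मी {fill.tokenizer.mask_token} आहे")` should return a ranked list of candidate tokens for the masked position. Reading the mask token from the tokenizer avoids hard-coding it, since BERT-style and RoBERTa-style models use different mask strings.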
Supervised Datasets
|Dataset|Description|Samples (train, valid, test)|Link|Model|Paper|
|:--------:|:----:|:----:|:----:|:----:|:----:|
|MahaSQuAD|Marathi Question Answering Dataset|142k (118516, 11873, 11803)|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaSQuAD'> data </a>|<a href='https://huggingface.co/l3cube-pune/marathi-question-answering-squad-bert'> MahaSQuAD-BERT </a>|<a href='https://arxiv.org/abs/2404.13364'> link </a>|
|MahaNews|Marathi long, medium, and short document classification dataset with 12 target classes|53k (42k, 5k, 5k)|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNews'> data </a>|<a href='https://huggingface.co/l3cube-pune/marathi-topic-all-doc'> MahaNews-All-BERT </a>|<a href='https://arxiv.org/abs/2404.18216'> link </a>|
|MahaNER|Marathi Named Entity Recognition dataset with 8 entity classes|25k (21.5k, 1.5k, 2k)|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER'> data </a>|<a href='https://huggingface.co/l3cube-pune/marathi-ner'> MahaNER-BERT </a>|<a href='https://arxiv.org/abs/2204.06029'> link </a>|
|MahaSocialNER|Social media based Marathi Named Entity Recognition dataset with 8 entity classes|18k (12k, 1.5k, 2.2k)|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaSocialNER'> data </a>|<a href='https://huggingface.co/l3cube-pune/marathi-social-ner'> MahaSocialNER-BERT </a>|<a href='http://arxiv.org/abs/2401.00170'> link </a>|
|MahaHate|Marathi Hate Speech Detection dataset with 4-class (hate, offensive, profane, and not) and 2-class (hate and not) labels|4-class: 25k (21.5k, 1.5k, 2k), 2-class: 37500|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaHate'> data </a>|<a href='https://huggingface.co/l3cube-pune/mahahate-multi-roberta'> 4-class </a>, <a href='https://huggingface.co/l3cube-pune/mahahate-bert'> 2-class </a>|<a href='https://arxiv.org/abs/2203.13778'> link </a>|
|MahaSent|Marathi Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0)|18,378 (12114, 1500, 2250); extra(2,514=2355(+1) + 159(-1))|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/L3CubeMahaSent%20Dataset'> data </a>|<a href='https://huggingface.co/l3cube-pune/MarathiSentiment'> MarathiSentiment </a>|<a href='https://arxiv.org/abs/2103.11408'> link </a>|
|HateEval-Mr|Another dataset for evaluation of Hate Speech models with two classes - Hate(1) and None(0)|2k samples|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/HateEval'> data </a>| |<a href='https://arxiv.org/abs/2210.04267'> link </a>|
|MahaSent-MD|A Multi-domain Marathi Sentiment Analysis dataset (4 domains - Marathi Movie Reviews, TV Subtitles, Generic Tweets, and Political Tweets) with three classes - Positive(1), Negative(-1) and Neutral(0)|60k samples|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaSent-MD/'> data </a>|<a href='https://huggingface.co/l3cube-pune/marathi-sentiment-md'>MahaSent-MD</a>|<a href='https://arxiv.org/abs/2306.13888'> link </a>|
|MeSent|A code-mixed Marathi-English Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0)|12k samples|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/MeEval/L3Cube-MeSent'> data </a>|<a href='https://huggingface.co/l3cube-pune/me-sent-roberta'>me-sent-roberta</a>|<a href='https://arxiv.org/abs/2306.14030'> link </a>|
|MeHate|A code-mixed Marathi-English Hate speech identification dataset with two classes - Hate(1) and None(0)|2768 samples|<a href='https://github.com/l3cube-pune/MarathiNLP/tree/main/MeEval/L3Cube-MeHate'> data </a>|<a href='https://huggingface.co/l3cube-pune/me-hate-bert'>me-hate-bert</a>|<a href='https://arxiv.org/abs/2306.14030'> link </a>|
|MeLID|A code-mixed Marathi-Englis
