🤗 ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Google’s BERT architecture. This model is pre-trained on large Persian corpora with various writing styles from numerous subjects (e.g., scientific, novels, news) with more than 3.9M documents, 73M sentences, and 1.3B words.
Paper presenting ParsBERT: DOI: 10.1007/s11063-021-10528-4
CURRENT VERSION: V3
Introduction
ParsBERT was trained on a massive collection of public corpora (Persian Wikipedia dumps, MirasText) and six manually crawled text datasets from a variety of websites: BigBang Page (scientific), Chetor (lifestyle), Eligasht (itineraries), Digikala (digital magazine), Ted Talks (general conversation), and Books (novels, storybooks, and short stories from the classical to the contemporary era).
As a part of ParsBERT methodology, an extensive pre-processing combining POS tagging and WordPiece segmentation was carried out to bring the corpora into a proper format.
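The exact pre-processing pipeline is described in the paper; as a purely illustrative sketch (not the authors' code), the surface normalization such a pipeline needs around the Persian zero-width non-joiner (ZWNJ) might look like:

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, common in Persian orthography

def normalize(text: str) -> str:
    # Illustrative sketch only: the paper combines POS tagging with
    # WordPiece segmentation; here we show only the kind of surface
    # cleanup such a pipeline needs before tokenization.
    text = re.sub(r"\s*\u200c\s*", ZWNJ, text)  # drop spaces around ZWNJ
    text = re.sub(r"[ \t]+", " ", text)         # collapse whitespace runs
    return text.strip()

print(normalize("هوش \u200c واره"))  # -> هوش‌واره (joined by ZWNJ)
```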
<strike><a href="http://lab.hooshvare.com/">ParsBERT Playground</a></strike>
Evaluation
ParsBERT is evaluated on three downstream NLP tasks: Sentiment Analysis (SA), Text Classification (TC), and Named Entity Recognition (NER). Because existing resources were insufficient, two large datasets for SA and two for text classification were manually composed; these are available for public use and benchmarking. ParsBERT outperformed all other language models on all tasks, including multilingual BERT and hybrid deep learning models, improving the state of the art in Persian language modeling.
Results
The following tables summarize the F1 scores obtained by ParsBERT compared to other models and architectures.
Sentiment Analysis (SA) task
| Dataset | ParsBERT v3 | ParsBERT v2 | ParsBERT v1 | mBERT | DeepSentiPers |
|:------------------------:|:-----------:|:-----------:|:-----------:|:-----:|:-------------:|
| Digikala User Comments | - | 81.72 | 81.74* | 80.74 | - |
| SnappFood User Comments | - | 87.98 | 88.12* | 87.87 | - |
| SentiPers (Multi Class) | - | 71.31* | 71.11 | - | 69.33 |
| SentiPers (Binary Class) | - | 92.42* | 92.13 | - | 91.98 |
Text Classification (TC) task
| Dataset | ParsBERT v3 | ParsBERT v2 | ParsBERT v1 | mBERT |
|:-----------------:|:-----------:|:-----------:|:-----------:|:-----:|
| Digikala Magazine | - | 93.65* | 93.59 | 90.72 |
| Persian News | - | 97.44* | 97.19 | 95.79 |
Named Entity Recognition (NER) Task
| Dataset | ParsBERT v3 | ParsBERT v2 | ParsBERT v1 | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:-------:|:-----------:|:-----------:|:-----------:|:-----:|:----------:|:------------:|:--------:|:--------------:|:----------:|
| PEYMA | - | 93.40* | 93.10 | 86.64 | - | 90.59 | - | 84.00 | - |
| ARMAN | - | 99.84* | 98.79 | 95.89 | 89.9 | 84.03 | 86.55 | - | 77.45 |
If you have tested ParsBERT on a public dataset and would like to add your results to the tables above, open a pull request or contact us. Please also make sure your code is available online so we can add it as a reference.
How to use
```python
from transformers import AutoConfig, AutoTokenizer, AutoModel, TFAutoModel

# v3.0
model_name_or_path = "HooshvareLab/bert-fa-zwnj-base"

config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModel.from_pretrained(model_name_or_path)      # PyTorch
# model = TFAutoModel.from_pretrained(model_name_or_path)  # TensorFlow

text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد میتوانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
print(tokenizer.tokenize(text))
# ['ما', 'در', 'هوش', '[ZWNJ]', 'واره', 'معتقدیم', 'با', 'انتقال', 'صحیح', 'دانش', 'و', 'آ', '##گاهی', '،', 'همه', 'افراد', 'میتوانند', 'از', 'ابزارهای', 'هوشمند', 'استفاده', 'کنند', '.', 'شعار', 'ما', 'هوش', 'مصنوعی', 'برای', 'همه', 'است', '.']
```
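The same checkpoint can also be used through the high-level `pipeline` API. A minimal fill-mask sketch (the weights are downloaded from the Hugging Face Hub on first run, so this requires network access):

```python
from transformers import pipeline

# Fill-mask with the v3.0 checkpoint shown above.
fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-fa-zwnj-base")

# "Artificial [MASK] is for everyone" — the model ranks candidate tokens
# for the masked position, highest probability first.
for pred in fill_mask("هوش [MASK] برای همه است."):
    print(pred["token_str"], round(pred["score"], 3))
```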
Derivative models
V3.0
BERT v3.0 Model
DistilBERT v3.0 Model
ALBERT v3.0 Model
RoBERTa v3.0 Model
V2.0
ParsBERT v2.0 Model
ParsBERT v2.0 Sentiment Analysis
- HooshvareLab/bert-fa-base-uncased-sentiment-digikala
- HooshvareLab/bert-fa-base-uncased-sentiment-snappfood
- HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-binary
- HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-multi
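Any of the fine-tuned sentiment checkpoints above can be used directly with the `pipeline` API; a sketch using the SnappFood model (binary labels per its model card; weights are fetched from the Hub on first use):

```python
from transformers import pipeline

# Sentiment analysis with one of the v2.0 fine-tuned checkpoints listed above.
sentiment = pipeline(
    "sentiment-analysis",
    model="HooshvareLab/bert-fa-base-uncased-sentiment-snappfood",
)
print(sentiment("غذا بسیار خوب بود."))  # "The food was very good."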
ParsBERT v2.0 Text Classification
ParsBERT v2.0 NER
V1.0
ParsBERT v1.0 Model
ParsBERT v1.0 NER
- HooshvareLab/bert-base-parsbert-peymaner-uncased
- HooshvareLab/bert-base-parsbert-armanner-uncased
- HooshvareLab/bert-base-parsbert-ner-uncased
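The NER checkpoints above plug into the token-classification pipeline; a sketch using the combined PEYMA+ARMAN model (downloads weights on first run):

```python
from transformers import pipeline

# Named entity recognition with the combined NER checkpoint listed above.
# aggregation_strategy="simple" merges word pieces back into entity spans.
ner = pipeline(
    "ner",
    model="HooshvareLab/bert-base-parsbert-ner-uncased",
    aggregation_strategy="simple",
)
for entity in ner("شرکت هوشواره در تهران است."):  # "Hooshvare Co. is in Tehran."
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```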
NLP Tasks Tutorial :hugs:
| Notebook |
|--------------------------|
| Text Classification |
| Sentiment Analysis |
| Named Entity Recognition |
| Text Generation |
Cite
Please cite the following paper in your publication if you are using ParsBERT in your research:
```bibtex
@article{ParsBERT,
  title={ParsBERT: Transformer-based Model for Persian Language Understanding},
  DOI={10.1007/s11063-021-10528-4},
  journal={Neural Processing Letters},
  author={Farahani, Mehrdad and Gharachorloo, Mohammad and Farahani, Marzieh and Manthouri, Mohammad},
  year={2021}
}
```
Acknowledgments
We hereby express our gratitude to the TensorFlow Research Cloud (TFRC) program for providing us with the necessary computation resources. We also thank the Hooshvare Research Group for facilitating dataset gathering and scraping of online text resources.
Contributors
- Mehrdad Farahani: Linkedin, Twitter, Github
- Mohammad Gharachorloo: Linkedin, Twitter, Github
- Marzieh Farahani: Linkedin, Twitter, Github
- Mohammad Manthouri: Linkedin, Twitter, Github
- Hooshvare Team: Official Website, Linkedin, Twitter, Github, Instagram
Releases
v3.0 (2021-02-28)
The new version of BERT v3.0 for Persian is available today and can handle the zero-width non-joiner character used in Persian writing. The model was also trained on new multi-genre corpora with a new vocabulary.
Available at: [HooshvareLab/bert-fa-zwnj-base](https://huggingface.co/HooshvareLab/bert-fa-zwnj-base)