ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic
<img src="ARBERT_MARBERT.jpg" alt="drawing" width="30%" height="30%" align="right"/>What is this repository about?
This is the repository accompanying our ACL 2021 paper ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic. In the paper, we:
- introduce ARBERT and MARBERT, two powerful Transformer-based language models for Arabic;
- introduce ArBench, a multi-domain, multi-variety benchmark for Arabic natural language understanding based on 41 datasets across 5 different tasks and task clusters;
- evaluate ARBERT and MARBERT on ArBench and compare against available language models.
Our models establish new state-of-the-art (SOTA) results on all 5 tasks/task clusters, on 37 out of the 41 datasets. Our language models are publicly available for research (see below). The rest of this repository provides more information about our new language models, benchmark, and experiments.
Table of Contents
- 1. Our Language Models
- 2. Our Benchmark: ArBench
- 3. Model Evaluation
- 4. How to use ARBERT and MARBERT
- 5. Ethics
- 6. Download ARBERT and MARBERT Checkpoints
- 7. Citation
- 8. Acknowledgments
1. Our Language Models
1.1 ARBERT & MARBERT
ARBERT is a large-scale pre-trained masked language model focused on Modern Standard Arabic (MSA). To train ARBERT, we use the same architecture as BERT-base: 12 attention layers, each with 12 attention heads and 768 hidden dimensions, and a vocabulary of 100K WordPieces, making up ∼163M parameters. We train ARBERT on a collection of Arabic datasets comprising 61GB of text (6.2B tokens).
MARBERT is a large-scale pre-trained masked language model focused on both Dialectal Arabic (DA) and MSA, reflecting the fact that Arabic has multiple varieties. To train MARBERT, we randomly sample 1B Arabic tweets from a large in-house dataset of about 6B tweets. We only include tweets with at least 3 Arabic words, based on character string matching, regardless of whether the tweet contains non-Arabic strings. That is, we do not remove non-Arabic content so long as the tweet meets the 3-Arabic-word criterion. The dataset makes up 128GB of text (15.6B tokens). We use the same network architecture as ARBERT (BERT-base), but without the next sentence prediction (NSP) objective since tweets are short. See our repo for modifying the BERT code to remove NSP.
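Both checkpoints follow the standard BERT masked-LM interface, so, as a minimal sketch (not code from this repo), they can be loaded through Hugging Face Transformers using the UBC-NLP/ARBERT and UBC-NLP/MARBERT identifiers listed in the download section below:

```python
# Minimal sketch: load a released checkpoint and query it as a masked LM.
# Model IDs correspond to the Hugging Face hub names used for the checkpoints.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_name = "UBC-NLP/MARBERT"  # or "UBC-NLP/ARBERT" for the MSA-focused model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Both models are BERT-style masked language models, so they can be probed
# with a fill-mask pipeline before any fine-tuning.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("اللغة [MASK] جميلة"))
```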
1.2 Training Data and Vocabulary
The following table shows a comparison between ARBERT and MARBERT, on the one hand, and mBERT, XLM-R, and AraBERT, on the other hand. We compare in terms of pre-training data sources and size, vocabulary size, and model parameter size.
| | Data Source | #Tokens (ar/all) | Tokenization | Vocab Size (ar/all) | Cased | Arch. | #Param |
|---------|---------------------|----------------|---------------|--------------|-------|---------------|--------|
| mBERT | Wikipedia | 153M/1.5B | WordPiece | 5K/110K | yes | base | 110M |
| XLM-R<sub>B</sub> | CommonCrawl | 2.9B/295B | SentencePiece | 14K/250K | yes | base | 270M |
| XLM-R<sub>L</sub> | CommonCrawl | 2.9B/295B | SentencePiece | 14K/250K | yes | large | 550M |
| AraBERT | Several (3 sources) | 2.5B/2.5B | SentencePiece | 60K/64K | no | base | 135M |
| ARBERT | Several (6 sources) | 6.2B/6.2B | WordPiece | 100K/100K | no | base | 163M |
| MARBERT | Arabic Twitter | 15.6B/15.6B | WordPiece | 100K/100K | no | base | 163M |
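For a quick, unofficial sanity check of the vocabulary sizes above, the published tokenizers can be inspected directly. This is a sketch only; the hub IDs for mBERT and AraBERT below are assumptions about where those third-party checkpoints live.

```python
# Sketch: compare tokenizer vocabulary sizes for some of the models in the table.
# The mBERT and AraBERT hub IDs are assumptions; ARBERT/MARBERT use the
# UBC-NLP organization names.
from transformers import AutoTokenizer

checkpoints = {
    "mBERT": "bert-base-multilingual-cased",
    "AraBERT": "aubmindlab/bert-base-arabert",
    "ARBERT": "UBC-NLP/ARBERT",
    "MARBERT": "UBC-NLP/MARBERT",
}

for name, ckpt in checkpoints.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    print(f"{name:<8} vocab size: {tok.vocab_size}")
```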
2. Our Benchmark: ArBench
To evaluate our models, we also introduce ArBench, a new benchmark for multi-dialectal Arabic language understanding. ArBench is built using 41 datasets targeting 5 different tasks/task clusters, allowing us to offer a series of standardized experiments under rich conditions. The following are the different tasks/task clusters covered by ArBench:
2.1 Sentiment Analysis
| Reference | Data (#classes) | TRAIN | DEV | TEST |
|---------|--------|--------|-------|------|
| Alomari et al. (2017) | AJGT (2) | 1.4K | - | 361 |
| Abdul-Mageed et al. (2020b) | AraNET<sub>Sent</sub> (2) | 100K | 14.3K | 11.8K |
| Al-Twairesh et al. (2017) | AraSenTi (3) | 11,117 | 1,407 | 1,382 |
| Abu Farha and Magdy (2017) | ArSarcasm<sub>Sent</sub> (3) | 8.4K | - | 2.1K |
| Elmadany et al. (2018) | ArSAS (3) | 24.7K | - | 3.6K |
| Baly et al. (2019) | ArsenTD-LEV (5) | 3.2K | - | 801 |
| Nabil et al. (2015) | ASTD (3) | 24.7K | - | 664 |
| Nabil et al. (2015) | ASTD-B (2) | 1.06K | - | 267 |
| Abdul-Mageed and Diab (2012) | AWATIF (4) | 2.28K | 288 | 284 |
| Salameh et al. (2015) | BBN (3) | 960 | 125 | 116 |
| Elnagar et al. (2018) | HARD (2) | 84.5K | - | 21.1K |
| Nabil et al. (2015) | LABR (2) | 13.1K | - | 3.28K |
| Abdul-Mageed and Diab (2014) | SAMAR (5) | 2.49K | 310 | 316 |
| Rosenthal et al. (2017) | SemEval (3) | 24.7K | - | 6.10K |
| Salameh et al. (2015) | SYTS (3) | 960 | 202 | 199 |
| Saad (2019) | Twitter<sub>Saad</sub> (2) | 1.5K | 202 | 190 |
| Abdullah et al. (2013) | Twitter<sub>Abdullah</sub> (2) | 46K | 5.77K | 5.82K |
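All ArBench tasks are text classification, so fine-tuning follows the standard sequence-classification recipe. The sketch below is illustrative only: the CSV file names, column names, and hyperparameters are placeholders, not the paper's settings.

```python
# Illustrative fine-tuning sketch for a binary sentiment dataset; file names,
# column names, and hyperparameters are placeholders, not the paper's settings.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "UBC-NLP/MARBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical train/dev CSVs with "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "train.csv", "dev": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="marbert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

# Passing the tokenizer lets Trainer pad batches dynamically.
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=data["train"], eval_dataset=data["dev"])
trainer.train()
```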
2.2 Social Meaning
| Reference | Task | Data (#classes) | TRAIN | DEV | TEST |
|----------------------|---------------|---------|--------|--------|-------|
| Zaghouani and Charfi (2018) | Age | Arap-Tweet (3) | 1.28M | 160K | 160K |
| Zaghouani and Charfi (2018) | Gender | Arap-Tweet (2) | 1.28M | 160K | 160K |
| Abdul-Mageed et al. (2020b) | Emotion | AraNET<sub>Emo</sub> (8) | 189K | 911 | 942 |
| Abu Farha and Magdy (2017) | Sarcasm | ArSarcasm (2) | 8.4K | - | 2.1K |
| Alshehri et al. (2020a) | Dangerous | AraDang (2) | 3.4K | 616 | 664 |
| Ghanem et al. (2019) | Irony | FIRE2019 (2) | 3.6K | - | 404 |
| Mubarak et al. (2020) | Offensive | OSACT-A (2) | 10K | 1K | 2K |
| Mubarak et al. (2020) | Hate Speech | OSACT-B (2) | 10K | 1K | 2K |
2.3 Topic Classification
| Reference | Data (#classes) | TRAIN | DEV | TEST |
|-------------------------------------|---------|--------|--------|-------|
| Saad and Ashour (2010) | OSAC (10) | 17.9K | 2.24K | 2.24K |
| Abbas et al. (2011) | Khallej (4) | 4.55K | 570 | 570 |
| Chouigui et al. (2017) | ANT (5) | 25.2K | 31.5K | 31.5K |
2.4 Dialect Identification
| Reference | Data (#classes) | Task | TRAIN | DEV | TEST |
|-----------|--------|------|--------|------|------|