Sima
Multi-Label Classification of Qur’anic Similes using Arabic Transformer Models
Sima is a research-driven project that bridges Classical Arabic Rhetoric (Balāghah) and modern Natural Language Processing (NLP). This repository contains the code and expert-annotated dataset for identifying overlapping rhetorical categories in Qur’anic similes.
📌 Project Overview
Rather than assigning a single label to each verse, this project treats Qur’anic simile (Tashbīh) classification as a Multi-Label Learning (MLL) task. This accounts for "rhetorical overlap", where a single verse can embody multiple categories simultaneously (e.g., being both Mursal and Tamthīlī).
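As a rough illustration of what the multi-label framing means for the data, each verse can be represented as a multi-hot vector over the six categories. The sketch below is not taken from the repository: the label order, the ASCII label names, and the `encode`/`decode` helpers are assumptions made for this example.

```python
# Six categories named in this README; the order is an assumption for this sketch.
LABELS = ["Mursal", "Mujmal", "Tamthili", "Baligh", "Muakkad", "Mufassal"]


def encode(labels):
    """Return a multi-hot vector with a 1 for every category assigned to a verse."""
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in LABELS]


def decode(vector):
    """Map a multi-hot vector back to the list of category names."""
    return [name for name, bit in zip(LABELS, vector) if bit]


# Hypothetical verse annotated as both Mursal and Tamthili.
vector = encode({"Mursal", "Tamthili"})
print(vector)          # [1, 0, 1, 0, 0, 0]
print(decode(vector))  # ['Mursal', 'Tamthili']
```

A single-label setup would force a choice between the two categories; the multi-hot vector keeps both.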
Key Features:
- Expert-Annotated Dataset: 364 verses grounded in classical exegeses (Al-Kashshāf and Al-Tahrīr wa-al-Tanwīr).
- High Label Density: 712 total label assignments with an average of 1.96 labels per verse.
- State-of-the-Art Models: Fine-tuned Arabic Transformers including MARBERT, AraBERT, and CamelBERT.
- Results: MARBERT achieved a Micro F1-score of 0.7685 and a Macro F1-score of 0.6003.
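The gap between the Micro and Macro F1-scores reported above is typical when some categories are harder or rarer than others: micro-averaging pools all label decisions, while macro-averaging gives every category equal weight. The following is a minimal pure-Python sketch of the two averages on made-up predictions. It is not the project's evaluation code, and treating an empty category as F1 = 0 is an assumed convention.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Compute (micro, macro) F1 over multi-hot rows of equal length."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts across labels, then compute one F1.
    micro = f1(*(sum(col) for col in zip(*counts)))
    return micro, macro


# Toy data with three labels (illustrative only, not from the dataset).
y_true = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 1], [0, 0, 1]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 4), round(macro, 4))  # 0.7273 0.5
```

In the toy data the frequent first label is predicted perfectly and the rare second label is never predicted, so the macro score falls well below the micro score.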
📊 Dataset Statistics
The dataset covers 6 classical rhetorical categories:
- Mursal (Explicit)
- Mujmal (Concise)
- Tamthīlī (Representational)
- Baligh (Eloquent)
- Muakkad (Confirmed)
- Mufassal (Detailed)
| Statistic | Value |
| :--- | :--- |
| Total Verses | 364 |
| Total Label Assignments | 712 |
| Label Density | 1.96 |
| Multi-label Percentage | ~72% |
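The figures in the table are related by simple arithmetic: 712 assignments over 364 verses gives about 1.96 labels per verse. The sketch below shows how such statistics could be derived from a list of per-verse label sets. The `label_stats` helper and the toy annotations are hypothetical, not part of the repository. Note that some of the multi-label literature calls the average number of labels per instance "label cardinality" and reserves "density" for that value divided by the number of categories, so the sketch reports both.

```python
def label_stats(label_sets, n_labels):
    """Summarise a multi-label dataset given one set of labels per verse."""
    n = len(label_sets)
    total = sum(len(s) for s in label_sets)
    per_verse = total / n
    return {
        "verses": n,
        "assignments": total,
        "labels_per_verse": per_verse,               # 1.96 in this README
        "normalized_density": per_verse / n_labels,  # per-verse average / number of categories
        "multi_label_share": sum(1 for s in label_sets if len(s) > 1) / n,
    }


# Arithmetic check against the reported totals.
print(round(712 / 364, 2))  # 1.96

# Toy annotations (illustrative only).
toy = [
    {"Mursal", "Tamthili"},
    {"Baligh"},
    {"Mursal", "Mujmal", "Tamthili"},
    {"Mursal", "Mufassal"},
]
print(label_stats(toy, n_labels=6))
```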
🚀 Implementation & Reproducibility
To reproduce the experimental results and fine-tune the three Transformer models (MARBERT, AraBERT, and CamelBERT), we provide a ready-to-use Google Colab notebook:
Note: You can download the notebook from the link above and upload it to your Google Colab environment to run the training and evaluation pipeline.
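For readers who want a feel for the prediction step before opening the notebook, a common way to turn a multi-label model's raw outputs into categories is to apply an independent sigmoid to each logit and keep every label above a threshold. The sketch below assumes that scheme. The 0.5 threshold, the label order, and the example logits are illustrative assumptions and may differ from what the notebook does.

```python
import math

# Six categories named in this README; the order is an assumption for this sketch.
LABELS = ["Mursal", "Mujmal", "Tamthili", "Baligh", "Muakkad", "Mufassal"]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, threshold=0.5):
    """Keep every label whose independent sigmoid probability reaches the threshold."""
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


# Made-up logits for one verse, one value per category.
logits = [2.1, -1.3, 0.4, -3.0, -0.2, -2.5]
print(predict_labels(logits))  # ['Mursal', 'Tamthili']
```

Because each label is scored independently, a verse can receive zero, one, or several categories, which is what distinguishes this from a softmax over mutually exclusive classes.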
📚 Related Resources: The Burhan Corpus
For a more comprehensive study of Qur’anic figurative language, we recommend the Burhan Corpus. This repository hosts an extensive and detailed dataset that covers both similes (Tashbīh) and metaphors (Istiʿāra), providing full linguistic data and granular annotations for advanced rhetorical analysis.
🔗 Access Burhan Repository: https://github.com/NoorBayan/Burhan
