Rababa
Rababa, the diacritization library for Arabic and Hebrew (Abjad scripts in general)
Install / Use
/learn @interscript/RababaREADME
= رُبابَة RABABA the Middle-Eastern Language Diacritization Library
Middle-Eastern Language diacritization is useful for several practical business cases like text to speech or Romanization of texts or scripts.
As of now, this library supports Hebrew and Arabic.
== Purpose
This repository contains everything to train a diacritization model in Python and run it in Python and Ruby.
== Try out Rababa
Rababa can be run both in Python and Ruby. Go the directory corresponding to the language you prefer to use.
Please see the following README's, under the "Try out Rababa" section:
- https://github.com/interscript/rababa/tree/main/python[Python]
- https://github.com/interscript/rababa/tree/main/lib[Ruby]
== Library
This library was built for the https://www.interscript.org[Interscript project] (https://github.com/interscript/[at GitHub]).
Diacritization strategy is following several steps with at heart a deep learning model:
. text preprocessing . neural networks model prediction . text postprocessing
This repository contains:
-
https://github.com/interscript/rababa/tree/main/lib[lib] is the Ruby library using NNet model in ONNX format.
-
https://github.com/interscript/rababa/tree/main/docs[docs] contains an application focused summary of latest (2021-06) relevant papers and solutions.
-
https://github.com/interscript/rababa/tree/main/python[python]
** A neural network solution for automatised diacritization based on the work of https://github.com/almodhfer/Arabic_Diacritization[almodhfer], from which we overtook the baseline and more advanced and efficient CBHG models only. This very recent solution allows for efficient predictions on CPU's with a reasonable sized model.
** PyTorch to ONNX conversion of PyTorch to ONNX format
** Strings Pre-/Post-processing, also from https://github.com/almodhfer/Arabic_Diacritization[almodhfer]
- https://github.com/interscript/rababa/tree/main/tests-benchmarks[tests and benchmarking utilities], allowing to compare with other implementations.
** tests are taken from https://github.com/AliOsm/arabic-text-diacritization[diacritization benchmarking]
** we have added own, realistic datasets for the problem of diacritization
- models-data directory to store models and embeddings in various formats
== About the name
A https://en.wikipedia.org/wiki/Rebab[Rababa] is an antique string instrument.
In a similar fashion that a Rababa produces melody from a simple strings and pieces of wood, our library and diacritization gives a whole palette of colour and meanings to arabic scripts.
== Under development
We are working on the following improvements:
- Enhancing architecture and encoding
- Enhance datasets to improve models
== License and copyright
Rababa is copyright (c) 2021-2025, Ribose Inc. All rights reserved.
Rababa is licensed under the BSD-2 Clause license. See the LICENSE.adoc file for details.
== Attributions
=== General
The Rababa team would like to express their appreciation for the open-source work of these authors and researchers:
- M. A. H. Madhfar and A. M. Qamar for their work on effective deep learning models for automatic diacritization of Arabic text
- Taha Zerrouki for the original Tashkeela dataset
The team acknowledges the contributions of these authors and researchers in the field of Arabic diacritization and recognizes the importance of their work in advancing the state of the art in this area.
Rababa does not redistribute any code or data from these attributed sources. Any redistribution of these attributed sources should be done in accordance with their respective licenses.
Rababa is not responsible for any issues that may arise from the use of these external sources. These sources are provided for reference purposes only, and their use is at the user's own risk.
=== Arabic diacritization models
The neural network solution for Arabic diacritization is based on the work of M. A. H. Madhfar:
- Repository: https://github.com/almodhfer/Arabic_Diacritization
- License: MIT License
- Citation: M. A. H. Madhfar and A. M. Qamar, "Effective Deep Learning Models for Automatic Diacritization of Arabic Text," in IEEE Access, vol. 9, pp. 273-288, 2021, doi: 10.1109/ACCESS.2020.3041676.
=== Tashkeela dataset
The Tashkeela dataset used for training is provided under GPL v2 license:
- Original dataset by Taha Zerrouki: https://sourceforge.net/projects/tashkeela/
- Processed dataset by Hamza Abbad: https://sourceforge.net/projects/tashkeela-processed/
- License: GPL v2
