CACCHT
The repository contains scripts for parsing and analyzing Hebrew texts.
CACCHT: Creating Annotated Corpora of Classical Hebrew Texts
The CACCHT project is a collaboration of Martijn Naaijer (University of Zurich), Willem van Peursen (Vrije Universiteit Amsterdam), Oliver Glanz (Andrews University), Christian Canu Højgaard (Fjellhaug International University College), Martin Ehrensvärd (University of Copenhagen) and Robert Rezetko (University of Copenhagen).
Together with specialists in the field, we develop linguistically annotated datasets of Semitic texts. These datasets are publicly available and can be used freely for research and education. Some datasets have only word-level annotations, while others also contain syntactic features.
Datasets
We are working on the following datasets:
- The Dead Sea Scrolls
- The ETCBC Syriac Corpus
- The Samaritan Pentateuch
- The Copenhagen Ugaritic Corpus
- The Septuagint
Text-Fabric
All the datasets are Text-Fabric datasets and can be accessed and used with Python.
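As a minimal sketch of what accessing such a dataset from Python can look like, the snippet below wraps the `use` function from Text-Fabric's `tf.app` module. It assumes the `text-fabric` package is installed. The helper names `dataset_spec`, `load_dataset` and `first_words` are hypothetical and exist only for this illustration, and the `org/repo` specifier of each CACCHT dataset is not given here, so look it up in the repository of the dataset you need.

```python
def dataset_spec(org: str, repo: str) -> str:
    """Build the 'org/repo' string that Text-Fabric uses to locate a dataset."""
    org, repo = org.strip("/ "), repo.strip("/ ")
    if not org or not repo:
        raise ValueError("both org and repo are required")
    return f"{org}/{repo}"


def load_dataset(org: str, repo: str):
    """Load a Text-Fabric dataset and return its app object.

    Text-Fabric is imported lazily, so this module can be imported
    even where the package is not installed.
    """
    from tf.app import use  # requires: pip install text-fabric

    return use(dataset_spec(org, repo))


def first_words(app, n: int = 5) -> list:
    """Return the text of the first n word nodes of a loaded dataset."""
    F, T = app.api.F, app.api.T
    # F.otype.s("word") yields all nodes of type "word" in corpus order.
    nodes = list(F.otype.s("word"))[:n]
    return [T.text(node) for node in nodes]
```

For example, `first_words(load_dataset("ETCBC", "bhsa"))` would load the BHSA (used here only as an assumed example specifier) and return the text of its first five words. The first call downloads the data, so it needs a network connection.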
BHSA
The Biblia Hebraica Stuttgartensia Amstelodamensis (BHSA) plays an important role in this project. The BHSA is the linguistically annotated dataset of the Masoretic Text of the Hebrew Bible, developed and maintained by the ETCBC. In general, CACCHT follows the annotation conventions of the BHSA, adapting them to the specific characteristics of a language or text.
MT SP Parallels
Here you can examine verses of the Masoretic Text (MT) and the Samaritan Pentateuch (SP) in parallel. The texts and features are based on the BHSA and the CACCHT SP datasets.
Publications
The following papers explain how we make our linguistic annotations and how the datasets can be used.
Naaijer, M., Sikkel, C., Coeckelbergs, M., Attema, J., & Van Peursen, W.Th. (2023). A Transformer-based parser for Syriac morphology. In Proceedings of the Ancient Language Processing Workshop, Varna, Bulgaria, 23–29. https://aclanthology.org/2023.alp-1.3.pdf
Naaijer, M., Højgaard, C. C., Schorch, S., & Ehrensvärd, M. (2024). Text-Fabric Dataset of the Samaritan Pentateuch. Research Data Journal for the Humanities and Social Sciences, 9(1), 1–13. https://doi.org/10.1163/24523666-bja10051
Cantanhêde, S. d. O., Naaijer, M., Højgaard, C. C., & Glanz, O. (2026). Identifying Phrase Boundaries in the Samaritan Pentateuch with Machine Learning. Religions, 17(2), 192. https://doi.org/10.3390/rel17020192
