Amthāl: A Dataset and Computational Model of the Qur’an’s Conceptual Universe

About The Project

The Amthal Project provides the data and code for a large-scale computational analysis of the Qur'an's conceptual universe. This repository contains a new, richly annotated corpus of 4,078 figurative instances, validated through a rigorous inter-coder reliability protocol, alongside a fully reproducible workflow for modeling the text's ideological architecture as a network. The provided scripts let users replicate our core finding: the structural centrality of the PATH, POWER, and COGNITION triad.

Grounded in the ethos of Open Science, this project makes all its scholarly products transparently available to facilitate verification and extension. A manuscript detailing the full theoretical and analytical findings is currently under peer review; a citation will be added upon its publication.

Ethical Considerations & Usage Notice

The data and models in this repository engage with the Qur'an, a text of profound religious and cultural significance to communities worldwide. We have developed these resources with a commitment to scholarly accountability and methodological transparency. We ask that all users engage with them in a similar spirit.

We encourage responsible use aligned with ethical, respectful, and culturally sensitive research practices. We propose the following principles, drawn from critical data studies and digital humanities, as a guide for users:

Data as Representation, Not Reality: This dataset is a model of the Qur'an's figurative language, not the text itself. It is a form of capta (data actively constructed through scholarly judgment), not objective data. We urge users to avoid computational reductionism and to remember that any analysis of this data is an analysis of one specific, theoretically grounded representation.
Contextual and Cultural Sensitivity: The meanings encoded in this dataset are deeply embedded in historical, theological, and linguistic contexts. We caution against making decontextualized or universalist claims. Any interpretation should acknowledge the rich exegetical traditions and interpretive communities that surround the Qur'an.
Reflexivity and Positionality: As researchers, our own backgrounds and theoretical commitments shape our work. We encourage users to reflect on their own positionality and how it influences their analysis and interpretation of these resources.

The computational models presented here are intended as analytical tools, not as theological claims. They are designed to reveal structural patterns to facilitate new scholarly questions, not to provide definitive answers or replace traditional hermeneutics. We invite users to engage with these resources in a spirit of critical inquiry and intellectual humility.

Dataset & Documentation

At the core of this project is the Amthal Corpus, a new, richly annotated dataset of figurative language in the Qur'an. It serves as the empirical foundation for our computational analysis.

Dataset Overview

Content: 4,078 manually annotated figurative instances.
Annotation Depth: 25+ fields covering conceptual, rhetorical, and affective dimensions.
Validation: Annotation quality was ensured via a formal Inter-Coder Reliability (ICR) protocol.
Location: The primary data files (instances.csv, relations.csv) are located in the /data/processed/ directory.
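
As a minimal sketch of a first look at these files, the snippet below reads a CSV with Python's standard library and reports its row count and column names. The file names and the /data/processed/ location come from the list above. The `summarize_csv` helper is hypothetical, the relative path assumes the script runs from the repository root, and the UTF-8 encoding is an assumption. No field names are hard-coded, since those are defined in the codebook.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (row_count, column_names) for a CSV file with a header row."""
    # utf-8-sig also strips a byte-order mark if one is present (assumption:
    # the files are UTF-8 encoded, which the README does not state).
    with open(path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        columns = list(reader.fieldnames or [])
    return row_count, columns


if __name__ == "__main__":
    data_dir = Path("data/processed")  # assumed relative to the repository root
    for name in ("instances.csv", "relations.csv"):
        path = data_dir / name
        if path.exists():
            rows, columns = summarize_csv(path)
            print(f"{name}: {rows} rows, {len(columns)} columns")
        else:
            print(f"{name}: not found at {path}")
```

Comparing the reported column names against the codebook is a quick way to confirm that the files were downloaded intact.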

Comprehensive Documentation

For a complete understanding of the dataset and our methodological commitments, please refer to the following documents:

Datasheet: Provides a high-level overview of the project's motivation, composition, ethical considerations, and recommended use cases. It is essential reading for anyone intending to use these resources.

➡️ Read the Full Datasheet for the Amthal Corpus
Codebook (Data Dictionary): Offers a detailed, field-by-field description of the dataset, including all annotation categories, decision rules, and examples.

➡️ View the Full Codebook in /data/README.md

Reproducibility Summary

This project adheres to best practices for scientific reproducibility. We have taken the following steps to ensure our analysis is transparent and verifiable:

✅ All analysis scripts are provided in the /code/ directory.
✅ Random seeds are fixed for all stochastic steps (e.g., sampling) so that repeated runs produce identical results.
✅ A complete environment file (requirements.txt) is included for dependency management.
✅ A detailed codebook is provided in /data/README.md to explain the dataset.
✅ The full analysis pipeline is documented, with direct links to interactive notebooks.
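
To illustrate what a fixed seed provides, here is a minimal sketch of seeded sampling using only the standard library. The seed value, the `sample_ids` helper, and the 1 to 4,078 ID range are hypothetical; the project's own scripts may seed differently (for example, through NumPy).

```python
import random

SEED = 42  # hypothetical value; the project's actual seed lives in its scripts


def sample_ids(ids, k, seed=SEED):
    """Draw k items without replacement from ids, reproducibly."""
    # A dedicated Random instance keeps the draw independent of any other
    # code that touches the global random state.
    rng = random.Random(seed)
    return rng.sample(list(ids), k)


# Hypothetical instance IDs 1..4078, one per annotated instance.
first = sample_ids(range(1, 4079), k=5)
second = sample_ids(range(1, 4079), k=5)
print(first == second)  # same seed, same draw
```

Passing the seed explicitly, rather than calling `random.seed` once at import time, makes each stochastic step reproducible on its own, even if notebook cells are run out of order.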

Reproducing the Analysis

Our entire analytical workflow is documented and executable across a series of focused, self-contained notebooks. Each notebook allows for the full replication of a specific analysis or visualization from our study.

You can run these notebooks locally after setting up the environment, or explore them interactively online using Google Colab. Below, we have organized the notebooks thematically to guide you through the different components of our research.

1. Core Network Construction & Structural Analysis

This set of notebooks focuses on building the Qur'an's conceptual network and analyzing its fundamental topological properties.

1.1. Network Building & Visualization

| Analysis / Visualization | Description | Explore in Colab |
| :--- | :--- | :--- |
| Global Conceptual Network | Constructs and visualizes the main weighted co-occurrence network. | <a href="https://colab.research.google.com/drive/18e7N4VWuzkfjqKjO1gGyMKkg9NYQK_mc?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| Network Adjacency Matrix Heatmap | Provides an alternative view of the network's structure by visualizing its adjacency matrix. | <a href="https://colab.research.google.com/drive/1Ur2rGKgAYbWwAr2UUpPVJXrkasO8CTxt?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| Top 15 Strongest Conceptual Pairings | Creates a focused subgraph visualizing only the 15 most frequent co-occurrence links. | <a href="https://colab.research.google.com/drive/1k9u9Ddkedisc0upYwu1El5zT7S_Ox3iM?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
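
A minimal sketch of how a weighted co-occurrence network and its adjacency matrix can be assembled, using only the standard library. It assumes each textual unit is tagged with a set of conceptual domains and that an edge's weight counts the units in which both domains appear; the notebooks may define co-occurrence differently. PATH, POWER, and COGNITION are named in the project description, while LIGHT, the toy units, and both helper functions are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy input: the set of domains tagged in each unit (illustrative only).
units = [
    {"PATH", "POWER"},
    {"PATH", "COGNITION"},
    {"PATH", "POWER", "COGNITION"},
    {"LIGHT", "COGNITION"},
]


def cooccurrence_edges(units):
    """Count, for every unordered domain pair, the units containing both."""
    weights = Counter()
    for domains in units:
        # sorted() gives each pair a canonical order, so (A, B) and (B, A)
        # accumulate on the same key.
        for pair in combinations(sorted(domains), 2):
            weights[pair] += 1
    return weights


def adjacency_matrix(weights):
    """Return (labels, matrix) with a symmetric list-of-lists matrix."""
    labels = sorted({node for pair in weights for node in pair})
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for (a, b), w in weights.items():
        matrix[index[a]][index[b]] = w
        matrix[index[b]][index[a]] = w
    return labels, matrix


edges = cooccurrence_edges(units)
# most_common(n) keeps the n heaviest edges, as in a "strongest pairings" view.
for (a, b), w in edges.most_common(3):
    print(f"{a} - {b}: {w}")
```

The same `edges` counter feeds all three views in the table: the full network, the matrix returned by `adjacency_matrix`, and a top-n subgraph taken from `most_common`.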

1.2. Centrality & Prominence

| Analysis / Visualization | Description | Explore in Colab |
| :--- | :--- | :--- |
| Top 10 Central Conceptual Domains | Calculates and tabulates the top 10 most influential concepts based on centrality scores. | <a href="https://colab.research.google.com/drive/1SjOmwBSnV90Cbril4jPxpxU-gFzD8_IG?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| Weighted Degree Centrality Distribution | Plots the distribution of weighted degree scores to show the network's hierarchical structure. | <a href="https://colab.research.google.com/drive/1OoXjektz4F9t_8tp4lano0eVwgz7EuAw?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| Centrality vs. Textual Prominence | Creates a scatter plot comparing a concept's network role against its textual frequency. | <a href="https://colab.research.google.com/drive/1FI2__lgj6CINr2dSw7eYAbighd_rnYX4?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
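
As a sketch of one measure named in the table above, weighted degree (also called node strength) sums the weights of all edges touching a node. The edge weights below are invented toy data, and `weighted_degree` and `top_k` are hypothetical helpers; the notebooks may use other centrality measures as well, or normalize scores differently.

```python
from collections import defaultdict

# Toy weighted edges: (domain_a, domain_b) -> co-occurrence count (illustrative).
edges = {
    ("PATH", "POWER"): 2,
    ("COGNITION", "PATH"): 2,
    ("COGNITION", "POWER"): 1,
    ("COGNITION", "LIGHT"): 1,
}


def weighted_degree(edges):
    """Sum the weights of all edges incident to each node."""
    strength = defaultdict(int)
    for (a, b), weight in edges.items():
        strength[a] += weight
        strength[b] += weight
    return dict(strength)


def top_k(scores, k):
    """Return the k highest-scoring nodes, ties broken alphabetically."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


scores = weighted_degree(edges)
for domain, score in top_k(scores, 3):
    print(f"{domain}: {score}")
```

Plotting the sorted values of `scores` gives the kind of distribution the second notebook describes, and pairing each score with a raw frequency count gives the inputs for a centrality-versus-prominence scatter plot.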

2. Community Structure & Thematic Analysis

These notebooks explore the network's thematic clusters, both algorithmically detected and predefined.
