# Strwythura: Put the Context in Context Engineering

This tutorial explains how to construct an entity-resolved knowledge graph from structured data sources and unstructured content sources, implementing an ontology pipeline, plus context engineering for optimizing AI application outcomes within a specific domain. The process is enriched with entity embeddings and graph algorithms, used in an enhanced GraphRAG approach which implements a question/answer chat bot about a particular domain. The material here provides hands-on experience with advanced techniques, as well as working code you can use elsewhere.
<img src="./strwythura/resources/logo.png" alt="Strwythura logo" width="231" style="display: block; margin-left: auto; margin-right: auto; width: 30%;" />
An article on Medium.com plus some other resources available online serve as "companions" for working with this repo. Please read along while running through each of the steps in this tutorial:
| purpose | URL |
| ------- | --- |
| words | https://blog.derwen.ai/strwythura-2a8007af3682 |
| tubes | https://www.youtube.com/watch?v=ZWl5Pb29O-o |
| slides | https://tinyurl.com/strwythura |
| codes | https://github.com/DerwenAI/strwythura |
| quiz | https://derwen.ai/quiz/ai_def |
| wiki | https://deepwiki.com/DerwenAI/strwythura |
| DOI | https://doi.org/10.5281/zenodo.16934079 |
## Overview
<details>
<summary>What this tutorial includes and how to use it.</summary>

Downstream there can be multiple patterns of usage, such as graph analytics, dashboards, GraphRAG for question/answer chat bots, agents, memory, tools, planners, and so on. We will emphasize how to curate and leverage the domain-specific semantics, optimizing for downstream AI application outcomes. Think of this as an interactive exploration of neurosymbolic AI in practice.
Code in the tutorial shows how to integrate popular Python libraries, following the maxim that "one size fits all" (OSFA) approaches and monolithic frameworks don't work well, while composable SDKs work much better:

`Senzing`, `Placekey`, `LanceDB`, `spaCy`, `GLiNER`, `RDFlib`, `NetworkX`, `ArrowSpace`, `DSPy`, `Ollama`, `Opik`, `Streamlit`, `yWorks`
The integration of these components is intended to run locally, and could be run within an air-gapped environment. However, you can also change the configuration to use remote LLM services instead.
Progressing through several steps, a workflow develops assets which are represented as a knowledge graph plus vector embeddings. This approach unbundles the processes which otherwise tend to be presented as monolithic "black box" frameworks. In contrast, we'll apply more intentional ways of developing the "context" in context engineering.
Overall, this tutorial explores the underlying concepts and technologies used in developing AI applications. Some of these technologies may be new to you, or at least you may not have worked through hands-on coding examples which integrate them:
entity resolution, named entity recognition, domain context, computable semantics, ontology pipeline, entity linking, textgraphs, human-in-the-loop, spectral indexing, interactive visualization, graph analytics, declarative LLM integration, retrieval-augmented generation, observability and optimization, epistemic literacy.
The end result is a Streamlit app which implements an enhanced GraphRAG for a question/answer chat bot. In terms of MLOps, this app runs instrumented in an environment for collecting observations about evaluations and other feedback, used for subsequent optimization.
Throughout the tutorial there are links to primary sources, articles, videos, open source projects, plus a bibliography of recommended books for further study.
The data and content used in this tutorial focus on a particular domain context: a 40+ year research study which found that replacing processed red meat with healthier foods could help reduce the risk of dementia.
Structured data includes details about the researchers and organizations involved, while unstructured content includes articles from media coverage of the study. Other domains can be substituted: simply change the sources for data and content and provide a different domain taxonomy.
The code is written primarily as a tutorial, although it is also packaged as a Python library and can be extended for your use cases. The code is published as open source with a business-friendly license.
</details>

Before going any further, take this brief quiz, as quick feedback about AI facts and fantasies: https://derwen.ai/quiz/ai_def

## Getting Started
<details>
<summary>Prerequisites for this tutorial, and steps to set up your local environment.</summary>

### Prerequisites

- Level: Beginner to Intermediate
- Some experience coding in Python
- Familiarity with popular tools such as Git and Docker
### Target Audience
- Data Scientists, Machine Learning Engineers
- Data Engineers, MLOps
- Team Leads and Managers for the roles above
### Environment

The code uses Python versions 3.11 through 3.13, and is validated against these through continuous integration.
The following must be downloaded and installed to run this tutorial:
- Git https://git-scm.com/install/
- Docker https://docs.docker.com/get-docker/
- Python 3.11-3.13 https://www.python.org/downloads/release/python-3139/
- Poetry https://python-poetry.org/docs/
- Ollama https://ollama.com/
- Opik https://github.com/comet-ml/opik
To get started, use `git` to clone the repo, then change into the repo directory and use `poetry` to install the Python dependencies:

```bash
git clone https://github.com/DerwenAI/strwythura.git
cd strwythura
poetry update
poetry run python3 -m spacy download en_core_web_md
```
Use `docker` to download the Senzing gRPC container to your environment:

```bash
docker pull senzing/serve-grpc:latest
```
Use `ollama` to download the `gemma3:12b` large language model (LLM) and have it running locally:

```bash
ollama pull gemma3:12b
```
Download and run the `opik` server, using a different directory and another terminal window:

```bash
git clone https://github.com/comet-ml/opik.git
cd opik
./opik.sh
```
</details>
## Part 1: Entity Resolution
<details>
<summary>Run <em>entity resolution</em> (ER) in the Senzing SDK to merge a collection of <strong>structured data sources</strong>, producing <em>entities</em> and <em>relations</em> among them — imagine this as the "backbone" for the resulting graph.</summary>

Launch the docker container for the Senzing gRPC server in another terminal window and leave it running:

```bash
docker run -it --publish 8261:8261 --rm senzing/serve-grpc
```
For the domain in this tutorial, suppose we have two datasets about hypothetical business directories:
- `data/dementia/corp_home.json` -- "Corporates Home UK"
- `data/dementia/acme_biz.json` -- "ACME Business Directory"
We also have datasets about the online profiles of researchers and scientific authors:
The JSONL format used in these datasets is based on a specific data mapping which provides the entity resolution process with heuristics about features available in each structured dataset.
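For illustration, here is a minimal sketch of what one such mapped record might look like, expressed in Python. The attribute names mimic the style of the Senzing generic entity specification (`DATA_SOURCE`, `RECORD_ID`, and so on), but every field and value here is invented, not taken from the tutorial datasets:

```python
import json

# Hypothetical record illustrating a data mapping for entity resolution;
# attribute names follow the style of the Senzing generic entity
# specification, while the fields and values are invented for this sketch.
record = {
    "DATA_SOURCE": "ACME_BIZ",
    "RECORD_ID": "acme-0001",
    "RECORD_TYPE": "ORGANIZATION",
    "PRIMARY_NAME_ORG": "Example Research Labs Ltd",
    "ADDR_FULL": "1 Example Street, London, UK",
}

# JSONL: each dataset stores one JSON object per line
line = json.dumps(record)
parsed = json.loads(line)
print(parsed["RECORD_ID"])  # acme-0001
```

Mapping features such as names and addresses onto well-known attributes is what gives the ER process its matching heuristics.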
Now run the following Python module, which calls the Senzing SDK via a gRPC server, merging these four datasets:
```bash
poetry run python3 1_er.py
```
This step generates graph elements: entities, relations, and properties, which get serialized to the `data/er.json` file in JSONL format.
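To browse those results, here is a minimal sketch for loading any JSONL file; the record fields in the demo are invented stand-ins, not the actual `er.json` schema:

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Read a JSONL file: one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]

# Demo with a stand-in file; in the tutorial the real file is data/er.json,
# and its record structure may differ from this two-line example.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "er.json"
    demo.write_text('{"kind": "entity", "id": 0}\n{"kind": "relation"}\n')
    records = load_jsonl(demo)

print(len(records))  # 2
```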
Take a look at the information represented in this file.
</details>
## Part 2: Semantic Layer
<details> <summary>Generate a <em>domain-specific thesaurus</em> from the ER results and combine with a SKOS-based <em>domain taxonomy</em> to populate a <em>semantic layer</em> using <code>RDFlib</code>, to organize construction of an <em>entity-resolved knowledge graph</em> (ERKG) as a <code>NetworkX</code> property graph.
</summary>

The following Python module takes input from:
- ER results in the `data/er.json` file in JSONL format, which were generated in "Part 1"
- a domain taxonomy in the `data/dementia/domain.ttl` file in RDF "Turtle" format
```bash
poetry run python3 2_sem.py
```
This populates a semantic layer in RDF, which we manage in an `RDFlib` semantic graph. It also promotes elements from the semantic graph to become the "backbone" for an entity-resolved knowledge graph (ERKG), which we manage as a `NetworkX` property graph.
Alternatively, we could store the latter in a graph database.
The generated results:

- a domain-specific thesaurus serialized as the `data/thesaurus.ttl` file in RDF "Turtle" format
- a knowledge graph serialized as the `data/erkg.json` JSON file in `NetworkX` node-link data format
Take a look at the information represented in these files.
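Once generated, the ERKG can be reloaded from node-link data with `NetworkX`. A minimal sketch, where the node ids and edge are invented stand-ins for the real `data/erkg.json` contents:

```python
import networkx as nx

# Tiny stand-in for data/erkg.json, which uses the NetworkX node-link
# data format; these node ids and the edge are invented for illustration.
node_link = {
    "directed": False,
    "multigraph": False,
    "graph": {},
    "nodes": [
        {"id": "researcher_1", "kind": "entity"},
        {"id": "university_1", "kind": "entity"},
    ],
    "links": [
        {"source": "researcher_1", "target": "university_1", "rel": "affiliated"},
    ],
}

graph = nx.node_link_graph(node_link)
print(graph.number_of_nodes(), graph.number_of_edges())  # 2 1
```

In the tutorial you would pass `json.load(open("data/erkg.json"))` in place of the inline dictionary.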
</details>

## Part 3: Crawl and Parse Content
<details> <summary>Crawl a collection of documents as the <strong>unstructured content sources</strong>, chunking the text to create embeddings in a <em>vector store</em> using <code>LanceDB</code>, while also parsing the text to construct a <em>le
