# Strwythura: Put the Context in Context Engineering

This tutorial explains how to construct an entity-resolved knowledge graph from structured data sources and unstructured content sources, implementing an ontology pipeline, plus context engineering for optimizing AI application outcomes within a specific domain. The process is enriched with entity embeddings and graph algorithms, used in an enhanced GraphRAG approach which implements a question/answer chat bot about a particular domain. The material here provides hands-on experience with advanced techniques, as well as working code you can use elsewhere.
<img src="./strwythura/resources/logo.png" alt="Strwythura logo" width="231" style="display: block; margin-left: auto; margin-right: auto; width: 30%;" />
An article on Medium.com plus some other resources available online serve as "companions" for working with this repo. Please read along while running through each of the steps in this tutorial:
| purpose | URL |
| ------- | --- |
| words | https://blog.derwen.ai/strwythura-2a8007af3682 |
| tubes | https://www.youtube.com/watch?v=ZWl5Pb29O-o |
| slides | https://tinyurl.com/strwythura |
| codes | https://github.com/DerwenAI/strwythura |
| quiz | https://derwen.ai/quiz/ai_def |
| wiki | https://deepwiki.com/DerwenAI/strwythura |
| DOI | https://doi.org/10.5281/zenodo.16934079 |
## Overview
<details>
<summary>What this tutorial includes and how to use it.</summary>

Downstream there can be multiple patterns of usage, such as graph analytics, dashboards, GraphRAG for question/answer chat bots, agents, memory, tools, planners, and so on. We will emphasize how to curate and leverage the domain-specific semantics, optimizing for downstream AI application outcomes. Think of this as an interactive exploration of neurosymbolic AI in practice.
Code in the tutorial shows how to integrate popular Python libraries, following the maxim that "one size fits all" (OSFA) approaches and monolithic frameworks don't work well, while composable SDKs work much better:

`Senzing`, `Placekey`, `LanceDB`, `spaCy`, `GLiNER`, `RDFlib`, `NetworkX`, `ArrowSpace`, `DSPy`, `Ollama`, `Opik`, `Streamlit`, `yWorks`
The integration of these components is intended to run locally, and could be run within an air-gapped environment. However, you can also change the configuration to use remote LLM services instead.
Progressing through several steps, a workflow develops assets which are represented as a knowledge graph plus vector embeddings. This approach unbundles the processes which otherwise tend to be presented as monolithic "black box" frameworks. In contrast, we'll apply more intentional ways of developing the "context" in context engineering.
Overall, this tutorial explores the underlying concepts and technologies used in developing AI applications. Some of these technologies may be new to you, or at least you may not have worked through hands-on coding examples which integrate them:
entity resolution, named entity recognition, domain context, computable semantics, ontology pipeline, entity linking, textgraphs, human-in-the-loop, spectral indexing, interactive visualization, graph analytics, declarative LLM integration, retrieval-augmented generation, observability and optimization, epistemic literacy.
The end result is a Streamlit app which implements an enhanced GraphRAG for a question/answer chat bot. In terms of MLOps, this app runs instrumented in an environment for collecting observations about evaluations and other feedback, used for subsequent optimization.
Throughout the tutorial there are links to primary sources, articles, videos, open source projects, plus a bibliography of recommended books for further study.
The data and content used in this tutorial focus on a particular domain context: a 40+ year research study which found that replacing processed red meat with healthier foods could help reduce the risk of dementia.
Structured data includes details about the researchers and organizations involved, while unstructured content includes articles from media coverage of the study. Other domains can be substituted: simply change the sources for data and content and provide a different domain taxonomy.
The code is written primarily as a tutorial, although it is also packaged as a Python library and can be extended for your use cases. The code is published as open source with a business-friendly license.
</details>

Before going any further, take this brief quiz, as quick feedback about AI facts and fantasies: https://derwen.ai/quiz/ai_def

## Getting Started
<details>
<summary>Prerequisites for this tutorial, and steps to set up your local environment.</summary>

### Prerequisites

- Level: Beginner to Intermediate
- Some experience coding in Python
- Familiarity with popular tools such as Git and Docker
### Target Audience
- Data Scientists, Machine Learning Engineers
- Data Engineers, MLOps
- Team Leads and Managers for the roles above
### Environment

The code uses Python versions 3.11 through 3.13, and is validated against these through continuous integration.
The following must be downloaded and installed to run this tutorial:
- Git https://git-scm.com/install/
- Docker https://docs.docker.com/get-docker/
- Python 3.11-3.13 https://www.python.org/downloads/release/python-3139/
- Poetry https://python-poetry.org/docs/
- Ollama https://ollama.com/
- Opik https://github.com/comet-ml/opik
To get started, use `git` to clone the repo, then change into the repo directory and use `poetry` to install the Python dependencies:

```bash
git clone https://github.com/DerwenAI/strwythura.git
cd strwythura
poetry update
poetry run python3 -m spacy download en_core_web_md
```
Use `docker` to download the Senzing gRPC container to your environment:

```bash
docker pull senzing/serve-grpc:latest
```
Use `ollama` to download the `gemma3:12b` large language model (LLM) and have it running locally:

```bash
ollama pull gemma3:12b
```
Download and run the `opik` server, using a different directory and another terminal window:

```bash
git clone https://github.com/comet-ml/opik.git
cd opik
./opik.sh
```
</details>
## Part 1: Entity Resolution
<details>
<summary>Run <em>entity resolution</em> (ER) in the Senzing SDK to merge a collection of <strong>structured data sources</strong>, producing <em>entities</em> and <em>relations</em> among them — imagine this as the "backbone" for the resulting graph.</summary>

Launch the docker container for the Senzing gRPC server in another terminal window and leave it running:

```bash
docker run -it --publish 8261:8261 --rm senzing/serve-grpc
```
For the domain in this tutorial, suppose we have two datasets about hypothetical business directories:
- `data/dementia/corp_home.json` -- "Corporates Home UK"
- `data/dementia/acme_biz.json` -- "ACME Business Directory"
We also have datasets about the online profiles of researchers and scientific authors:
The JSONL format used in these datasets is based on a specific data mapping which provides the entity resolution process with heuristics about features available in each structured dataset.
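For illustration, here is a minimal sketch of what one such mapped record might look like, expressed in Python. The attribute names mimic the style of the Senzing generic entity specification (`DATA_SOURCE`, `RECORD_ID`, and so on), but every field and value here is invented, not taken from the tutorial datasets:

```python
import json

# Hypothetical record illustrating a data mapping for entity resolution;
# attribute names follow the style of the Senzing generic entity
# specification, while the fields and values are invented for this sketch.
record = {
    "DATA_SOURCE": "ACME_BIZ",
    "RECORD_ID": "acme-0001",
    "RECORD_TYPE": "ORGANIZATION",
    "PRIMARY_NAME_ORG": "Example Research Labs Ltd",
    "ADDR_FULL": "1 Example Street, London, UK",
}

# JSONL: each dataset stores one JSON object per line
line = json.dumps(record)
parsed = json.loads(line)
print(parsed["RECORD_ID"])  # acme-0001
```

Mapping features such as names and addresses onto well-known attributes is what gives the ER process its matching heuristics.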
Now run the following Python module, which calls the Senzing SDK via a gRPC server, merging these four datasets:
```bash
poetry run python3 1_er.py
```
This step generates graph elements: entities, relations, and properties, which get serialized to the `data/er.json` file in JSONL format.
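To browse those results, here is a minimal sketch for loading any JSONL file; the record fields in the demo are invented stand-ins, not the actual `er.json` schema:

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Read a JSONL file: one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]

# Demo with a stand-in file; in the tutorial the real file is data/er.json,
# and its record structure may differ from this two-line example.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "er.json"
    demo.write_text('{"kind": "entity", "id": 0}\n{"kind": "relation"}\n')
    records = load_jsonl(demo)

print(len(records))  # 2
```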
Take a look at the information represented in this file.
</details>
## Part 2: Semantic Layer
<details> <summary>Generate a <em>domain-specific thesaurus</em> from the ER results and combine with a SKOS-based <em>domain taxonomy</em> to populate a <em>semantic layer</em> using <code>RDFlib</code>, to organize construction of an <em>entity-resolved knowledge graph</em> (ERKG) as a <code>NetworkX</code> property graph.
</summary>

The following Python module takes input from:
- ER results in the `data/er.json` file in JSONL format, which were generated in "Part 1"
- a domain taxonomy in the `data/dementia/domain.ttl` file in RDF "Turtle" format
```bash
poetry run python3 2_sem.py
```
This populates a semantic layer in RDF, which we manage in an `RDFlib` semantic graph. It also promotes elements from the semantic graph to become the "backbone" for an entity-resolved knowledge graph (ERKG), which we manage as a `NetworkX` property graph.
Alternatively, we could store the latter in a graph database.
The generated results:

- a domain-specific thesaurus serialized as the `data/thesaurus.ttl` file in RDF "Turtle" format
- a knowledge graph serialized as the `data/erkg.json` JSON file in `NetworkX` node-link data format
Take a look at the information represented in these files.
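Once generated, the ERKG can be reloaded from node-link data with `NetworkX`. A minimal sketch, where the node ids and edge are invented stand-ins for the real `data/erkg.json` contents:

```python
import networkx as nx

# Tiny stand-in for data/erkg.json, which uses the NetworkX node-link
# data format; these node ids and the edge are invented for illustration.
node_link = {
    "directed": False,
    "multigraph": False,
    "graph": {},
    "nodes": [
        {"id": "researcher_1", "kind": "entity"},
        {"id": "university_1", "kind": "entity"},
    ],
    "links": [
        {"source": "researcher_1", "target": "university_1", "rel": "affiliated"},
    ],
}

graph = nx.node_link_graph(node_link)
print(graph.number_of_nodes(), graph.number_of_edges())  # 2 1
```

In the tutorial you would pass `json.load(open("data/erkg.json"))` in place of the inline dictionary.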
</details>

## Part 3: Crawl and Parse Content
<details> <summary>Crawl a collection of documents as the <strong>unstructured content sources</strong>, chunking the text to create embeddings in a <em>vector store</em> using <code>LanceDB</code>, while also parsing the text to construct a <em>le
