datastew

GitHub Release

Datastew is a python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

Installation

pip install datastew

Usage

Harmonizing excel/csv resources

You can directly import common data models, terminology sources or data dictionaries for harmonization directly from a csv, tsv or excel file. An example how to match two separate variable descriptions is shown in datastew/scripts/mapping_excel_example.py:

from datastew.io.source import DataDictionarySource
from datastew.harmonization import map_dictionary_to_dictionary

# Variable and description refer to the corresponding column names in your excel sheet
source = DataDictionarySource("source.xlxs", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlxs", variable_field="var", description_field="desc")

df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlxs")

The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches as well as a similarity measure per row.

Per default this will use the local MPNet model, which may not yield the optimal performance. If you got an OpenAI API key it is possible to use their embedding API instead. To use your key, create a Vectorizer model and pass it to the function:

from datastew.embedding import Vectorizer
from datastew.harmonization import map_dictionary_to_dictionary

vectorizer = Vectorizer("text-embedding-ada-002", key="your_api_key")
df = map_dictionary_to_dictionary(source, target, vectorizer=vectorizer)

Creating and using stored mappings

Datastew uses a PostgreSQL backend with the pgvector extension to store and query embeddings. This allows for persistent terminology management and high-performance semantic search.

Initialize the repository and embedding model:

First, set up your database engine and ensure the schema is initialized. Datastew uses the psycopg (v3) driver for synchronous communication.

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from datastew.embedding import Vectorizer
from datastew.repository import PostgreSQLRepository

# 1. Define connection string (note the +psycopg driver)
DB_URL = "postgresql+psycopg://user:password@localhost:5432/testdb"
engine = create_engine(DB_URL)

# 2. Initialize the database schema and pgvector extension (Run once)
PostgreSQLRepository.setup_database(engine)

# 3. Create a session factory
SessionLocal = sessionmaker(bind=engine, autoflush=False)

Populate the Repository

Use a session context to add terminologies, concepts, and mappings. Note: You must call session.commit() to persist changes before they become searchable.

vectorizer = Vectorizer()

with SessionLocal() as session: # Inject the session into the repository
    repository = PostgreSQLRepository(session=session, vectorizer=vectorizer)

    # Add a terminology
    terminology = repository.add_terminology(name="snomed CT", short_name="SNOMED")

    # Create a concept
    text1 = "Diabetes mellitus (disorder)"
    concept1 = repository.add_concept(
        terminology_id=terminology.id,
        pref_label=text1,
        concept_identifier="SNOMED:11893007"
    )

    # Add a mapping (this generates the embedding and stores it)
    repository.add_mapping(concept_id=concept1.id, text=text1)

    # Persist the data
    session.commit()

Retrieve Closest Mappings

Query the database by generating an embedding for a new phrase and comparing it against stored records.

with SessionLocal() as session:
    repository = PostgreSQLRepository(session=session, vectorizer=vectorizer)

    query_text = "Sugar sickness"
    embedding = vectorizer.get_embedding(query_text)

    # Retrieve top 2 matches with similarity scores
    results = repository.get_closest_mappings(embedding, similarities=True, limit=2)

    for r in results:
        # Returns a MappingResult object containing the Mapping and a similarity float
        print(f"{r.mapping.concept.pref_label} | Similarity: {r.similarity:.4f}")

output:

snomed CT > Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder) | Similarity: 0.4735338091850281
snomed CT > Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder) | Similarity: 0.2003161907196045

You can also import data from file sources (csv, tsv, xlsx) or from a public API like OLS. An example script to download & compute embeddings for SNOMED from ebi OLS can be found in datastew/scripts/ols_snomed_retrieval.py.

Embedding visualization

You can visualize the embedding space of multiple data dictionary sources with t-SNE plots utilizing different language models. An example how to generate a t-sne plot is shown in datastew/scripts/tsne_visualization.py:

from datastew.embedding import Vectorizer
from datastew.io.source import DataDictionarySource
from datastew.visualisation import plot_embeddings

# Variable and description refer to the corresponding column names in your excel sheet
data_dictionary_source_1 = DataDictionarySource("source1.xlsx", variable_field="var", description_field="desc")
data_dictionary_source_2 = DataDictionarySource("source2.xlsx", variable_field="var", description_field="desc")

vectorizer = Vectorizer()
plot_embeddings([data_dictionary_source_1, data_dictionary_source_2], vectorizer=vectorizer)

t-SNE plot

Datastew

Install / Use

README