Datastew
Python library for intelligent data stewardship using Large Language Model (LLM) embeddings
Install / Use
/learn @SCAI-BIO/DatastewREADME
datastew
Datastew is a python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.
Installation
pip install datastew
Usage
Harmonizing excel/csv resources
You can directly import common data models, terminology sources or data dictionaries for harmonization directly from a csv, tsv or excel file. An example how to match two separate variable descriptions is shown in datastew/scripts/mapping_excel_example.py:
from datastew.io.source import DataDictionarySource
from datastew.harmonization import map_dictionary_to_dictionary
# Variable and description refer to the corresponding column names in your excel sheet
source = DataDictionarySource("source.xlxs", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlxs", variable_field="var", description_field="desc")
df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlxs")
The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches as well as a similarity measure per row.
Per default this will use the local MPNet model, which may not yield the optimal performance. If you got an OpenAI API key it is possible to use their embedding API instead. To use your key, create a Vectorizer model and pass it to the function:
from datastew.embedding import Vectorizer
from datastew.harmonization import map_dictionary_to_dictionary
vectorizer = Vectorizer("text-embedding-ada-002", key="your_api_key")
df = map_dictionary_to_dictionary(source, target, vectorizer=vectorizer)
Creating and using stored mappings
Datastew uses a PostgreSQL backend with the pgvector extension to store and query embeddings. This allows for persistent terminology management and high-performance semantic search.
-
Initialize the repository and embedding model:
First, set up your database engine and ensure the schema is initialized. Datastew uses the psycopg (v3) driver for synchronous communication.
from sqlalchemy import create_engine from sqlalchemy.orm import sessionmaker from datastew.embedding import Vectorizer from datastew.repository import PostgreSQLRepository # 1. Define connection string (note the +psycopg driver) DB_URL = "postgresql+psycopg://user:password@localhost:5432/testdb" engine = create_engine(DB_URL) # 2. Initialize the database schema and pgvector extension (Run once) PostgreSQLRepository.setup_database(engine) # 3. Create a session factory SessionLocal = sessionmaker(bind=engine, autoflush=False) -
Populate the Repository
Use a session context to add terminologies, concepts, and mappings. Note: You must call session.commit() to persist changes before they become searchable.
vectorizer = Vectorizer() with SessionLocal() as session: # Inject the session into the repository repository = PostgreSQLRepository(session=session, vectorizer=vectorizer) # Add a terminology terminology = repository.add_terminology(name="snomed CT", short_name="SNOMED") # Create a concept text1 = "Diabetes mellitus (disorder)" concept1 = repository.add_concept( terminology_id=terminology.id, pref_label=text1, concept_identifier="SNOMED:11893007" ) # Add a mapping (this generates the embedding and stores it) repository.add_mapping(concept_id=concept1.id, text=text1) # Persist the data session.commit() -
Retrieve Closest Mappings
Query the database by generating an embedding for a new phrase and comparing it against stored records.
with SessionLocal() as session: repository = PostgreSQLRepository(session=session, vectorizer=vectorizer) query_text = "Sugar sickness" embedding = vectorizer.get_embedding(query_text) # Retrieve top 2 matches with similarity scores results = repository.get_closest_mappings(embedding, similarities=True, limit=2) for r in results: # Returns a MappingResult object containing the Mapping and a similarity float print(f"{r.mapping.concept.pref_label} | Similarity: {r.similarity:.4f}")
output:
snomed CT > Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder) | Similarity: 0.4735338091850281
snomed CT > Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder) | Similarity: 0.2003161907196045
You can also import data from file sources (csv, tsv, xlsx) or from a public API like OLS. An example script to download & compute embeddings for SNOMED from ebi OLS can be found in datastew/scripts/ols_snomed_retrieval.py.
Embedding visualization
You can visualize the embedding space of multiple data dictionary sources with t-SNE plots utilizing different language models. An example how to generate a t-sne plot is shown in datastew/scripts/tsne_visualization.py:
from datastew.embedding import Vectorizer
from datastew.io.source import DataDictionarySource
from datastew.visualisation import plot_embeddings
# Variable and description refer to the corresponding column names in your excel sheet
data_dictionary_source_1 = DataDictionarySource("source1.xlsx", variable_field="var", description_field="desc")
data_dictionary_source_2 = DataDictionarySource("source2.xlsx", variable_field="var", description_field="desc")
vectorizer = Vectorizer()
plot_embeddings([data_dictionary_source_1, data_dictionary_source_2], vectorizer=vectorizer)

