OMOP2OBO: A Python Library for mapping OMOP standardized clinical terminologies to Open Biomedical Ontologies
================================================================================================================
|logo|
|pip| |downloads|
|github_action| |ABRA|
|sonar_quality| |code_climate_maintainability| |sonar_maintainability| |coveralls| |code_climate_coverage|
|
What is OMOP2OBO?
#################
omop2obo is a collection of health system-wide, disease-agnostic mappings between standardized clinical terminologies in the Observational Medical Outcomes Partnership (OMOP) common data model and several Open Biomedical Ontologies (OBOs) foundry ontologies.
Motivation
**********
Common data models have solved many challenges of utilizing electronic health records, but have not yet meaningfully integrated clinical and molecular data. Aligning clinical data to open biological ontologies (OBOs_), which provide semantically computable representations of biological knowledge, requires extensive manual curation and expertise.
Objective
*********
To address these limitations, we have developed OMOP2OBO, the first health system-wide integration and alignment between the Observational Health Data Sciences and Informatics' Observational Medical Outcomes Partnership (OMOP) standardized clinical terminologies and eight OBO biomedical ontologies spanning diseases, phenotypes, anatomical entities, cell types, organisms, chemicals, metabolites, hormones, vaccines, and proteins. To verify that the mappings are both clinically and biologically meaningful, we have performed extensive experiments to verify the `accuracy <https://github.com/callahantiff/OMOP2OBO/wiki/Accuracy>`__, `generalizability <https://github.com/callahantiff/OMOP2OBO/wiki/Generalizability>`__, and `logical consistency <https://github.com/callahantiff/OMOP2OBO/wiki/Consistency>`__ of each released mapping set.
📢 A manuscript preprint is available 👉 https://doi.org/10.48550/arXiv.2209.04732
What Does This Repository Provide?
####################################
Through this repository we provide the following:
- Mappings: A free set of ``omop2obo`` mappings that can be used out of the box (requires no coding) covering OMOP Conditions, Drug Exposures, and Measurements. These mappings are available in several formats including ``.txt``, ``.xlsx``, and ``.dump``. We also provide a semantic representation of the mappings, integrated with the OBO biomedical ontologies, available as an edge list (``.txt``) and as an ``.owl`` file. See the current release for more details.
- A Mapping Framework: An algorithm and mapping pipeline that enables one to construct their own set of ``omop2obo`` mappings. The figure below provides a high-level overview of the algorithm workflow. The code provided in this repository facilitates all of the automatic steps shown in this figure except for the manual mapping (for now, although we are `currently <https://github.com/callahantiff/OMOP2OBO/issues/19>`__ working on a deep learning model to address this).
How do I Learn More?
####################
- Join an existing or start a new Discussion_
- The Project Wiki_ for more details on the ``omop2obo`` mappings, algorithm, and information on the experiments we ran to ensure each mapping set released is accurate, generalizable, and consistent!
- A `Zenodo Community <https://zenodo.org/communities/omop2obo>`__ has been established to provide access to software releases, presentations, and preprints related to this project
|
Releases
########
- All code and mappings for each release are free to download, see the `Wiki <https://github.com/callahantiff/OMOP2OBO/wiki>`__
- Please see our `dashboard <http://tiffanycallahan.com/OMOP2OBO_Dashboard>`__ to get current stats on available mappings and for links to download them.
|dashboard1| |dashboard2|
Current Release:

- ``v1.0.0`` ➞ data and code can be directly downloaded `here <https://github.com/callahantiff/OMOP2OBO/wiki/V1.0>`__

  - Condition Occurrence Mappings: https://doi.org/10.5281/zenodo.6774363
  - Drug Exposure Ingredient Mappings: https://doi.org/10.5281/zenodo.6774401
  - Measurement Mappings: https://doi.org/10.5281/zenodo.6774443
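Once downloaded, the flat-file releases can be inspected with ``pandas``. The sketch below reads a tab-delimited table from an in-memory string so it runs anywhere; the file layout, column names, and values are hypothetical placeholders, not the actual release schema:

```python
import pandas as pd
from io import StringIO

# stand-in for a downloaded mapping file (tab-delimited); the real
# release files and their column names may differ
sample = StringIO(
    'CONCEPT_ID\tONTOLOGY_URI\tMAPPING_CATEGORY\n'
    '254761\thttp://purl.obolibrary.org/obo/HP_0012735\tAutomatic Exact\n'
)
mappings = pd.read_csv(sample, sep='\t')
print(mappings.shape)  # one mapping row, three columns
```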
|
Getting Started
###############
Install Library
***************
This program requires Python version 3.6. To install the library from `PyPI <https://pypi.org/project/omop2obo/>`__, run:
.. code:: shell

   pip install omop2obo
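Since the library targets Python 3.6, a quick interpreter check before installing can save a failed install; a minimal sketch:

```python
import sys

# omop2obo targets Python 3.6; fail fast on older interpreters
if sys.version_info < (3, 6):
    raise RuntimeError('omop2obo requires Python 3.6')
```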
|
You can also clone the repository directly from GitHub by running:
.. code:: shell

   git clone https://github.com/callahantiff/OMOP2OBO.git
|
Set-Up Environment
******************
The ``omop2obo`` library requires a specific project directory structure. Please make sure that your project directory includes the following sub-directories:
.. code:: shell

   OMOP2OBO/
       |
       |---- resources/
                |
                |---- clinical_data/
                |
                |---- mappings/
                |
                |---- ontologies/
Results will be output to the ``mappings`` directory.
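The layout above can also be created programmatically; a minimal sketch using only the Python standard library:

```python
from pathlib import Path

# create the sub-directories omop2obo expects under resources/
for sub in ('clinical_data', 'mappings', 'ontologies'):
    Path('resources', sub).mkdir(parents=True, exist_ok=True)
```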
|
Dependencies
^^^^^^^^^^^^
APPLICATIONS
- This software also relies on `OWLTools <https://github.com/owlcollab/owltools>`__. If cloning the repository, the ``owltools`` library file will automatically be included and placed in the correct directory.
- The National Library of Medicine's Unified Medical Language System (UMLS) `MRCONSO <https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html>`__ and `MRSTY <https://www.ncbi.nlm.nih.gov/books/NBK9685/table/ch03.Tf/>`__ files. Using these data requires a license agreement. Note that in order to get the ``MRSTY`` file you will need to download the UMLS Metathesaurus and run MetamorphoSys. Once both data sources are obtained, please place the files in the ``resources/mappings`` directory.
DATA
- Clinical Data: This repository assumes that the clinical data that need mapping have been placed in the ``resources/clinical_data`` directory. Each data source provided in this repository is assumed to have been extracted from the OMOP CDM. An example of what is expected for this input can be found `here <https://github.com/callahantiff/OMOP2OBO/tree/master/resources/clinical_data>`__.
- Ontology Data: Ontology data are automatically downloaded from the user-provided input file ``ontology_source_list.txt`` (`here <https://github.com/callahantiff/OMOP2OBO/blob/master/resources/ontology_source_list.txt>`__).
- Vocabulary Source Code Mapping: To increase the likelihood of capturing existing database cross-references, ``omop2obo`` provides a file that maps different clinical vocabulary source code prefixes between the UMLS, ontologies, and clinical EHR data (e.g. "SNOMED", "SNOMEDCT", "SNOMEDCT_US"): ``source_code_vocab_map.csv`` (`here <https://github.com/callahantiff/OMOP2OBO/blob/master/resources/mappings/source_code_vocab_map.csv>`__). Please note this file builds off of `these <https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html>`__ UMLS-provided abbreviation mappings. Currently, this file is updated for ontologies released July 2020, clinical data normalized to OMOP v5.0, and UMLS 2020AA.
- Semantic Mapping Representation: In order to create a semantic representation of the ``omop2obo`` mappings, an ontological specification for creating classes that span multiple ontologies is provided (``resources/mapping_semantics/omop2obo``). This document only needs to be altered if you plan to utilize the semantic mapping transformation algorithm and want to use a different knowledge representation. Please see the following `README <https://github.com/callahantiff/OMOP2OBO/tree/master/resources/mapping_semantics/README.md>`__ for additional details on these resources.
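To illustrate how such a vocabulary prefix map is applied, the sketch below normalizes source-code prefixes using a small in-memory table; the column names and prefix pairs are illustrative stand-ins, not the actual contents of ``source_code_vocab_map.csv``:

```python
import pandas as pd

# illustrative prefix map; the real file lives at
# resources/mappings/source_code_vocab_map.csv and may use different columns
vocab_map = pd.DataFrame({
    'source_prefix': ['SNOMED', 'SNOMEDCT', 'SNOMEDCT_US'],
    'normalized_prefix': ['SNOMEDCT_US', 'SNOMEDCT_US', 'SNOMEDCT_US']})
prefix_map = dict(zip(vocab_map['source_prefix'], vocab_map['normalized_prefix']))

# rewrite each code's prefix to its normalized form, leaving unknown prefixes as-is
codes = pd.Series(['SNOMED:22298006', 'SNOMEDCT:38341003'])
normalized = codes.str.split(':', n=1).map(
    lambda parts: prefix_map.get(parts[0], parts[0]) + ':' + parts[1])
```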
|
Running the omop2obo Library
****************************
There are a few ways to run omop2obo. An example workflow is provided below.
.. code:: python

   import glob
   import pandas as pd
   import pickle

   from datetime import date, datetime

   from omop2obo import ConceptAnnotator, OntologyDownloader, OntologyInfoExtractor, SimilarStringFinder

   # set some global variables
   outfile = 'resources/mappings/OMOP2OBO_MAPPED_'
   date_today = '_' + datetime.strftime(datetime.strptime(str(date.today()), '%Y-%m-%d'), '%d%b%Y').upper()

   # download ontologies
   ont = OntologyDownloader('resources/ontology_source_list.txt')
   ont.downloads_data_from_url()

   # process ontologies
   ont_explorer = OntologyInfoExtractor('resources/ontologies', ont.data_files)
   ont_explorer.ontology_processor()

   # create master dictionary of processed ontologies
   ont_explorer.ontology_loader()

   # read in ontology data
   with open('resources/ontologies/master_ontology_dictionary.pickle', 'rb') as handle:
       ont_data = pickle.load(handle)

   # process clinical data
   mapper = ConceptAnnotator(clinical_file='resources/clinical_data/omop2obo_conditions_june2020.csv',
                             ontology_dictionary={k: v for k, v in ont_data.items() if k in ['hp', 'mondo']},
                             merge=True,
                             primary_key='CONCEPT_ID',
                             concept_codes=tuple(['CONCEPT_SOURCE_CODE']),
                             concept_strings=tuple(['CONCEPT_LABEL', 'CONCEPT_SYNONYM']),
                             ancestor_codes=tuple(['ANCESTOR_SOURCE_CODE']),
                             ancestor_strings=tuple(['ANCESTOR_LABEL']),
                             umls_mrconso_file=glob.glob('resources/mappings/MRCONSO')[0] if len(glob.glob('resources/mappings/MRCONSO')) > 0 else None,
                             umls_mrsty_file=glob.glob('resources/mappings/MRSTY')[0] if len(glob.glob('resources/mappings/MRSTY')) > 0 else None)

   exact_mappings = mapper.clinical_concept_mapper()
   exact_mappings.to_csv(outfile + 'CONDITIONS' + date_today + '.csv', sep=',', index=False, header=True)

   # get column names -- used later to organize output
   start_cols = [i for i in exact_mappings.columns if not any(j for j in ['STR', 'DBXREF', 'EVIDENCE'] if j in i)]
   exact_cols = [i for i in exact_mappings.columns if i not in start_cols]
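The column bookkeeping at the end of the workflow can be seen on a toy frame; the column names below follow the STR/DBXREF/EVIDENCE naming convention used above but are otherwise hypothetical:

```python
import pandas as pd

# toy mapping output with identifier columns plus mapping-evidence columns
df = pd.DataFrame(columns=['CONCEPT_ID', 'CONCEPT_LABEL',
                           'HP_DBXREF', 'HP_STR', 'HP_EVIDENCE'])

# identifier columns: no STR/DBXREF/EVIDENCE marker in the name
start_cols = [i for i in df.columns
              if not any(j for j in ['STR', 'DBXREF', 'EVIDENCE'] if j in i)]
# mapping columns: everything else
exact_cols = [i for i in df.columns if i not in start_cols]

print(start_cols)  # ['CONCEPT_ID', 'CONCEPT_LABEL']
print(exact_cols)  # ['HP_DBXREF', 'HP_STR', 'HP_EVIDENCE']
```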