Clinspacy
Clinical Natural Language Processing using spaCy, scispacy, and medspacy
clinspacy
The goal of clinspacy is to perform biomedical named entity recognition, Unified Medical Language System (UMLS) concept mapping, and negation detection using the Python spaCy, scispacy, and medspacy packages.
Installation
You can install the CRAN version of clinspacy with:
install.packages('clinspacy')
You can install the GitHub version of clinspacy with:
remotes::install_github('ML4LHS/clinspacy', INSTALL_opts = '--no-multiarch')
How to load clinspacy
library(clinspacy)
Initiating clinspacy
Note: the very first time you run clinspacy_init() or clinspacy()
after installing the package, you may receive an error stating that
spaCy was unable to be imported because it was not found. Restarting
your R session should resolve the issue.
Initiating clinspacy is optional. If you do not initiate the package
using clinspacy_init(), it will be initiated automatically without the
UMLS linker. The UMLS linker takes up ~12 GB of RAM, so it is not loaded
by default; if you would like to use it, initiate clinspacy with the
linker. The linker can also be added later by reinitiating with the
use_linker argument set to TRUE.
clinspacy_init() # This is optional! The default behavior is to initiate clinspacy without the UMLS linker
Named entity recognition (without the UMLS linker)
The clinspacy() function can take a single string, a character vector,
or a data frame. It can output either a data frame or a file name.
A single character string as input
clinspacy('This patient has diabetes and CKD stage 3 but no HTN.')
#> | | | 0% | |==============================================================================================================| 100%
#> clinspacy_id entity lemma is_family is_historical is_hypothetical is_negated is_uncertain section_category
#> 1 1 patient patient FALSE FALSE FALSE FALSE FALSE <NA>
#> 2 1 diabetes diabetes FALSE FALSE FALSE FALSE FALSE <NA>
#> 3 1 CKD stage 3 ckd stage 3 FALSE FALSE FALSE FALSE FALSE <NA>
#> 4 1 HTN htn FALSE FALSE FALSE TRUE FALSE <NA>
clinspacy('HISTORY: He presents with chest pain. PMH: HTN. MEDICATIONS: This patient with diabetes is taking omeprazole, aspirin, and lisinopril 10 mg but is not taking albuterol anymore as his asthma has resolved. ALLERGIES: penicillin.', verbose = FALSE)
#> clinspacy_id entity lemma is_family is_historical is_hypothetical is_negated is_uncertain
#> 1 1 chest pain chest pain FALSE TRUE FALSE FALSE FALSE
#> 2 1 PMH PMH FALSE FALSE FALSE FALSE FALSE
#> 3 1 HTN htn FALSE FALSE FALSE FALSE FALSE
#> 4 1 patient patient FALSE FALSE FALSE FALSE FALSE
#> 5 1 diabetes diabetes FALSE FALSE FALSE FALSE FALSE
#> 6 1 omeprazole omeprazole FALSE FALSE FALSE FALSE FALSE
#> 7 1 aspirin aspirin FALSE FALSE FALSE FALSE FALSE
#> 8 1 lisinopril lisinopril FALSE FALSE FALSE FALSE FALSE
#> 9 1 albuterol albuterol FALSE FALSE FALSE TRUE FALSE
#> 10 1 asthma asthma FALSE FALSE FALSE TRUE FALSE
#> 11 1 penicillin penicillin FALSE FALSE FALSE FALSE FALSE
#> section_category
#> 1 <NA>
#> 2 past_medical_history
#> 3 past_medical_history
#> 4 medications
#> 5 medications
#> 6 medications
#> 7 medications
#> 8 medications
#> 9 medications
#> 10 medications
#> 11 allergies
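Because the result is an ordinary data frame, the context columns can be used directly as filters. The sketch below rebuilds a few rows and columns of the output above by hand, so it runs without spaCy installed, and keeps only the entities in the medications section that were not negated. The names ents, keep, and current_meds are illustrative and not part of clinspacy.

```r
# Hand-built copy of selected rows and columns from the output above (illustrative only).
ents <- data.frame(
  clinspacy_id = 1L,
  entity = c("chest pain", "HTN", "omeprazole", "aspirin",
             "lisinopril", "albuterol", "asthma", "penicillin"),
  is_negated = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
  section_category = c(NA, "past_medical_history", "medications", "medications",
                       "medications", "medications", "medications", "allergies"),
  stringsAsFactors = FALSE
)

# %in% treats an NA section label as FALSE, so rows without a section are dropped cleanly.
keep <- ents$section_category %in% "medications" & !ents$is_negated
current_meds <- ents[keep, "entity"]
print(current_meds)
```

Using %in% rather than == avoids NA rows in the result when section_category is missing.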
A character vector as input
clinspacy(c('This pt has CKD and HTN', 'Pt only has CKD but no HTN'),
          verbose = FALSE)
#> clinspacy_id entity lemma is_family is_historical is_hypothetical is_negated is_uncertain section_category
#> 1 1 CKD ckd FALSE FALSE FALSE FALSE FALSE <NA>
#> 2 1 HTN htn FALSE FALSE FALSE FALSE FALSE <NA>
#> 3 2 Pt pt FALSE FALSE FALSE FALSE FALSE <NA>
#> 4 2 CKD ckd FALSE FALSE FALSE FALSE FALSE <NA>
#> 5 2 HTN htn FALSE FALSE FALSE TRUE FALSE <NA>
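In the output above, clinspacy_id values 1 and 2 appear to correspond to the first and second elements of the input vector. Assuming that mapping holds, the original text can be reattached to each entity by indexing. This sketch rebuilds part of the output by hand so that it runs on its own; notes, ents, and htn_notes are illustrative names.

```r
notes <- c("This pt has CKD and HTN", "Pt only has CKD but no HTN")

# Hand-built copy of selected columns from the output above (illustrative only).
ents <- data.frame(
  clinspacy_id = c(1L, 1L, 2L, 2L, 2L),
  entity = c("CKD", "HTN", "Pt", "CKD", "HTN"),
  is_negated = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Assumption: clinspacy_id is the position of the note in the input vector.
ents$text <- notes[ents$clinspacy_id]

# Notes in which HTN is mentioned and not negated.
htn_notes <- unique(ents$text[ents$entity == "HTN" & !ents$is_negated])
print(htn_notes)
```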
A data frame as input
data.frame(text = c('This pt has CKD and HTN', 'Diabetes is present'),
           stringsAsFactors = FALSE) %>%
  clinspacy(df_col = 'text', verbose = FALSE)
#> clinspacy_id entity lemma is_family is_historical is_hypothetical is_negated is_uncertain section_category
#> 1 1 CKD ckd FALSE FALSE FALSE FALSE FALSE <NA>
#> 2 1 HTN htn FALSE FALSE FALSE FALSE FALSE <NA>
#> 3 2 Diabetes Diabetes FALSE FALSE FALSE FALSE FALSE <NA>
Saving the output to file
When the output is saved to a file, the resulting output_file can be
piped into bind_clinspacy() or bind_clinspacy_embeddings(). This saves a
lot of time because you can try different subsetting strategies in both
of these functions without needing to re-process the original data.
if (!dir.exists(rappdirs::user_data_dir('clinspacy'))) {
  dir.create(rappdirs::user_data_dir('clinspacy'), recursive = TRUE)
}
mtsamples = dataset_mtsamples()
mtsamples[1:5,]
#> note_id description medical_specialty
#> 1 1 A 23-year-old white female presents with complaint of allergies. Allergy / Immunology
#> 2 2 Consult for laparoscopic gastric bypass. Bariatrics
#> 3 3 Consult for laparoscopic gastric bypass. Bariatrics
#> 4 4 2-D M-Mode. Doppler. Cardiovascular / Pulmonary
#> 5 5 2-D Echocardiogram Cardiovascular / Pulmonary
#> sample_name
#> 1 Allergic Rhinitis
#> 2 Laparoscopic Gastric Bypass Consult - 2
#> 3 Laparoscopic Gastric Bypass Consult - 1
#> 4 2-D Echocardiogram - 1
#> 5 2-D Echocardiogram - 2
#>
