clinspacy

The goal of clinspacy is to perform biomedical named entity recognition, Unified Medical Language System (UMLS) concept mapping, and negation detection using the Python spaCy, scispacy, and medspacy packages.

Installation

You can install the CRAN version of clinspacy with:

install.packages('clinspacy')

You can install the GitHub version of clinspacy with:

remotes::install_github('ML4LHS/clinspacy', INSTALL_opts = '--no-multiarch')

How to load clinspacy

library(clinspacy)

Initiating clinspacy

Note: the very first time you run clinspacy_init() or clinspacy() after installing the package, you may receive an error stating that spaCy could not be imported because it was not found. Restarting your R session should resolve the issue.

Initiating clinspacy is optional. If you do not initiate the package using clinspacy_init(), it will be automatically initiated without the UMLS linker. The UMLS linker takes up ~12 GB of RAM, so it is not loaded by default. If you would like to use the linker, initiate clinspacy with the use_linker argument set to TRUE. The linker can also be added later by reinitiating with use_linker set to TRUE.

clinspacy_init() # This is optional! The default functionality is to initiate clinspacy without the UMLS linker
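
To load the linker from the start, or to add it to a session that was initiated without it, the call would look roughly like this. It uses only the use_linker argument named above; any other arguments are left at their defaults.

```r
library(clinspacy)

# Loads the UMLS linker as well; expect roughly 12 GB of RAM to be used.
clinspacy_init(use_linker = TRUE)
```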

Named entity recognition (without the UMLS linker)

The clinspacy() function can take a single string, a character vector, or a data frame. It can output either a data frame or a file name.

A single character string as input

clinspacy('This patient has diabetes and CKD stage 3 but no HTN.')
#>   clinspacy_id      entity       lemma is_family is_historical is_hypothetical is_negated is_uncertain section_category
#> 1            1     patient     patient     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>
#> 2            1    diabetes    diabetes     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>
#> 3            1 CKD stage 3 ckd stage 3     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>
#> 4            1         HTN         htn     FALSE         FALSE           FALSE       TRUE        FALSE             <NA>
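
Because the result is an ordinary data frame, the is_negated column can be used to keep only affirmed entities. The following is a minimal sketch in base R; the data frame is hand-copied from the output above (a subset of its columns) rather than produced by calling clinspacy(), and the names ents and affirmed are illustrative.

```r
# Hand-copied from the output above; not produced by calling clinspacy() here.
ents <- data.frame(
  clinspacy_id = c(1, 1, 1, 1),
  entity       = c("patient", "diabetes", "CKD stage 3", "HTN"),
  is_negated   = c(FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only the entities that were not negated ("no HTN" is dropped).
affirmed <- ents[!ents$is_negated, ]
affirmed$entity
```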

clinspacy('HISTORY: He presents with chest pain. PMH: HTN. MEDICATIONS: This patient with diabetes is taking omeprazole, aspirin, and lisinopril 10 mg but is not taking albuterol anymore as his asthma has resolved. ALLERGIES: penicillin.', verbose = FALSE)
#>    clinspacy_id     entity      lemma is_family is_historical is_hypothetical is_negated is_uncertain
#> 1             1 chest pain chest pain     FALSE          TRUE           FALSE      FALSE        FALSE
#> 2             1        PMH        PMH     FALSE         FALSE           FALSE      FALSE        FALSE
#> 3             1        HTN        htn     FALSE         FALSE           FALSE      FALSE        FALSE
#> 4             1    patient    patient     FALSE         FALSE           FALSE      FALSE        FALSE
#> 5             1   diabetes   diabetes     FALSE         FALSE           FALSE      FALSE        FALSE
#> 6             1 omeprazole omeprazole     FALSE         FALSE           FALSE      FALSE        FALSE
#> 7             1    aspirin    aspirin     FALSE         FALSE           FALSE      FALSE        FALSE
#> 8             1 lisinopril lisinopril     FALSE         FALSE           FALSE      FALSE        FALSE
#> 9             1  albuterol  albuterol     FALSE         FALSE           FALSE       TRUE        FALSE
#> 10            1     asthma     asthma     FALSE         FALSE           FALSE       TRUE        FALSE
#> 11            1 penicillin penicillin     FALSE         FALSE           FALSE      FALSE        FALSE
#>        section_category
#> 1                  <NA>
#> 2  past_medical_history
#> 3  past_medical_history
#> 4           medications
#> 5           medications
#> 6           medications
#> 7           medications
#> 8           medications
#> 9           medications
#> 10          medications
#> 11            allergies
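
The section_category column lets you group entities by the note section in which they were found. A minimal sketch in base R follows; the data frame is a hand-copied subset of the output above, and the names ents and by_section are illustrative.

```r
# Hand-copied subset of the output above (entity and section_category only).
ents <- data.frame(
  entity = c("chest pain", "HTN", "omeprazole",
             "aspirin", "lisinopril", "penicillin"),
  section_category = c(NA, "past_medical_history", "medications",
                       "medications", "medications", "allergies"),
  stringsAsFactors = FALSE
)

# Group entity names by detected section.
# split() drops rows whose section_category is NA.
by_section <- split(ents$entity, ents$section_category)
by_section$medications
```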

A character vector as input

clinspacy(c('This pt has CKD and HTN', 'Pt only has CKD but no HTN'),
          verbose = FALSE)
#>   clinspacy_id entity lemma is_family is_historical is_hypothetical is_negated is_uncertain section_category
#> 1            1    CKD   ckd     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>
#> 2            1    HTN   htn     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>
#> 3            2     Pt    pt     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>
#> 4            2    CKD   ckd     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>
#> 5            2    HTN   htn     FALSE         FALSE           FALSE       TRUE        FALSE             <NA>
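
In the output above, clinspacy_id appears to match the position of each string in the input vector. Under that assumption, entities can be traced back to their source text. This is a sketch in base R with the data frame hand-copied from the output above; the names notes and ents are illustrative.

```r
notes <- c("This pt has CKD and HTN", "Pt only has CKD but no HTN")

# Hand-copied from the output above (subset of columns).
ents <- data.frame(
  clinspacy_id = c(1, 1, 2, 2, 2),
  entity       = c("CKD", "HTN", "Pt", "CKD", "HTN"),
  is_negated   = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Assumption: clinspacy_id is the 1-based position of the string in the input.
ents$text <- notes[ents$clinspacy_id]

# Which notes mention HTN without negation?
htn_notes <- unique(ents$text[ents$entity == "HTN" & !ents$is_negated])
htn_notes
```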

A data frame as input

data.frame(text = c('This pt has CKD and HTN', 'Diabetes is present'),
           stringsAsFactors = FALSE) %>%
  clinspacy(df_col = 'text', verbose = FALSE)
#>   clinspacy_id   entity    lemma is_family is_historical is_hypothetical is_negated is_uncertain section_category
#> 1            1      CKD      ckd     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>
#> 2            1      HTN      htn     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>
#> 3            2 Diabetes Diabetes     FALSE         FALSE           FALSE      FALSE        FALSE             <NA>

Saving the output to file

When you supply the output_file argument, clinspacy() writes its results to that file and returns the file name, which can then be piped into bind_clinspacy() or bind_clinspacy_embeddings(). This saves a lot of time because you can try different subsetting strategies in both of these functions without re-processing the original data.

if (!dir.exists(rappdirs::user_data_dir('clinspacy'))) {
  dir.create(rappdirs::user_data_dir('clinspacy'), recursive = TRUE)
}

mtsamples <- dataset_mtsamples()

mtsamples[1:5,]
#>   note_id                                                      description          medical_specialty
#> 1       1 A 23-year-old white female presents with complaint of allergies.       Allergy / Immunology
#> 2       2                         Consult for laparoscopic gastric bypass.                 Bariatrics
#> 3       3                         Consult for laparoscopic gastric bypass.                 Bariatrics
#> 4       4                                             2-D M-Mode. Doppler. Cardiovascular / Pulmonary
#> 5       5                                               2-D Echocardiogram Cardiovascular / Pulmonary
#>                               sample_name
#> 1                       Allergic Rhinitis
#> 2 Laparoscopic Gastric Bypass Consult - 2
#> 3 Laparoscopic Gastric Bypass Consult - 1
#> 4                  2-D Echocardiogram - 1
#> 5                  2-D Echocardiogram - 2