linelist: Tagging and Validating Epidemiological Data <img src="man/figures/logo.svg" align="right" width="120" />

linelist provides a safe entry point to the Epiverse software ecosystem. It adds a foundational layer for tagging, validating, and safeguarding epidemiological data, which helps make data pipelines more straightforward and robust.

Installation

Stable version

Our stable versions are released on CRAN, and can be installed using:

install.packages("linelist")

Development version

The development version of linelist can be installed from GitHub with:

if (!require(pak)) {
  install.packages("pak")
}
pak::pak("epiverse-trace/linelist")
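After either installation route, a quick check can confirm that the package is available and show which version is in use. This is a minimal sketch that relies only on base R functions:

```r
# confirm that linelist is installed and report its version
if (requireNamespace("linelist", quietly = TRUE)) {
  print(packageVersion("linelist"))
} else {
  message("linelist is not installed")
}
```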


Usage

knitr::include_graphics("man/figures/linelist_infographics.png")

linelist works by tagging key epidemiological data in a data.frame or a tibble to facilitate and strengthen data pipelines. The resulting object is a linelist object, which extends data.frame (or tibble) by providing three types of features:

a tagging system to identify key data, enabling access to these data using their tags rather than actual names, which may change over time and across datasets
validation of the tagged variables (making sure they are present and of the right type/class)
safeguards against accidental losses of tagged variables in common data handling operations

The short example below illustrates these different features. See the Documentation section for more in-depth examples and details about linelist objects.

# load packages and a dataset for the example
# -------------------------------------------
library(linelist)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- outbreaks::mers_korea_2015$linelist
head(dataset)
#>     id age age_class sex        place_infect reporting_ctry
#> 1 SK_1  68     60-69   M         Middle East    South Korea
#> 2 SK_2  63     60-69   F Outside Middle East    South Korea
#> 3 SK_3  76     70-79   M Outside Middle East    South Korea
#> 4 SK_4  46     40-49   F Outside Middle East    South Korea
#> 5 SK_5  50     50-59   M Outside Middle East    South Korea
#> 6 SK_6  71     70-79   M Outside Middle East    South Korea
#>                                              loc_hosp   dt_onset  dt_report
#> 1 Pyeongtaek St. Mary, Hospital, Pyeongtaek, Gyeonggi 2015-05-11 2015-05-19
#> 2 Pyeongtaek St. Mary, Hospital, Pyeongtaek, Gyeonggi 2015-05-18 2015-05-20
#> 3 Pyeongtaek St. Mary, Hospital, Pyeongtaek, Gyeonggi 2015-05-20 2015-05-20
#> 4 Pyeongtaek St. Mary, Hospital, Pyeongtaek, Gyeonggi 2015-05-25 2015-05-26
#> 5                           365 Yeollin Clinic, Seoul 2015-05-25 2015-05-27
#> 6 Pyeongtaek St. Mary, Hospital, Pyeongtaek, Gyeonggi 2015-05-24 2015-05-28
#>   week_report dt_start_exp dt_end_exp    dt_diag outcome   dt_death
#> 1     2015_21   2015-04-18 2015-05-04 2015-05-20   Alive       <NA>
#> 2     2015_21   2015-05-15 2015-05-20 2015-05-20   Alive       <NA>
#> 3     2015_21   2015-05-16 2015-05-16 2015-05-21    Dead 2015-06-04
#> 4     2015_22   2015-05-16 2015-05-20 2015-05-26   Alive       <NA>
#> 5     2015_22   2015-05-17 2015-05-17 2015-05-26   Alive       <NA>
#> 6     2015_22   2015-05-15 2015-05-17 2015-05-28    Dead 2015-06-01

# check known tagged variables
# ----------------------------
tags_names()
#>  [1] "id"             "date_onset"     "date_reporting" "date_admission"
#>  [5] "date_discharge" "date_outcome"   "date_death"     "gender"        
#>  [9] "age"            "location"       "occupation"     "hcw"           
#> [13] "outcome"

# build a linelist
# ----------------
x <- dataset %>%
  tibble() %>%
  make_linelist(
    date_onset = "dt_onset", # date of onset
    date_reporting = "dt_report", # date of reporting
    occupation = "age" # mistake
  )
x
#> 
#> // linelist object
#> # A tibble: 162 × 15
#>    id      age age_class sex   place_infect   reporting_ctry loc_hosp dt_onset  
#>    <chr> <int> <chr>     <fct> <fct>          <fct>          <fct>    <date>    
#>  1 SK_1     68 60-69     M     Middle East    South Korea    Pyeongt… 2015-05-11
#>  2 SK_2     63 60-69     F     Outside Middl… South Korea    Pyeongt… 2015-05-18
#>  3 SK_3     76 70-79     M     Outside Middl… South Korea    Pyeongt… 2015-05-20
#>  4 SK_4     46 40-49     F     Outside Middl… South Korea    Pyeongt… 2015-05-25
#>  5 SK_5     50 50-59     M     Outside Middl… South Korea    365 Yeo… 2015-05-25
#>  6 SK_6     71 70-79     M     Outside Middl… South Korea    Pyeongt… 2015-05-24
#>  7 SK_7     28 20-29     F     Outside Middl… South Korea    Pyeongt… 2015-05-21
#>  8 SK_8     46 40-49     F     Outside Middl… South Korea    Seoul C… 2015-05-26
#>  9 SK_9     56 50-59     M     Outside Middl… South Korea    Pyeongt… NA        
#> 10 SK_10    44 40-49     M     Outside Middl… China          Pyeongt… 2015-05-21
#> # ℹ 152 more rows
#> # ℹ 7 more variables: dt_report <date>, week_report <fct>, dt_start_exp <date>,
#> #   dt_end_exp <date>, dt_diag <date>, outcome <fct>, dt_death <date>
#> 
#> // tags: date_onset:dt_onset, date_reporting:dt_report, occupation:age
tags(x) # check available tags
#> $date_onset
#> [1] "dt_onset"
#> 
#> $date_reporting
#> [1] "dt_report"
#> 
#> $occupation
#> [1] "age"

validate_linelist() will error if one of your tagged columns does not have the correct type:

# validation of tagged variables
# ------------------------------
## (this flags a likely mistake: occupation should not be an integer)
validate_linelist(x)
#> Error: Some tags have the wrong class:
#>   - occupation: Must inherit from class 'character'/'factor', but has class 'integer'
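In a non-interactive pipeline, it may be preferable to catch the validation error rather than let the script stop. The sketch below shows one way to do this with base R's tryCatch(). The dataset and its column names (case_id, onset, years) are invented for illustration and are not part of the package.

```r
library(linelist)

# toy data, invented for illustration only
cases <- data.frame(
  case_id = c("a", "b", "c"),
  onset = as.Date(c("2024-01-01", "2024-01-03", "2024-01-04")),
  years = c(31L, 45L, 27L)
)

# deliberately tag an integer column as occupation
bad <- make_linelist(
  cases,
  id = "case_id",
  date_onset = "onset",
  occupation = "years"
)

# catch the validation error and report it instead of stopping
result <- tryCatch(
  validate_linelist(bad),
  error = function(e) {
    message("Validation failed: ", conditionMessage(e))
    NULL
  }
)

# removing the faulty tag lets validation pass
good <- set_tags(bad, occupation = NULL)
checked <- validate_linelist(good)
```

Here result is NULL because validation failed, while checked holds the validated object.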

# change tags: fix mistakes, add new ones
# ---------------------------------------
x <- x %>%
  set_tags(
    occupation = NULL, # tag removal
    gender = "sex", # new tag
    outcome = "outcome"
  )

# safeguards against actions losing tags
# --------------------------------------
## attempting to remove geographical info but removing dates by mistake
x_no_geo <- x %>%
  select(-(5:8))
#> Warning: The following tags have lost their variable:
#>  date_onset:dt_onset
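Positional selections such as -(5:8) are easy to get wrong. One way to reduce that risk, sketched here on a small invented dataset, is to drop columns by name and then confirm with tags() that every tag is still attached. The column names (case_id, region, onset) are assumptions made for this example only.

```r
library(linelist)

# toy data, invented for illustration only
cases <- data.frame(
  case_id = c("a", "b", "c"),
  region = c("north", "south", "north"),
  onset = as.Date(c("2024-01-01", "2024-01-03", "2024-01-04"))
)
x_toy <- make_linelist(cases, id = "case_id", date_onset = "onset")

# drop the geographical column by name rather than by position
x_toy_no_geo <- x_toy[, setdiff(names(x_toy), "region")]

# both tags should still be listed
tags(x_toy_no_geo)
```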

For stronger pipelines, you can even trigger errors upon loss:

lost_tags_action("error")
#> Lost tags will now issue an error.

x_no_geo <- x %>%
  select(-(5:8))
#> Error: The following tags have lost their variable:
#>  date_onset:dt_onset

x_no_geo <- x %>%
  select(-(5:7))

## to revert to default behaviour (warning upon loss)
lost_tags_action()
#> Lost tags will now issue a warning.
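Because lost_tags_action() changes a session-wide setting, it can be convenient to switch to strict behaviour only for one step and then restore whatever was set before. The sketch below does this with get_lost_tags_action(). The helper with_strict_tags() is hypothetical (it is not part of linelist), and the toy data are invented for illustration.

```r
library(linelist)

# hypothetical helper, not part of linelist: evaluate `code` with lost tags
# treated as errors, then restore the previous setting on exit
with_strict_tags <- function(code) {
  previous <- get_lost_tags_action()
  on.exit(lost_tags_action(previous, quiet = TRUE), add = TRUE)
  lost_tags_action("error", quiet = TRUE)
  force(code)
}

# toy data, invented for illustration only
cases <- data.frame(
  case_id = c("a", "b"),
  onset = as.Date(c("2024-01-01", "2024-01-03"))
)
x_toy <- make_linelist(cases, date_onset = "onset")

before <- get_lost_tags_action()

# dropping the tagged column now fails instead of warning
res <- tryCatch(
  with_strict_tags(x_toy[, "case_id", drop = FALSE]),
  error = function(e) "tag lost"
)

# the earlier setting is back in place
after <- get_lost_tags_action()
```

The helper relies on R's lazy evaluation: the expression passed as code is only evaluated at force(code), after the strict setting has been applied.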

Alternatively, content can be accessed by tags:

x_no_geo %>%
  select(has_tag(c("date_onset", "outcome")))
#> Warning: The following tags have lost their variable:
#>  date_reporting:dt_report, gender:sex
#> 
#> // linelist object
#> # A tibble: 162 × 2
#>    dt_onset   outcome
#>    <date>     <fct>  
#>  1 2015-05-11 Alive  
#>  2 2015-05-18 Alive  
#>  3 2015-05-20 Dead   
#>  4 2015-05-25 Alive  
#>  5 2015-05-25 Alive  
#>  6 2015-05-24 Dead   
#>  7 2015-05-21 Alive  
#>  8 2015-05-26 Alive  
#>  9 NA         Alive  
#> 10 2015-05-21 Alive  
#> # ℹ 152 more rows
#> 
#> // tags: date_onset:dt_onset, outcome:outcome

x_no_geo %>%
  tags_df()
#> # A tibble: 162 × 4
#>    date_onset date_reporting gender outcome
#>    <date>     <date>         <fct>  <fct>  
#>  1 2015-05-11 2015-05-19     M      Alive  
#>  2 2015-05-18 2015-05-20     F      Alive  
#>  3 2015-05-20 2015-05-20     M      Dead   
#>  4 2015-05-25 2015-05-26     F      Alive  
#>  5 2015-05-25 2015-05-27     M      Alive  
#>  6 2015-05-24 2015-05-28     M      Dead   
#>  7 2015-05-21 2015-05-28     F      Alive  
#>  8 2015-05-26 2015-05-29     F      Alive  
#>  9 NA         2015-05-29     M      Alive  
#> 10 2015-05-21 2015-05-29     M      Alive  
#> # ℹ 152 more rows
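Since tags_df() returns columns named after the tags, downstream code can be written once against the tag names and reused on datasets with different column names. As a minimal sketch on invented data, the delay from onset to reporting could be computed like this:

```r
library(linelist)

# toy data, invented for illustration only
cases <- data.frame(
  case_id = c("a", "b", "c"),
  onset = as.Date(c("2024-01-01", "2024-01-03", "2024-01-04")),
  reported = as.Date(c("2024-01-04", "2024-01-04", "2024-01-09"))
)
x_toy <- make_linelist(
  cases,
  date_onset = "onset",
  date_reporting = "reported"
)

# columns are addressed by tag name, not by their original names
tagged <- tags_df(x_toy)
delay <- as.integer(tagged$date_reporting - tagged$date_onset)
delay
```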

linelist can also be connected to the incidence2 package for pipelines focused on aggregated count data:

library(incidence2)
#> Loading required package: grates

x_no_geo
