simulist: Simulate line list data <img src="man/figures/logo.svg" align="right" width="120" />
{simulist} is an R package to simulate individual-level infectious
disease outbreak data, including line lists and contact tracing data. It
can often be useful to have synthetic datasets like these available when
demonstrating outbreak analytics techniques or testing new analysis
methods.
{simulist} is developed at the Centre for the Mathematical Modelling of
Infectious Diseases at the London School of Hygiene and Tropical Medicine
as part of Epiverse-TRACE.
Key features
{simulist} allows you to simulate realistic line list and contact
tracing data, with:
:hourglass_flowing_sand: Parameterised epidemiological delay distributions <br>
:hospital: Population-wide or age-stratified hospitalisation and death risks <br>
:bar_chart: Uniform or age-structured populations <br>
:chart_with_upwards_trend: Constant or time-varying case fatality risk <br>
:clipboard: Customisable probability of case types and contact tracing follow-up <br>
Post-process simulated line list data for:
:date: Real-time outbreak snapshots with right-truncation <br>
:memo: Messy data with inconsistencies, mistakes and missing values <br>
:ledger: Dates censored to daily, epi- and iso-weekly, yearly and other groupings <br>
Installation
The package can be installed from CRAN using:
install.packages("simulist")
You can install the development version of {simulist} from
GitHub with:
# install {pak} if it is not already available
if (!requireNamespace("pak", quietly = TRUE)) install.packages("pak")
pak::pak("epiverse-trace/simulist")
Alternatively, install pre-compiled binaries from the Epiverse-TRACE R-universe:
install.packages("simulist", repos = c("https://epiverse-trace.r-universe.dev", "https://cloud.r-project.org"))
Quick start
library(simulist)
A line list can be simulated by calling sim_linelist(). The function
provides sensible defaults to quickly generate an epidemiologically valid
data set.
set.seed(1)
linelist <- sim_linelist()
head(linelist)
#> id case_name case_type sex age date_onset date_reporting
#> 1 1 James Manis suspected m 59 2023-01-01 2023-01-01
#> 2 2 Chen Moua confirmed m 90 2023-01-01 2023-01-01
#> 3 3 David Welter confirmed m 4 2023-01-02 2023-01-02
#> 4 5 Christopher Turner confirmed m 29 2023-01-04 2023-01-04
#> 5 6 Morgan Bohn suspected f 14 2023-01-05 2023-01-05
#> 6 7 Yutitham Corpuz probable m 85 2023-01-06 2023-01-06
#> date_admission outcome date_outcome date_first_contact date_last_contact
#> 1 2023-01-09 died 2023-01-13 <NA> <NA>
#> 2 <NA> recovered <NA> 2022-12-29 2023-01-03
#> 3 <NA> recovered <NA> 2022-12-28 2023-01-01
#> 4 <NA> recovered <NA> 2022-12-28 2023-01-04
#> 5 2023-01-09 died 2023-01-23 2022-12-31 2023-01-04
#> 6 2023-01-08 recovered <NA> 2022-12-31 2023-01-06
#> ct_value
#> 1 NA
#> 2 24.5
#> 3 24.8
#> 4 25.4
#> 5 NA
#> 6 NA
However, to simulate a more realistic line list we can use epidemiological
parameters previously estimated for an infectious disease outbreak. These
can be taken from the {epiparameter} R package if available; if they are
not yet in the {epiparameter} database (such as the contact distribution
for COVID-19), we can define them ourselves. Here we define a contact
distribution, an infectious period and an onset-to-hospitalisation delay,
and retrieve an onset-to-death delay from the {epiparameter} database.
library(epiparameter)
# create COVID-19 contact distribution
contact_distribution <- epiparameter::epiparameter(
  disease = "COVID-19",
  epi_name = "contact distribution",
  prob_distribution = create_prob_distribution(
    prob_distribution = "pois",
    prob_distribution_params = c(mean = 2)
  )
)
#> Citation cannot be created as author, year, journal or title is missing
# create COVID-19 infectious period
infectious_period <- epiparameter::epiparameter(
  disease = "COVID-19",
  epi_name = "infectious period",
  prob_distribution = create_prob_distribution(
    prob_distribution = "gamma",
    prob_distribution_params = c(shape = 1, scale = 1)
  )
)
#> Citation cannot be created as author, year, journal or title is missing
# create COVID-19 onset to hospital admission
onset_to_hosp <- epiparameter(
  disease = "COVID-19",
  epi_name = "onset to hospitalisation",
  prob_distribution = create_prob_distribution(
    prob_distribution = "lnorm",
    prob_distribution_params = c(meanlog = 1, sdlog = 0.5)
  )
)
#> Citation cannot be created as author, year, journal or title is missing
# get onset to death from {epiparameter} database
onset_to_death <- epiparameter::epiparameter_db(
  disease = "COVID-19",
  epi_name = "onset to death",
  single_epiparameter = TRUE
)
#> Using Linton N, Kobayashi T, Yang Y, Hayashi K, Akhmetzhanov A, Jung S, Yuan
#> B, Kinoshita R, Nishiura H (2020). "Incubation Period and Other
#> Epidemiological Characteristics of 2019 Novel Coronavirus Infections
#> with Right Truncation: A Statistical Analysis of Publicly Available
#> Case Data." _Journal of Clinical Medicine_. doi:10.3390/jcm9020538
#> <https://doi.org/10.3390/jcm9020538>..
#> To retrieve the citation use the 'get_citation' function
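As a quick sanity check on the values chosen above, the means implied by the three user-defined distributions can be computed with base R. This is only a sketch using the standard formulas for each distribution; it does not call {epiparameter}, and the variable names are illustrative. The results are in the time units of the simulation, which we assume here to be days.

``` r
# Means implied by the distribution parameters defined above.
# Base R only: standard textbook formulas, not {epiparameter} calls.

# Poisson contact distribution: the mean equals the rate parameter
mean_contacts <- 2

# Gamma infectious period: mean = shape * scale
mean_infectious_period <- 1 * 1

# Lognormal onset-to-hospitalisation delay:
# mean = exp(meanlog + sdlog^2 / 2)
meanlog <- 1
sdlog <- 0.5
mean_onset_to_hosp <- exp(meanlog + sdlog^2 / 2)

round(mean_onset_to_hosp, 2)
#> [1] 3.08
```

So, under these assumptions, each case has on average 2 contacts, is infectious for about 1 time unit, and (if hospitalised) is admitted about 3 time units after symptom onset.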
To simulate a line list for COVID-19 with a Poisson contact
distribution with a mean number of contacts of 2 and a probability of
infection per contact of 0.5, we use the sim_linelist() function. The
mean number of contacts and the probability of infection determine the
outbreak reproduction number. If the resulting reproduction number is
around one, we will likely get a reasonably sized outbreak (10 to
1,000 cases, varying due to the stochastic simulation).
Warning: the reproduction number of the simulation results from
the contact distribution (contact_distribution) and the probability of
infection (prob_infection); the number of infections is a binomial
sample of the number of contacts for each case with the probability of
infection (i.e. being sampled) given by prob_infection. If the average
number of secondary infections from each primary case is greater than 1
then this can lead to the outbreak becoming extremely large. There is
currently no depletion of susceptible individuals in the simulation
model, so the maximum outbreak size (second element of the vector
supplied to the outbreak_size argument) can be used to return a line
list early without producing an excessively large data set.
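The relationship described in the warning can be illustrated with a few lines of base R. This is a minimal sketch of the binomial sampling idea, not the internal code of sim_linelist(); the variable names and the number of hypothetical cases are assumptions made for illustration.

``` r
set.seed(1)

mean_contacts <- 2     # mean of the Poisson contact distribution
prob_infection <- 0.5  # probability that a contact becomes infected

# expected number of secondary infections per case
mean_contacts * prob_infection
#> [1] 1

# draw the number of contacts for many hypothetical cases
n_cases <- 1e5
contacts <- rpois(n_cases, lambda = mean_contacts)

# infections are a binomial sample of each case's contacts
infections <- rbinom(n_cases, size = contacts, prob = prob_infection)

# the empirical mean should be close to the expected value of 1
mean(infections)
```

With the same contact distribution, raising prob_infection to 0.75 gives an expected 1.5 secondary infections per case. That is the regime in which the outbreak can grow very large, and where the maximum in outbreak_size becomes important.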
set.seed(1)
linelist <- sim_linelist(
  contact_distribution = contact_distribution,
  infectious_period = infectious_period,
  prob_infection = 0.5,
  onset_to_hosp = onset_to_hosp,
  onset_to_death = onset_to_death
)
head(linelist)
#> id case_name case_type sex age date_onset date_reporting
#> 1 1 Jennifer Pritchett confirmed f 1 2023-01-01 2023-01-01
#> 2 2 Tyler Payson confirmed f 29 2023-01-01 2023-01-01
#> 3 3 Sean Wong confirmed m 78 2023-01-01 2023-01-01
#> 4 5 Bishr al-Safar confirmed m 70 2023-01-01 2023-01-01
#> 5 6 Francisco Montgomery probable m 28 2023-01-01 2023-01-01
#> 6 8 Jack Millard suspected m 61 2023-01-01 2023-01-01
#> date_admission outcome date_outcome date_first_contact date_last_contact
#> 1 2023-01-03 died 2023-01-18 <NA> <NA>
#> 2 2023-01-03 died 2023-02-09 2022-12-30 2023-01-08
#> 3 <NA> recovered <NA> 2022-12-31 2023-01-05
#> 4 2023-01-04 recovered <NA> 2022-12-31 2023-01-04
#> 5 2023-01-05 recovered <NA> 2022-12-29 2023-01-02
#> 6 <NA> recovered <NA> 2022-12-28 2023-01-05
#> ct_value
