GISAIDR

Programmatically interact with the GISAID EpiCoV, EpiPox, and EpiRSV databases.

[!TIP] Please consider moving your research focus to an open pathogen on pathoplexus.org.

Citation

If you use GISAIDR in your research please cite as:

Wytamma Wirth, & Sebastian Duchene. (2022). GISAIDR: Programmatically interact with the GISAID databases. Zenodo. https://doi.org/10.5281/zenodo.6474693

Installation

Install from GitHub using devtools.

install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("Wytamma/GISAIDR")
library(GISAIDR)

Login

Get username and password from GISAID.

username = Sys.getenv("GISAIDR_USERNAME")
password = Sys.getenv("GISAIDR_PASSWORD")

credentials <- login(username = username, password = password)
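An unset environment variable would otherwise only show up as a failed login, so it can help to check both values first. The following is a minimal sketch in base R; `read_gisaid_env()` is a hypothetical helper, not part of GISAIDR, and it assumes the two variable names used above.

```r
# Hypothetical helper (not part of GISAIDR): read the two environment
# variables used above and stop early if either is unset or empty.
read_gisaid_env <- function() {
  vars <- c(username = "GISAIDR_USERNAME", password = "GISAIDR_PASSWORD")
  values <- Sys.getenv(vars, unset = "")
  unset_vars <- vars[values == ""]
  if (length(unset_vars) > 0) {
    stop("Unset environment variable(s): ", paste(unset_vars, collapse = ", "))
  }
  # Return a plain named list: list(username = ..., password = ...)
  stats::setNames(as.list(unname(values)), names(vars))
}

# Possible usage with login():
# creds <- read_gisaid_env()
# credentials <- login(username = creds$username, password = creds$password)
```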

Select a database

The EpiCoV database is selected by default; however, GISAIDR also works with the EpiRSV and EpiPox databases (limited testing).

credentials <- login(username = username, password = password, database="EpiRSV")
# or
credentials <- login(username = username, password = password, database="EpiPox")

Note: You need a GISAID account with access to EpiRSV and EpiPox.

Get Data

Query the database with query() using your credentials.

df <- query(credentials = credentials)
head(df[1:6])  # first six columns (R indexes from 1)

| # | id | virus_name | passage_details_history | accession_id | collection_date | submission_date |
|-----|-----------------|-------------------------------|-------------------------|-----------------|-----------------|-----------------|
| 1 | EPI_ISL_1789201 | hCoV-19/USA/IL-S21WGS954/2021 | Original | EPI_ISL_1789201 | 2021-04-16 | 2021-04-29 |
| 2 | EPI_ISL_1789200 | hCoV-19/USA/IL-S21WGS885/2021 | Original | EPI_ISL_1789200 | 2021-04-02 | 2021-04-29 |
| 3 | EPI_ISL_1789199 | hCoV-19/USA/IL-S21WGS884/2021 | Original | EPI_ISL_1789199 | 2021-04-12 | 2021-04-29 |
| 4 | EPI_ISL_1789198 | hCoV-19/USA/IL-S21WGS883/2021 | Original | EPI_ISL_1789198 | 2021-04-14 | 2021-04-29 |
| 5 | EPI_ISL_1789197 | hCoV-19/USA/IL-S21WGS882/2021 | Original | EPI_ISL_1789197 | 2021-04-15 | 2021-04-29 |
| 6 | EPI_ISL_1789196 | hCoV-19/USA/IL-S21WGS881/2021 | Original | EPI_ISL_1789196 | 2021-04-13 | 2021-04-29 |

Pagination

Use nrows and start_index to page through results. GISAID limits the number of results returned by each request to 50. Internally, GISAIDR loops over batched requests when more than 50 rows are requested. See the fast option below.

df <- query(credentials = credentials, nrows = 1000, start_index = 100)
nrow(df)

[1] 1000
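To get a feel for how many requests a large nrows implies, the paging arithmetic can be sketched in base R. This only illustrates the 50-row limit described above and is not GISAIDR's internal code; `page_starts()` and its `page_size` argument are hypothetical names.

```r
# Illustration only: the start index of each request implied by a
# 50-row limit. The helper name and page_size are assumptions.
page_starts <- function(nrows, start_index = 0, page_size = 50) {
  n_pages <- ceiling(nrows / page_size)
  seq(from = start_index, by = page_size, length.out = n_pages)
}

starts <- page_starts(nrows = 1000, start_index = 100)
length(starts)  # 20 requests
range(starts)   # 100 1050
```

At 20 requests per 1000 rows, large result sets are slow to page through, which is the motivation for the fast option.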

Fast query

Use fast to load all of the accession_ids that match the query. These accession_ids can then be passed to the download function to download up to 5000 sequences at a time.

df <- query(
  credentials = credentials, 
  location = "Oceania", 
  from_subm = "2022-07-26", 
  to_subm = "2022-07-28",
  fast = TRUE
)
head(df$accession_id)

Selecting all 484 accession_ids.
Returning 0-484 of 484 accession_ids.
[1] "EPI_ISL_14061265" "EPI_ISL_14061266" "EPI_ISL_14061267" "EPI_ISL_14061268" "EPI_ISL_14061269" "EPI_ISL_14061270"
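When a fast query returns more than 5000 accession_ids, they need to be split into batches before downloading. A minimal sketch in base R follows; `batch_ids()` is a hypothetical helper (not part of GISAIDR) and the IDs are made up for illustration.

```r
# Hypothetical helper: split a vector of accession IDs into batches of at
# most batch_size (5000 is the per-download limit stated above).
batch_ids <- function(ids, batch_size = 5000) {
  unname(split(ids, ceiling(seq_along(ids) / batch_size)))
}

ids <- sprintf("EPI_ISL_%d", 1:12000)  # made-up IDs for illustration
batches <- batch_ids(ids)
lengths(batches)  # 5000 5000 2000

# Each batch could then be passed to download(), e.g.
# results <- lapply(batches, function(b) {
#   download(credentials = credentials, list_of_accession_ids = b)
# })
```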

Ordering

Use order_by to order the results of a query by a column. Use order_asc to change the direction of order_by (defaults to TRUE, i.e. ascending).

df <- query(credentials = credentials, order_by = 'submission_date')
df$submission_date

[1] "2020-01-10" "2020-01-10" "2020-01-11" "2020-01-11" "2020-01-11" "2020-01-12" "2020-01-14"
[8] "2020-01-14" "2020-01-14" "2020-01-14" "2020-01-16" "2020-01-17" "2020-01-17" ...

Full text search

Use text for full text search.

accession_ids = c("EPI_ISL_17398411", "EPI_ISL_17199001", "EPI_ISL_17409201", "EPI_ISL_17243716")
df <- query(credentials = credentials, text = paste(accession_ids, collapse = "\n"))
df$accession_id

[1] "EPI_ISL_17199001" "EPI_ISL_17243716" "EPI_ISL_17398411" "EPI_ISL_17409201"
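The result above comes back in a different order from the input, so comparing the two vectors by position is not reliable. One way to check that every requested ID was returned is a set difference, sketched here with the values from the example standing in for a live query result.

```r
requested <- c("EPI_ISL_17398411", "EPI_ISL_17199001",
               "EPI_ISL_17409201", "EPI_ISL_17243716")
# Stand-in for df$accession_id, copied from the output above.
returned <- c("EPI_ISL_17199001", "EPI_ISL_17243716",
              "EPI_ISL_17398411", "EPI_ISL_17409201")

# IDs that were asked for but did not come back.
missing_ids <- setdiff(requested, returned)
length(missing_ids) == 0  # TRUE when every requested ID was returned
```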

Search by location

Use location to search for entries based on geographic location.

df <- query(credentials = credentials, location = 'Australia')
df$location

[1] "Oceania / Australia / Western Australia" "Oceania / Australia / Queensland"
[3] "Oceania / Australia / Queensland" "Oceania / Australia / Queensland"
[5] "Oceania / Australia / Western Australia" ...

A list of GISAID locations (not complete) can be found in GISAID_LOCATIONS.txt. The location search is hierarchical: querying for 'Africa / ...' will return all the regions within Africa, while querying for 'Africa / Angola / ...' will only return the regions in Angola. A region can be further subdivided by specifying more levels, e.g. 'North America / USA / Alabama / Butler County'. The search uses pattern matching, so a query does not have to follow the hierarchical format above.
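If the hierarchy levels are held in separate variables, the location string can be assembled rather than typed by hand. This is a small sketch; `make_location()` is a hypothetical helper, and the " / " separator is taken from the location values shown in the output above.

```r
# Hypothetical helper: join hierarchy levels with " / ", the separator
# that appears in the location values shown above.
make_location <- function(...) {
  paste(c(...), collapse = " / ")
}

make_location("North America", "USA", "Alabama", "Butler County")
# "North America / USA / Alabama / Butler County"

# Possible usage:
# df <- query(credentials = credentials,
#             location = make_location("Oceania", "Australia"))
```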

Search by lineage (EpiCoV)

Use lineage to search for entries based on pango lineage designations.

df <- query(credentials = credentials, lineage = 'B.1.1.7')
full_df <- download(credentials = credentials, list_of_accession_ids = df$accession_id)  # see below for download() info.
full_df$pangolin_lineage

[1] "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7"
[11] "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7"
[21] ...

Search by Variant (EpiCoV)

Variants can be queried by name e.g. 'omicron', 'gh/490r', 'delta', 'alpha', 'beta', 'gamma', 'lambda', or 'mu'. Unfortunately, GISAID doesn't return the variant designation from the query or download, so variants must be confirmed with pangolin_lineage or GISAID_clade.

# VOC Omicron GRA (B.1.1.529+BA.*) first detected in Botswana/Hong Kong/South Africa
omicron_df <- query(credentials = credentials, variant = 'omicron')
omicron_full_df <- download(credentials = credentials, list_of_accession_ids = omicron_df$accession_id)
omicron_full_df$pangolin_lineage

[1] "BA.2" "BA.2" "BA.2.10.1" "BA.2" "BA.2" "BA.5" "BA.2" "BA.2" "BA.2.3"
[10] "BA.4" "BA.2" "BA.2" "BA.2" "BA.2" "BA.2" "BA.2" "BA.2" "BA.1.17"
[19] ...
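Confirming a variant from pangolin_lineage can be done with a pattern match. The sketch below assumes Omicron means B.1.1.529 or any BA.* lineage, as in the comment in the example above; later sublineages that use other aliases would need the pattern extended. The lineage values are copied from the output above and stand in for a live result.

```r
# Stand-in for omicron_full_df$pangolin_lineage, copied from the output above.
lineages <- c("BA.2", "BA.2", "BA.2.10.1", "BA.2", "BA.2", "BA.5",
              "BA.2", "BA.2", "BA.2.3", "BA.4", "BA.1.17")

# Assumption: Omicron is B.1.1.529 or any BA.* lineage. Other aliases
# are not covered by this pattern.
is_omicron <- grepl("^(B\\.1\\.1\\.529|BA\\.)", lineages)
all(is_omicron)  # TRUE if every entry matches the assumed pattern
table(lineages)  # counts per lineage
```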

Search by collection date

Use from and to to search for entries by collection date.

df <- query(credentials = credentials, from = '2021-04-05', to = '2021-04-06')
df$collection_date

[1] "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05"
[8] "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05"
[15] ...
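The returned dates are character strings, so a range check needs them converted first. A minimal sketch, assuming both bounds are inclusive (the source does not state this) and using made-up values in place of df$collection_date:

```r
# Stand-in for df$collection_date; in practice use the query result.
collection_dates <- c("2021-04-05", "2021-04-05", "2021-04-06")

from <- as.Date("2021-04-05")
to <- as.Date("2021-04-06")
dates <- as.Date(collection_dates, format = "%Y-%m-%d")

# TRUE for each date that parsed and lies inside the inclusive window.
in_window <- !is.na(dates) & dates >= from & dates <= to
all(in_window)
```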

Search by submission date

Use from_subm and to_subm to search for entries by submission date.

df <- query(credentials = credentials, from_subm = '2021-04-05', to_subm = '2021-04-05')
df$submission_date

[1] "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05"
[8] "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05" "2021-04-05"
[15] ...

Search by virus name

Use virus_name to search for entries using the virus name.

df <- query(credentials = credentials, virus_name="hCoV-19/Ireland/D-BHTEST/2022")
df$virus_name

[1] "hCoV-19/Ireland/D-BHTEST/2022"

Search for multiple virus names using a list.

virus_names <- list("hCoV-19/Ireland/KY-Enfer-230922004_A4/2022", "hCoV-19/Ireland/CO-Enfer-240922010_E9/2022")
df <- query(credentials = credentials, virus_name=virus_names)
df$virus_name

[1] "hCoV-19/Ireland/CO-Enfer-240922010_E9/2022" "hCoV-19/Ireland/KY-Enfer-230922004_A4/2022"

You can also match parts of the virus name e.g.

df <- query(credentials = credentials, virus_name="hCoV-19/Ireland")
df$virus_name

[1] "hCoV-19/Ireland/KY-Enfer-260922007_C6/2022" "hCoV-19/Ireland/KY-Enfer-260922007_C4/2022"
[3] "hCoV-19/Ireland/KY-Enfer-260922007_C2/2022" "hCoV-19/Ireland/KY-Enfer-260922007_C10/2022"
[5] "hCoV-19/Ireland/KY-Enfer-260922007_C1/2022" "hCoV-19/Ireland/CO-Enfer-260922007_B7/2022"...

Search by AA Substitutions and Nucleotide Mutations

Use aa_substitution and nucl_mutation to search for entries using amino acid substitutions and nucleotide mutations. Multiple terms are given as a single comma-separated string.

aa_substitution_df <- query(credentials = credentials, aa_substitution = 'Spike_E484Q, Spike_H69del, -N_P13L')
nucl_mutation_df <- query(credentials = credentials, nucl_mutation = '-T23599G, -C10029T')
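When the terms are kept in a character vector, the filter string can be built with paste(). This sketch reuses the terms from the example above verbatim, including the leading "-", and does not assume anything about how GISAID interprets them.

```r
# Assemble the comma-separated filter string from a character vector.
# Terms are copied from the example above, including the leading "-".
aa_terms <- c("Spike_E484Q", "Spike_H69del", "-N_P13L")
aa_filter <- paste(aa_terms, collapse = ", ")
aa_filter

# Possible usage:
# aa_substitution_df <- query(credentials = credentials,
#                             aa_substitution = aa_filter)
```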

Exclude low coverage entries

Use low_coverage_excl to exclude low coverage entries from the results.

df <- query(credentials = credentials, low_coverage_excl = TRUE)
grep("Long stretches of NNNs", df$information)

integer(0)

Include only complete entries

GISAID considers genomes >29,000 nt as complete. Use complete to include only complete entries in the results.

df <- query(credentials = credentials, complete = TRUE)
all(df$length > 29000)

[1] TRUE

Include only high coverage entries

GISAID considers genomes with <1% Ns and <0.05% unique amino acid mutations as high coverage. Use high_coverage to include only high coverage entries in the results.

df <- query(credentials = credentials, high_coverage = TRUE)
length(grep("warn_sign", df$information)) == 0

[1] TRUE

Include only entries with complete collection date

Use collection_date_complete to include only entries with complete collection date.
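A result can be checked locally for complete dates with a pattern match. The sketch below assumes a complete date has the form YYYY-MM-DD, as in the outputs above; the partial dates are made up to show what an incomplete value might look like. The commented query() call assumes collection_date_complete takes TRUE like the other filters, which the source does not show.

```r
# Stand-in for df$collection_date; the partial dates are made up to
# illustrate what an incomplete date might look like.
collection_dates <- c("2021-04-05", "2021-04", "2021")

# Assumption: a complete date has the form YYYY-MM-DD.
is_complete <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", collection_dates)
is_complete

# Possible usage (assumed to be a logical flag like the other filters):
# df <- query(credentials = credentials, collection_date_complete = TRUE)
```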
