getNCBImetadata

Retrieves NCBI metadata for nucleotide or BioSample accession IDs.

Requirements
Installation
Quick start
Output
License

Requirements

Linux, macOS, or Windows with Windows Subsystem for Linux (WSL) installed
Bash shell, which is the default shell on many Linux distributions (and on macOS versions before 10.15 Catalina)
Python 2.7 or Python 3
EDirect (NCBI Entrez Direct command-line utilities)
A computer with internet access via the HTTPS protocol - required for retrieving data from NCBI

Installation

git clone https://github.com/AlexOrlek/getNCBImetadata.git
cd getNCBImetadata

After cloning, the getmetadata.py executable script is located in the repository directory. If you add this directory to your $PATH variable, you can run the script as getmetadata.py [arguments...] from any location. The edirect directory must also be on your $PATH.
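The following is a minimal sketch (Python 3) for checking that the $PATH setup described above worked. The tool list is an assumption: getmetadata.py comes from this repository, and esearch and efetch are standard EDirect commands, but the exact set of EDirect commands that getmetadata.py calls is not stated here.

```python
import shutil

# Assumed tool names: getmetadata.py is the script from this repository;
# esearch and efetch are standard EDirect commands (which ones the script
# actually calls is an assumption).
REQUIRED_TOOLS = ["getmetadata.py", "esearch", "efetch"]


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If anything is reported as missing, re-check the directories added to $PATH and open a new shell before retrying.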

Quick start

The -a flag gives the path to accessions.txt, a text file whose first column contains NCBI (nucleotide or BioSample) accession IDs. The -t flag specifies whether those accessions are nucleotide or biosample accessions. The -e flag should be your own email address; this is passed to NCBI so that they can monitor usage.
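As an illustration of the expected input, the sketch below writes a small accessions file and reads back its first column. The tab delimiter is an assumption (the delimiter is not specified above), the helper name read_accessions is hypothetical, and the two accessions are example values only.

```python
import os
import tempfile


def read_accessions(path):
    """Return the first-column value of every non-blank line in a file."""
    accessions = []
    with open(path) as handle:
        for line in handle:
            # Assumption: columns are tab-separated; only column 1 is used.
            first = line.rstrip("\n").split("\t")[0].strip()
            if first:
                accessions.append(first)
    return accessions


if __name__ == "__main__":
    # Example content: one accession per line, optional extra columns.
    example = "NC_000913.3\tsome note\nU00096.3\n"
    path = os.path.join(tempfile.mkdtemp(), "accessions.txt")
    with open(path, "w") as handle:
        handle.write(example)
    print(read_accessions(path))
```

Running a check like this before calling getmetadata.py can catch blank lines or stray whitespace in the accession column.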

Nucleotide metadata can be retrieved by running the following code:

getmetadata.py -a accessions.txt -t nucleotide -o outdir -e first.last@company.com

Either RefSeq or GenBank nucleotide accessions can be provided, in either "accession" or "accession.version" format.
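To make the two accession formats concrete, here is a small sketch that splits an optional version suffix and distinguishes RefSeq from GenBank accessions by the underscore that follows the two-character RefSeq prefix. The function names are hypothetical and this is not necessarily how getmetadata.py handles accessions internally.

```python
def split_accession(accession):
    """Split 'accession.version' into (accession, version).

    The version is returned as an int, or None when no version is given.
    """
    accession = accession.strip()
    base, dot, version = accession.partition(".")
    if dot and version.isdigit():
        return base, int(version)
    return accession, None


def is_refseq(accession):
    """RefSeq accessions have a two-character prefix followed by '_'."""
    accession = accession.strip()
    return len(accession) > 2 and accession[2] == "_"


if __name__ == "__main__":
    for acc in ["NC_000913.3", "NC_000913", "U00096.3"]:
        print(acc, split_accession(acc), is_refseq(acc))
```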

BioSample metadata can be retrieved by running the following code:

getmetadata.py -a accessions.txt -t biosample -o outdir -e first.last@company.com --biosampleattributes attributes.txt

The --biosampleattributes flag is optional. It specifies the path to a file whose first column contains harmonized attribute names. A full list of BioSample attribute harmonized names is provided here. The specified attributes are retrieved in addition to the fields retrieved by default (see Output for details).
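The two invocations above can also be assembled programmatically. The sketch below builds the argument list using only the flags shown in this README; build_command is a hypothetical helper, and rejecting --biosampleattributes for nucleotide runs is a choice made in this sketch rather than documented behaviour of getmetadata.py.

```python
def build_command(accessions, accession_type, outdir, email, attributes=None):
    """Build the getmetadata.py argument list from the documented flags."""
    if accession_type not in ("nucleotide", "biosample"):
        raise ValueError("accession_type must be 'nucleotide' or 'biosample'")
    command = [
        "getmetadata.py",
        "-a", accessions,
        "-t", accession_type,
        "-o", outdir,
        "-e", email,
    ]
    if attributes is not None:
        # Sketch-level choice: the attributes file only makes sense for
        # biosample runs, so reject it otherwise.
        if accession_type != "biosample":
            raise ValueError("attributes can only be used with biosample")
        command += ["--biosampleattributes", attributes]
    return command


if __name__ == "__main__":
    command = build_command(
        "accessions.txt", "biosample", "outdir",
        "first.last@company.com", attributes="attributes.txt",
    )
    # To actually run it: subprocess.run(command, check=True)
    print(" ".join(command))
```

Here attributes.txt would hold one harmonized name per line, for example collection_date, geo_loc_name, host, and isolation_source.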

Output

Nucleotide metadata

When nucleotide accessions are provided, the following fields are extracted:

AccessionVersion
Dates of first submission and last update: Create Date, Update Date
Molecular characteristics: Molecule Type (e.g. dna), Length, Completeness, Source Genome Type (e.g. plasmid)
Taxonomy data: Source Taxon, Source Taxonomic ID
Genome assembly data: Assembly Method, Genome Coverage, Sequencing Technology
Genome annotation data: Annotation Pipeline, Annotation Method
DBLink data: Bioproject Accession, Biosample Accession, Sequence Read Archive Accession, Assembly Accession
PubMedID

BioSample metadata

When biosample accessions are provided, the following fields are extracted:

Identifiers: Accession, Accession ID, Sample name
Submission data: Model, Package
Dates: last_update, publication_date, submission_date
Title
Comment
Taxonomic data: taxonomy_id, taxonomy_name, OrganismName
Affiliation data: Owner/Name, email, Contact/Name/First, Contact/Name/Last
Attribute data: retrieved only if a file containing harmonized attribute names is supplied via the --biosampleattributes flag.

License

MIT License
