
GetNCBImetadata

Retrieves NCBI metadata from nucleotide or biosample accession ids.

Requirements

  • Linux, macOS, or Windows with Windows Subsystem for Linux (WSL) installed
  • Bash shell, which is the default shell on many Linux distributions
  • Python 2.7 or Python 3
  • Edirect
  • A computer with internet access via the HTTPS protocol, which is required for retrieving data from NCBI

Installation

git clone https://github.com/AlexOrlek/getNCBImetadata.git
cd getNCBImetadata

The getmetadata.py executable script is located in the repository directory. If you add this directory to your $PATH variable, the script can be run from any directory by calling getmetadata.py [arguments...]. The edirect directory must also be on your $PATH.
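
As a quick check that the $PATH setup described above worked, the following sketch reports which executables cannot be found. It is illustrative only: esearch and efetch are standard Edirect commands, but which Edirect commands getmetadata.py actually calls is an assumption, and missing_tools is a hypothetical helper that is not part of this repository.

```python
import shutil

# Executables expected on $PATH. "esearch" and "efetch" are standard Edirect
# commands; whether getmetadata.py uses exactly these is an assumption.
REQUIRED_TOOLS = ["getmetadata.py", "esearch", "efetch"]


def missing_tools(names):
    """Return the names that cannot be found on the current $PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All required executables were found on $PATH.")
```

If anything is reported as missing, re-check the directories added to $PATH before running the commands in the next section.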

Quick start

  • The -a flag takes the path to accessions.txt, a text file whose first column contains NCBI (nucleotide or biosample) accession ids.
  • The -t flag specifies whether the accessions in accessions.txt are nucleotide or biosample accessions.
  • The -e flag should be your own email address; this is provided to NCBI so that they can monitor usage.

Nucleotide metadata can be retrieved by running the following command:

getmetadata.py -a accessions.txt -t nucleotide -o outdir -e first.last@company.com

Either RefSeq or GenBank nucleotide accessions can be provided, in either "accession" or "accession.version" format.
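
The sketch below shows one way to prepare the accessions file and assemble the nucleotide command shown above. It is a minimal illustration, not part of the tool: the helper names are hypothetical, the example accessions (NC_000913.3 and U00096) are used purely for illustration, and a plain one-accession-per-line layout is assumed to satisfy the "first column" requirement.

```python
def split_accession(accession):
    """Split 'accession.version' into (accession, version).

    The version is None when the input is in plain 'accession' format.
    """
    base, dot, version = accession.strip().partition(".")
    return (base, version if dot else None)


def accessions_file_text(accessions):
    """Return file content with one accession per line (first column only)."""
    return "".join(acc.strip() + "\n" for acc in accessions)


def nucleotide_command(accessions_path, outdir, email):
    """Build the argument list for the nucleotide command shown above."""
    return ["getmetadata.py", "-a", accessions_path, "-t", "nucleotide",
            "-o", outdir, "-e", email]


if __name__ == "__main__":
    # Illustrative accessions in both accepted formats.
    examples = ["NC_000913.3", "U00096"]
    for acc in examples:
        print(acc, "->", split_accession(acc))
    print(accessions_file_text(examples), end="")
    print(" ".join(nucleotide_command("accessions.txt", "outdir",
                                      "first.last@company.com")))
```

The returned text can be written to accessions.txt, and the argument list can be passed to subprocess.run once getmetadata.py is on your $PATH.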

BioSample metadata can be retrieved by running the following command:

getmetadata.py -a accessions.txt -t biosample -o outdir -e first.last@company.com --biosampleattributes attributes.txt

The --biosampleattributes flag is optional. It specifies the path to a file containing harmonized attribute names in the first column. A full list of BioSample attribute harmonized names is available from NCBI. The specified attributes are retrieved in addition to the fields retrieved by default (see Output for details).
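
As an illustration of the optional attributes file, the sketch below builds its content and the matching biosample command. The helper names are hypothetical, the one-name-per-line layout is an assumption about what "first column" requires, and the four attribute names are simply examples of NCBI harmonized names.

```python
def attributes_file_text(names):
    """Return one harmonized attribute name per line.

    Blank entries and duplicates are dropped; the original order is kept.
    """
    seen = []
    for name in names:
        name = name.strip()
        if name and name not in seen:
            seen.append(name)
    return "".join(name + "\n" for name in seen)


def biosample_command(accessions_path, outdir, email, attributes_path=None):
    """Build the biosample command; the attributes file is optional."""
    cmd = ["getmetadata.py", "-a", accessions_path, "-t", "biosample",
           "-o", outdir, "-e", email]
    if attributes_path is not None:
        cmd += ["--biosampleattributes", attributes_path]
    return cmd


if __name__ == "__main__":
    # Example harmonized attribute names, chosen for illustration only.
    wanted = ["collection_date", "geo_loc_name", "host", "isolation_source"]
    print(attributes_file_text(wanted), end="")
    print(" ".join(biosample_command("accessions.txt", "outdir",
                                     "first.last@company.com",
                                     "attributes.txt")))
```

Omitting attributes_path produces the command without --biosampleattributes, in which case only the default fields are retrieved.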

Output

Nucleotide metadata

When nucleotide accessions are provided, the following fields are extracted:

  • AccessionVersion
  • Dates of first submission and last update: Create Date, Update Date
  • Molecular characteristics: Molecule Type (e.g. dna), Length, Completeness, Source Genome Type (e.g. plasmid)
  • Taxonomy data: Source Taxon, Source Taxonomic ID
  • Genome assembly data: Assembly Method, Genome Coverage, Sequencing Technology
  • Genome annotation data: Annotation Pipeline, Annotation Method
  • DBLink data: Bioproject Accession, Biosample Accession, Sequence Read Archive Accession, Assembly Accession
  • PubMedID

Biosample metadata

When biosample accessions are provided, the following fields are extracted:

  • Identifiers: Accession, Accession ID, Sample name
  • Submission data: Model, Package
  • Dates: last_update, publication_date, submission_date
  • Title
  • Comment
  • Taxonomic data: taxonomy_id, taxonomy_name, OrganismName
  • Affiliation data: Owner/Name, email, Contact/Name/First, Contact/Name/Last
  • Attribute data will be retrieved if a file containing harmonized attribute names is provided to the --biosampleattributes flag.

License

MIT License