MultiSub

Prepares a SARS-CoV-2 submission for GISAID, NCBI or ENA. Can read GISAID or NCBI files, or plain fasta+tsv/csv/xls. Finds files in input directory and merges everything into a single output directory. Auto-detects input file formats. Can submit the results to multiple repositories from the command line.

multiSub

multiSub is a command-line tool to prepare and/or submit SARS-CoV-2 genome sequences to the NCBI Genbank, EBI ENA and GISAID sequence repositories. It can also convert between these formats. It can be used by a "data broker", a single institution that collects sequences from labs and submits them to the sequence databases, or by an individual lab. It is research software, but several groups use it. When you first run it on your own data rather than the example data, you may find bugs or problems. If that happens, please open a ticket on GitHub or email maxh@ucsc.edu; most of them should be easy to fix.

Overview

multiSub accepts input sequences in fasta format and meta data in tsv, csv or GISAID (xls or csv) formats. It will make some effort to clean the input data, e.g. skip missing sequences, strip flanking Ns, or remove empty meta data and output warnings if that happens. It can then create one or multiple output files, in NCBI, NCBI-tag, NCBI-ftp, ENA-xml or GISAID-csv format and directly upload to NCBI, ENA or GISAID.
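The cleaning step described above can be pictured with a small sketch. This is not multiSub's own code: the function name `clean_sequences` and the warning texts are made up for illustration, and the sketch assumes the sequences are already loaded into a dictionary.

```python
def clean_sequences(seqs):
    """Strip flanking N/n from each sequence and skip sequences that end up empty.

    seqs: dict mapping sequence ID -> nucleotide string.
    Returns (cleaned dict, list of warning strings).
    Hypothetical helper, not part of multiSub.
    """
    cleaned = {}
    warnings = []
    for seq_id, seq in seqs.items():
        # str.strip removes the listed characters from both ends only
        trimmed = seq.strip("Nn")
        if not trimmed:
            warnings.append("skipping %s: no sequence left" % seq_id)
            continue
        if len(trimmed) != len(seq):
            removed = len(seq) - len(trimmed)
            warnings.append("%s: stripped %d flanking Ns" % (seq_id, removed))
        cleaned[seq_id] = trimmed
    return cleaned, warnings
```

Ns inside a sequence are left alone, since only the flanking ones are stripped.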

The script takes care of the different ways to format the virus names (for example, hCoV-19 for GISAID, SARS-CoV-2 for NCBI), translates the different ways to specify the country, checks the date format and adds sequence IDs where needed. It does not support more than the date, isolate and country fields, but other fields can easily be added: just email examples to maxh@ucsc.edu.
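As a rough illustration of the date check and the name formatting, here is a sketch under stated assumptions. The accepted date formats (YYYY-MM-DD, YYYY-MM, YYYY) and the name patterns (`hCoV-19/country/isolate/year` and `SARS-CoV-2/human/country/isolate/year`) are assumptions based on the repositories' usual conventions; multiSub's actual rules may differ. Both function names are hypothetical.

```python
from datetime import datetime


def check_date_format(text):
    """Return the strptime format that matches text, or raise ValueError.

    Assumption: full dates, year-month and year-only dates are acceptable.
    """
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            datetime.strptime(text, fmt)
            return fmt
        except ValueError:
            continue
    raise ValueError("unrecognized date format: %r" % text)


def virus_names(country, isolate, date):
    """Build repository-specific virus names (assumed patterns, see lead-in)."""
    check_date_format(date)
    year = date[:4]  # all accepted formats start with the four-digit year
    return {
        "gisaid": "hCoV-19/%s/%s/%s" % (country, isolate, year),
        "ncbi": "SARS-CoV-2/human/%s/%s/%s" % (country, isolate, year),
    }
```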

There is really only a single table and a single fasta file needed. The different export steps will pick out of the meta data table what they need. E.g. the field "Genome Coverage" will be exported by the NCBI Genbank step into a "structured-comment" field "Genome Coverage", and will also end up in the ENA field "coverage" and GISAID's "covv_coverage". The meta table should either use NCBI field names or be a GISAID file. Note that there is an order to the steps: you first need to upload to NCBI Biosamples to obtain Biosamples accessions, then you re-convert to add these IDs to the files, then you can upload the new files to Genbank or SRA, with the Biosamples accessions in them. The examples below should make this clear.
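The idea of one meta table feeding several exports can be sketched as a lookup table. Only the "Genome Coverage" entry is taken from the text above; the table name `FIELD_TARGETS` and the function `export_row` are hypothetical, and multiSub's internal data structures may look quite different.

```python
# Input field name -> output field name per target repository.
# Only the "Genome Coverage" row comes from the README text.
FIELD_TARGETS = {
    "Genome Coverage": {
        "ncbi": "Genome Coverage",   # NCBI structured comment
        "ena": "coverage",
        "gisaid": "covv_coverage",
    },
}


def export_row(row, target):
    """Pick the fields a given export step needs and rename them for that target."""
    out = {}
    for name, value in row.items():
        targets = FIELD_TARGETS.get(name)
        if targets and target in targets:
            out[targets[target]] = value
    return out
```

For example, `export_row({"Genome Coverage": "100x"}, "gisaid")` yields a dictionary with the single key `covv_coverage`.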

Many thanks to Stephan Fuchs and Kyanoush Yahosseini, Robert Koch Institut, Berlin, for sending me their Python ENA uploader code, from which I copied. Also thanks to the ENA Helpdesk and the NCBI Helpdesk for their quick replies. Also to Kelsey Florek and Ethan Wang for bug reports. The NCBI bulk upload draws heavily from examples provided by Danny Park at the Broad Institute. Without all of these people, this program would not have been possible.

Installation and requirements

The script usually has no software requirements (see below). Just download it:

wget https://raw.githubusercontent.com/maximilianh/multiSub/stable/multiSub

or:

curl -O https://raw.githubusercontent.com/maximilianh/multiSub/stable/multiSub

Make it executable (not on Windows):

chmod a+x multiSub

And run it:

./multiSub --help

Or run it on Windows (install Python first from https://www.microsoft.com/en-us/p/python-38/9mssztt1n39l):

python multiSub

The script was tested on Python 2.7 and 3.6. If you do not plan to read GISAID xls files, you do not need to install anything else.

If you want to read GISAID xls files, the script needs the xlrd Python package. You can install xlrd with "pip install xlrd" or, if you are not administrator, with "pip install xlrd --user". If you use Mac OSX and do not have pip installed yet, run the command "curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py"

Using Microsoft Windows? Little command line experience? Please contact me at maxh@ucsc.edu. The script runs on Windows if you install Python from the Microsoft Store. I can also provide a normal Windows .exe version. A drag-and-drop file box could also be added for Windows GUI use.

Input

The first input is a fasta file with multiple sequences, where each sequence has a unique ID.

The second input file is the usual comma- or tab-separated table where the first column contains the sequence identifier, and the other columns contain the sequence annotations, sometimes called meta data, or "source tags" by NCBI. The first row contains the field names, sometimes called "headers". This file can be in NCBI, ENA or GISAID format. For tsv and csv input, the required meta field names are "date" and "isolate". For GISAID input, only the fields "covv_location", "covv_collection_date" and "covv_virus_name" are used at the moment. GISAID files can be in .xls or .csv format.
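Before running a conversion, it can help to check that the two inputs agree. The sketch below is not multiSub's code; it only encodes the rules stated above (unique fasta IDs, sequence identifier in the first column, required fields "date" and "isolate" for tsv/csv input). The function names are hypothetical.

```python
import csv
import io


def read_fasta_ids(fasta_text):
    """Return the sequence IDs (first word after '>') in file order."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def check_inputs(fasta_text, meta_text, delimiter=","):
    """Return a list of problems found; an empty list means the inputs look consistent."""
    problems = []
    ids = read_fasta_ids(fasta_text)
    dupes = sorted({i for i in ids if ids.count(i) > 1})
    if dupes:
        problems.append("duplicate fasta IDs: " + ", ".join(dupes))

    reader = csv.DictReader(io.StringIO(meta_text), delimiter=delimiter)
    rows = list(reader)
    fields = reader.fieldnames or []
    for required in ("date", "isolate"):
        if required not in fields:
            problems.append("missing required field: " + required)

    # The first column holds the sequence identifier
    first_col = fields[0] if fields else None
    meta_ids = {row[first_col] for row in rows} if first_col else set()
    missing = [i for i in ids if i not in meta_ids]
    if missing:
        problems.append("no meta data for: " + ", ".join(missing))
    return problems
```

Pass `delimiter="\t"` for tab-separated tables.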

Output file formats

The basic files seqs.fa and meta.tsv will always be created.

By default, files in all possible output formats are created. If you only want to create a subset, use the -f option and list the formats that you need:

"ncbi" - for manual Genbank upload, as a single fasta file with integrated tags: genbank.seqAndSource.fa. Upload it manually at https://submit.ncbi.nlm.nih.gov/sarscov2/
"ncbi-ftp" - for automated Genbank upload: genbankFtp.zip + submission.xml. See below for details.
"gisaid" - for GISAID batch or API upload in .csv format: gisaid.csv and gisaid.fa
"ena" - for ENA automated API sample uploads in XML format: ena.xml

Configuration

Run

multiSub init # 'python multiSub init' on Windows

Then edit the file ~/.multiSub/config with your favorite text editor. You can set the name of your institute, your email address, your country, etc. To try the examples below, you do not even need to run the 'init' step yet, because the script comes with default values.

Example: convert files

The absolutely minimal example:

printf '>CA-UCSC-123\nNNNNACTGT' > seq.fa
printf 'isolate,date\nCA-UCSC-123,2021-03-03' > meta.csv
./multiSub conv seq.fa meta.csv mini/

Or on Windows:

curl -O https://raw.githubusercontent.com/maximilianh/multiSub/main/tests/mini/seqs.fa
curl -O https://raw.githubusercontent.com/maximilianh/multiSub/main/tests/mini/meta.csv
python multiSub conv seqs.fa meta.csv mini/

The conv command takes a fasta file, a meta data table and an output directory, in that order. In the bigger example below, it converts the sequences from mySeqs.fa with annotations in mySeqs.tsv (fields: seqId, date, isolate) to the directory mySub/. It fixes up the isolate names to conform to the INSDC and GISAID formats, trims flanking N nucleotides from the sequences and adds the location name ("USA" by default, but you can change this in the configuration file). It creates files for NCBI Genbank and Biosamples, ENA Analysis/Biosample and GISAID submission, all at the same time.
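On systems without printf, the two minimal input files from the example above can also be written with a few lines of Python. This is only a convenience sketch; the function name `write_mini_example` is made up, and unlike the printf commands it ends each file with a newline.

```python
import os


def write_mini_example(directory="."):
    """Write the minimal seq.fa and meta.csv from the example; return both paths."""
    os.makedirs(directory, exist_ok=True)  # exist_ok requires Python 3
    fa_path = os.path.join(directory, "seq.fa")
    meta_path = os.path.join(directory, "meta.csv")
    with open(fa_path, "w") as fh:
        fh.write(">CA-UCSC-123\nNNNNACTGT\n")
    with open(meta_path, "w") as fh:
        fh.write("isolate,date\nCA-UCSC-123,2021-03-03\n")
    return fa_path, meta_path
```

After calling `write_mini_example()`, the conv command from the minimal example can be run on seq.fa and meta.csv as shown.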

Here is a bigger example:

mkdir my
curl https://raw.githubusercontent.com/maximilianh/multiSub/main/tests/ucsc1/mySeqs.fa -o my/mySeqs.fa
curl https://raw.githubusercontent.com/maximilianh/multiSub/main/tests/ucsc1/mySeqs.tsv -o my/mySeqs.tsv

./multiSub conv my/mySeqs.fa my/mySeqs.tsv mySub

A full copy of the output files is here: https://genome-test.gi.ucsc.edu/~max/multiSub/out/ucsc1/

Read all sequences and all annotation files (csv, tsv, xls) from mySeqs/ and write files for NCBI and GISAID into mySub/:

./multiSub convDir mySeqs mySub -f ncbi,gisaid
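If the conversion is part of a larger Python pipeline, the convDir call can be wrapped as in the sketch below. It only uses the subcommand and the -f option shown above; the function names and the `script` default are assumptions, and the script path has to be adjusted to wherever multiSub was downloaded.

```python
import subprocess
import sys


def build_convdir_command(in_dir, out_dir, formats=None, script="multiSub"):
    """Build the argument list for 'multiSub convDir'.

    The script is started through the current Python interpreter, which
    mirrors the 'python multiSub ...' form used on Windows.
    """
    cmd = [sys.executable, script, "convDir", in_dir, out_dir]
    if formats:
        # -f takes a comma-separated list, e.g. ncbi,gisaid
        cmd += ["-f", ",".join(formats)]
    return cmd


def run_convdir(in_dir, out_dir, formats=None, script="multiSub"):
    """Run the conversion; raises CalledProcessError if multiSub exits with an error."""
    subprocess.run(build_convdir_command(in_dir, out_dir, formats, script), check=True)
```

Calling `run_convdir("mySeqs", "mySub", ["ncbi", "gisaid"])` corresponds to the command line above.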

Possible meta annotation table field names (columns) and what happens to their data

There are four types of possible field names for the meta data:

the minimal ones: "isolate", "date" (alias "collection_date") and "location" (alias "country")
the following GISAID field names: covv_collection_date, covv_virus_name, covv_location, covv_assembly_method, covv_coverage, covv_seq_technology. Additional fields that are accepted in the input file and converted to GISAID fields are: addHost, gender, age, specimen, status, outbreak, last_vaccinated, treatment, coverage, sampleId, sub_lab_sample_id
NCBI source tags, listed here: https://www.ncbi.nlm.nih.gov/WebSub/html/help/genbank-source-table.html, most importantly isolate, collection_date and country (misnomer, it usually includes region and town)
the NCBI Structured comment field names: "Assembly Name", "Assembly Method", "Genome Coverage", "Sequencing Technology", see: https://www.ncbi.nlm.nih.gov/genbank/structuredcomment

For ENA, all fields are written to the Biosample. A few special names from the input file are converted to ENA required fields, which are otherwise set to empty or default values: 'age' -> 'host sex', 'status' -> 'host health state', 'subject' -> 'host subject id'. All others are passed through to the ENA Biosample.

You can also rename any of your own input fields with other names but similar content to the NCBI names using the name mapping table in the configuration statement "metaFieldMap". See the sample config file config.sample in this repository.
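Conceptually, such a name mapping is a lookup from your own column names to the NCBI names. The sketch below shows the idea in Python only; it does not reproduce the syntax of the "metaFieldMap" statement (see config.sample for that), and the input names "sampling_date" and "strain" are invented examples.

```python
def rename_fields(row, field_map):
    """Return a copy of row with keys renamed via field_map; other keys are kept."""
    return {field_map.get(name, name): value for name, value in row.items()}


# Hypothetical mapping from local column names to NCBI-style names
EXAMPLE_FIELD_MAP = {
    "sampling_date": "collection_date",
    "strain": "isolate",
}
```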

Submission of the sample attributes as a manual NCBI Biosample upload

NCBI Biosample is a database that connects reads and assemblies and contains a couple of key-value entries, like sequencing coverage, the instrument or even the vaccination status. The list of all possible keys for SARS-CoV-2 is here: https://www.ncbi.nlm.nih.gov/biosample/docs/packages/SARS-CoV-2.cl.1.0/ A Biosample accession is only required if you want to upload the raw FASTQ reads later to the SRA.

For interactive uploads, go to https://submit.ncbi.nlm.nih.gov, create a new Biosamples submission and when prompted, upload the biosample.tsv file. Just click the "Continue" buttons until you are done. You should then receive the table with the accessions by email. Download this table into the output directory as biosampleAccs.tsv. The next "conv" run will then add these accessions to the various files as cross-references.
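What "adding the accessions as cross-references" amounts to can be sketched as a join between the meta data and the accession table. Everything specific here is an assumption: the column names "sample_name" and "accession", the output key "biosample_accession" and the function name are hypothetical, and multiSub performs this step itself during the next conv run.

```python
def add_biosample_accessions(meta_rows, acc_rows, key="isolate"):
    """Return copies of meta_rows with a 'biosample_accession' entry where one is known.

    meta_rows: list of dicts from the meta table.
    acc_rows: list of dicts from the accession table; the column names
              'sample_name' and 'accession' are assumptions.
    """
    acc_by_name = {row["sample_name"]: row["accession"] for row in acc_rows}
    merged = []
    for row in meta_rows:
        new_row = dict(row)  # leave the input untouched
        accession = acc_by_name.get(row.get(key))
        if accession:
            new_row["biosample_accession"] = accession
        merged.append(new_row)
    return merged
```

Rows without a matching accession are returned unchanged, so a partial accession table does not drop any samples.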

Submission of the consensus assembly as a manual NCBI Genbank upload

This is the most important database for public health purposes. Go to https://submit.ncbi.nlm.nih.gov/subs/genbank/, create a new submission and when prompted, upload the ncbiSeqsAndSource.fa file. The multiSub fasta file contains both the sequences and the source t
