Bcdatabaser
A pipeline to create reference databases for arbitrary markers and taxonomic groups from NCBI data
Install / Use
/learn @molbiodiv/BcdatabaserREADME
BCdatabaser
Introduction
The Reference DB Creator (BCdatabaser) is a pipeline to create reference databases for arbitrary markers and taxonomic groups from NCBI data.
It can optionally be used to trim and orient the sequences and train taxonomic classifiers.
Please cite our publication in addition to the created dataset if you use this pipeline. Also consider citing the tools used in this pipeline as outlined in CITATION.
See the online docu for more details.
Web Interface
A user friendly web interface with limited options is available at https://bcdatabaser.molecular.eco
Overview
Input
Required:
- Search term (marker)
Optional
- Taxonomic scope (default: root)
- Taxonlist
- Length range
- Degenerate primers (trimming/orientation)
- HMMs (trimming/orientation) - not yet implemented
Output
Recommended naming scheme: <Marker>.<Taxonomic Group>.<country code>.<date>
Basic:
- Sequence fasta (filtered/trimmed/sintax style taxonomy header)
Additional:
- Report (e.g. Taxonomy table (Krona))
Steps
- Check dependencies
- Check inputs
- Download sequences
- optional: trim/orient via primers (or HMM in the future)
- Add taxonomy
- Write reports
Helper scripts
- geographicRange2taxonList
Installation
This pipeline consists of a perl script that depends on a number of libraries. It also utilizes external programs that need to be available. See the next section for details. As it can be difficult to download and setup all the dependencies we provide a ready to use docker container. We recommend using this container (it should also work with singularity). However, if you want or need a native installation please see the websites of the modules and tools for installation instructions for your platform and refer to the steps in the Dockerfile. If you encounter any problems please open an issue and we will do our best to help.
Docker
Pull the container from DockerHub and you are ready to go
docker pull iimog/bcdatabaser
The default command of this container is the bcdatabaser.pl script so you can execute this command.
docker run --rm iimog/bcdatabaser --help
In order to use data from your local directory inside the container and operate as your current user (instead of root):
docker run -u $UID:$GID -v $PWD:/data --rm iimog/bcdatabaser # arguments
You're all set, skip to the Examples section to get started.
The NCBI taxonomy databases in the docker image might be outdated. Build the image manually to get up to date versions:
git clone https://github.com/molbiodiv/bcdatabaser
cd bcdatabaser
docker build -t bcdatabaser .
Dependencies
Perl Modules:
- Pod::Usage
- FindBin
- Log::Log4perl
- Getopt::Long
- Getopt::ArgvFile
- File::Path
- NCBI::Taxonomy
External Programs:
Examples
These examples assume that you are using the docker installation. For Singularity or local installations the commands have to be adjusted accordingly.
Simple example (ITS2 sequences for the genus Bellis):
docker run -u $UID:$GID -v $PWD:/data --rm iimog/bcdatabaser\
--outdir its2.bellis.full.2018-11-12\
--marker-search-string "(ITS2 OR internal transcribed spacer 2)"\
--taxonomic-range Bellis\
--sequence-length-filter 100:2000
This created a folder its2.bellis.full.2018-11-12 in your current working directory with the following content:
- sequences.tax.fa: the sequences with sinax-style taxonomy information in the header
- taxonomy.krona.html: Krona chart of the taxa in the resulting database
- list.txt: list of accessions and taxids downloaded
- sequences.fa: the raw sequence download
- bcdatabaser.log: log file with all the messages printed to stdout
- CITATION: info on how to cite this pipeline
Advance example (ITS2 sequences for a custom species list with trimming using dispr):
docker run -u $UID:$GID -v $PWD:/data --rm iimog/bcdatabaser\
--outdir its2.viridiplantae.custom.2018-11-12\
--marker-search-string "(ITS2 OR internal transcribed spacer 2)"\
--taxonomic-range Viridiplantae\
--taxa-list /metabDB/examples/some_plants.txt\
--sequence-length-filter 100:20000\
--primer-file /metabDB/data/primers/its2.sickel2015.fa
This creates a folder in your current working directory with the following additional files (compaded to the previous example):
- sequences.dispr.fa: the sequences that were successfully trimmed and oriented using dispr and the provided primer pairs
- sequences.combined.fa: the trimmed/oriented sequences from above or the raw ones if the primer pair was not found on that sequence
Command Line Reference
Usage:
$ bcdatabaser.pl [@configfile] --marker-search-string="<SEARCHSTRING>" [options]
Options:
[@configfile] Optional path to a configfile with @ as prefix.
Config files consist of command line parameters
and arguments just as passed on the command
line. Space and comment lines are allowed (and
ignored). Spreading over multiple lines is
supported.
--marker-search-string <SEARCHSTRING>
Search string for the marker, passed literally
to the NCBI search
[--outdir <STRING>] output directory for the generated output files
(default: bcdatabaser)
[--taxonomic-range <SCINAME>]
Scientific (or common) name of the taxon all
sequences have to belong to. This will be added
to the query string as "AND <SCINAME>[ORGN]".
Default: empty (no restriction) Example:
--taxonomic-range Viridiplantae
[--taxa-list <FILENAME>] File with list of scientific (or common) names
to include (one per line). Attention: All
children of higher taxonomic ranks will be
included. This will be added to the query
string as "AND (<SCINAME1>[ORGN] OR
<SCINAME2>[ORGN] ...)". Default: empty (no
restriction) Example: --taxa-list
plants_in_germany.txt
[--check-tax-names] If this option is set all taxon names provided
by the user (--taxonomic-range and entries in
the --taxa-list file) are checked against the
names.dmp file from NCBI The program stops with
an error if any species name is not listed in
names.dmp. If --warn-failed-tax-names is set
the program does not stop but prints a warning.
Set the path to the names.dmp file via
--names-dmp-path Default: false
[--warn-failed-tax-names]
Only used if --check-tax-names is set. Instead
of dying when a tax name in the --tax-list is
unknown these taxa are removed from the search
string and a warning is printed. This means
that your search string does not contain all
taxa in your file so it is important to check
the warning to see which ones were removed.
Default: false
[--names-dmp-path <PATH>]
Path to the NCBI taxonomy names.dmp file to
check taxonomic names. Only relevant if
--check-tax-names is set. The default value is
suitable for use with the docker container. It
is the same file used by NCBI::Taxonomy to
assign the tax strings. Default:
/NCBI-Taxonomy/names.dmp Example:
--names-dmp-path $HOME/ncbi/names.dmp
[--sequence-length-filter <SLEN>]
Sequence length filter for search at NCBI
(single number or colon separated range). This
is only applied to the search initial search
via "AND <SLEN>[SLEN]". Empty string means no
restriction. Default: empty (no restriction)
Example: --sequence-length-filter 100:2000
[--sequences-per-taxon <INTEGER>]
Number of sequences to download for each
distinct NCBI taxid. If there are more
sequences for a taxid the longest ones are
kept. Default: 9 Example: --sequences-per-taxon
1
[--edirect-dir <STRING>] directory containing the entrez direct
utilities (default: empty, look for programs in
PATH) More info about edirect:
https://www.ncbi.nlm.nih.gov/books/NBK179288/
[--edirect-batch-size <INT>]
Related Skills
notion
341.2kNotion API for creating and managing pages, databases, and blocks.
feishu-drive
341.2k|
things-mac
341.2kManage Things 3 via the `things` CLI on macOS (add/update projects+todos via URL scheme; read/search/list from the local Things database)
clawhub
341.2kUse the ClawHub CLI to search, install, update, and publish agent skills from clawhub.com
