MicrobeDB
Provides a local database of in-house and published genomes for Bacteria and Archaea from NCBI
ABOUT
- MicrobeDB provides centralized local storage and access to completed archaeal and bacterial genomes.
- MicrobeDB has three main features:
- All "flat" files associated with each genome are downloaded from NCBI (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi) and stored locally in a directory of your choosing.
- For each genome, information about the organism, the chromosomes within the organism, and the genes within each chromosome is parsed and stored in a MySQL database, including sequences and annotations.
- A Perl API is provided to interface with the MySQL database and allow easy use of the data.
- By default, all RefSeq genomes are downloaded.
  - Incomplete/draft genomes can also be obtained.
  - A subset of genomes from a particular genus can be obtained instead, based on a search term.
  - Unpublished/in-house genomes can be added easily.
- A presentation providing an overview of MicrobeDB is at information/MicrobeDB_Overview.pdf
REQUIREMENTS
- MySQL
- Perl
- Perl modules (available from CPAN):
  - BioPerl
  - DBI
  - DBD::mysql
  - Parallel::ForkManager
  - Log::Log4perl
  - Sys::CPU
- ~20 GB of hard drive space with the default settings. Downloading only a subset of genomes needs much less; downloading incomplete genomes or additional file types needs much more.
Installing the MicrobeDB software
- Download MicrobeDB using a Git client or by clicking the "Download ZIP" button on the top left of the webpage.
- For installation information see MicrobeDB/information/INSTALL/INSTALL.md.
Updating the MicrobeDB software
- To update your MicrobeDB software see MicrobeDB/information/UPDATE/UPDATE.md.
Testing the MicrobeDB installation
- To test whether MicrobeDB is installed properly, we will download a single genome and load it into the MySQL database.
- All programs are run from the command line and are located in the "scripts" directory.
cd MicrobeDB/scripts
- Run MicrobeDB, specifying a single genome (note that this will create a dated directory in your home directory):
./download_load_and_delete_old_version.pl -d ~/ -s '-s Pseudomonas_aeruginosa_LES'
- The previous command should report what it is doing without any errors.
- You can test the MicrobeDB API and the ability to get information out of the database by running the example scripts:
cd MicrobeDB/information/example_scripts
./get_genome_sizes_of_aquatic_genomes.pl
./search_for_pathogen_recA_genes.pl
- Running the example scripts should print information to the terminal without any errors.
- To remove this test "version", run the command below and answer the interactive prompts:
MicrobeDB/scripts/delete_version.pl
- Great! MicrobeDB is installed correctly, and you can now proceed with downloading a complete version of genomes (see next section).
Downloading a new "version" of genomes with MicrobeDB
- To use MicrobeDB you will have to download and load a new version of genome files.
- All programs must be run from the command line and are located in the "scripts" directory. Open a console and change into the scripts directory:
cd MicrobeDB/scripts
- By default, MicrobeDB will download, unpack, parse, and load all completed RefSeq genomes. You can do this with the following command (replace the directory after the -d option with the directory where you want the genome files to be stored):
./download_load_and_delete_old_version.pl -d /your_path/microbedb_genome_storage
- download_load_and_delete_old_version.pl is basically a wrapper script that runs 3 other scripts. It is convenient because it does everything for you and can easily be set to run on a regular basis (monthly, bimonthly, etc.) as a "cron" job or with other scheduling software.
- You can also run the 3 scripts manually, which gives you more control and may be useful if there are any errors in the update.
  - download_version.pl
  - unpack_version.pl
  - load_version.pl
- You can use the -h option (e.g. ./download_version.pl -h) or 'perldoc download_version.pl' to get help for any of the scripts.
- For example, if you want to download incomplete genomes as well, you can specify this with the -i option.
./download_version.pl -d /your_path/ -i
- If you want to download all E. coli strains (complete or incomplete), you can use the -s option.
./download_version.pl -d /your_path/ -i -s Escherichia_coli
- If you want to download other file formats for the genome beyond the required .gbk file, you can specify them with the -t option (separated by commas).
./download_version.pl -d /your_path/ -t faa,fna,gff
- You can pass any of the download_version.pl options through the download_load_and_delete_old_version.pl script by using its -s option with the options enclosed in single quotes.
./download_load_and_delete_old_version.pl -d /your_path/ -s '-i -s Escherichia_coli -t faa,fna'
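The single quotes matter because the shell would otherwise split the option string into separate arguments. The sketch below illustrates this with a hypothetical helper, count_args, which stands in for the real script and only reports how many arguments it receives; it is not part of MicrobeDB.

```shell
# Hypothetical stand-in for the wrapper script: prints the number of arguments received.
count_args() {
  echo "$#"
}

# Quoted: everything after -s arrives as ONE argument (4 arguments in total).
quoted=$(count_args -d /your_path/ -s '-i -s Escherichia_coli -t faa,fna')

# Unquoted: the shell splits the option string (8 arguments in total),
# so the wrapper can no longer tell which options belong to download_version.pl.
unquoted=$(count_args -d /your_path/ -s -i -s Escherichia_coli -t faa,fna)

echo "quoted:   $quoted arguments"
echo "unquoted: $unquoted arguments"
```

With the quotes in place, the wrapper receives the download options as a single string after -s.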
Using multiple processors
- If your computer has multiple processors, MicrobeDB can use them to speed up unpacking and loading the genomes into MicrobeDB.
- This is specified using the '-p' option. Using it by itself will use all available processors on your computer. You can also limit the number of processors by specifying a number after the option.
./download_load_and_delete_old_version.pl -d /your_path/ -p 2
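To choose a value for -p, it can help to check how many processors the machine reports. The following is a minimal sketch, not part of MicrobeDB, that assumes getconf _NPROCESSORS_ONLN is available (as on most Linux and macOS systems) and falls back to 1 otherwise. It picks half of the processors and prints the resulting command rather than running it; the "half" policy is only an illustration.

```shell
# Number of processors currently online; fall back to 1 if getconf cannot report it.
total=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Use half of them, but never fewer than one.
procs=$(( total / 2 ))
if [ "$procs" -lt 1 ]; then
  procs=1
fi

# Print the command instead of running it, so the value can be checked first.
echo "./download_load_and_delete_old_version.pl -d /your_path/ -p $procs"
```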
Overview of MicrobeDB
- Genome/flat files are stored in one central location.
- Information at the genome project, chromosome, and gene level is parsed and stored in a MySQL database, including sequences and annotations.
- The files and the database can be updated easily via a single script.
- The genome files are stored in a consistent structure with many different file types:
  - Bacteria_2009-09-01
    - Acaryochloris_marina_MBIC11017
    - Acholeplasma_laidlawii_PG_8A
    - Acidimicrobium_ferrooxidans_DSM_10331
    - Acidiphilium_cryptum_JF-5
      - NC_009467.asn
      - NC_009467.faa
      - NC_009467.ffn
      - NC_009467.fna
      - NC_009467.gbk
      - etc.
  - Bacteria_2009-09-01
- The MySQL database contains the following 4 main tables:
- Version
  - Each monthly download from NCBI is given a new version number.
  - Data will not change as long as you use the same version number of MicrobeDB.
  - The version date can be cited in any methods publications.
  - Each version contains one or more Genomeprojects (genomes).
- Genomeproject
  - Contains information about the genome project and the organism that was sequenced.
  - E.g. taxon_id, org_name, lineage, gram_stain, genome_gc, patho_status, disease, genome_size, pathogenic_in, temp_range, habitat, shape, arrangement, endospore, motility, salinity, etc.
  - Each genomeproject contains one or more Replicons.
- Replicon
  - Chromosomes, plasmids, or contigs (for incomplete genomes).
  - E.g. rep_accnum, definition, rep_type, rep_ginum, cds_num, gene_num, protein_num, genome_id, rep_size, rna_num, rep_seq (complete nucleotide sequence)
  - Each replicon contains one or more genes.
- Gene
  - Contains gene annotations and also the DNA and protein sequences (for protein-coding genes).
  - E.g. gid, pid, protein_accnum, gene_type, gene_start, gene_end, gene_length, gene_strand, gene_name, locus_tag, gene_product, gene_seq, protein_seq
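The directory layout described earlier in this section can be inspected with standard tools. The sketch below builds a small stand-in tree in a temporary directory (so it runs without a real download) and counts genome directories and GenBank files with find. It assumes, as the listing suggests, that each genome directory sits directly under the dated version directory and holds that genome's files; the variable names are illustrative only.

```shell
# Build a tiny stand-in for a version directory (names taken from the listing above).
version_dir="$(mktemp -d)/Bacteria_2009-09-01"
mkdir -p "$version_dir/Acaryochloris_marina_MBIC11017" \
         "$version_dir/Acidiphilium_cryptum_JF-5"
touch "$version_dir/Acidiphilium_cryptum_JF-5/NC_009467.gbk" \
      "$version_dir/Acidiphilium_cryptum_JF-5/NC_009467.faa"

# ASSUMPTION: one subdirectory per genome, directly under the version directory.
genome_count=$(find "$version_dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')

# Count GenBank (.gbk) files anywhere below the version directory.
gbk_count=$(find "$version_dir" -type f -name '*.gbk' | wc -l | tr -d ' ')

echo "genomes: $genome_count, GenBank files: $gbk_count"
```

On a real installation, set version_dir to the dated directory created under the path given with -d instead of building the stand-in tree.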
Using MicrobeDB
- Once MicrobeDB is installed and you have downloaded your first version of genomes, you are ready to start using MicrobeDB.
- Since MicrobeDB stores the parsed genomes in a MySQL database, you can search and retrieve information in several ways.
Searching with MySQL
- If you are familiar with MySQL syntax and comfortable on the command line, you can use the standard MySQL client:
- Connect directly to the MySQL database with the command-line client:
mysql -u microbedb -p
- Then use MySQL syntax to run your queries. For example:
#get a list of all genomes that are described as pathogens
select * from genomeproject where patho_status = 'pathogen';
#get all genes with name "dnaA"
select * from gene where gene_name='dnaA';
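The same queries can also be run non-interactively, which is handy in scripts. The sketch below wraps the mysql client in a hypothetical helper, run_microbedb_query, using the client's -e option to execute one statement and exit. The database name "microbedb" is an assumption, since it depends on how MicrobeDB was installed; the user name is the one from the connection example above.

```shell
# ASSUMPTION: the database is named "microbedb"; change this to match your installation.
MICROBEDB_NAME="microbedb"

# Hypothetical helper: run one SQL statement with the mysql client and print the result.
# -e executes the statement and exits; -p prompts for the password.
run_microbedb_query() {
  mysql -u microbedb -p -e "$1" "$MICROBEDB_NAME"
}

# Usage (prompts for the password):
#   run_microbedb_query "select * from gene where gene_name='dnaA'"
```

The helper is only defined here, not called, so nothing connects to the database until you invoke it yourself.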
Installing 3rd party MySQL programs
- If you are less familiar with MySQL syntax and would prefer a friendlier interface, you can use other software to query MicrobeDB.
- Using a desktop client application such as MySQL Workbench
  - This free, simple-to-install software package provides many features that make querying MySQL databases easier.
- Using a web-based application such as phpMyAdmin
  - phpMyAdmin is more difficult to install, but once installed it provides a web-based way to search and interact with your database.
Programming with the MicrobeDB API
- If you know how to program in Perl, you can use the MicrobeDB Perl API, which allows you to retrieve data without constructing MySQL queries.
- Example of a simple Perl script using the MicrobeDB API that searches for all 'recA' genes and prints them in FASTA format:
#Import the MicrobeDB API
use lib '/your/path/to/MicrobeDB';
use MicrobeDB::Search;
#initialize the search object
$search_obj = MicrobeDB::Search->new();
#create the object that has properties that mus
