SkillAgentSearch skills...

MicrobeDB

Provides a local database of in-house and published genomes for Bacteria and Archaea from NCBI

Install / Use

/learn @mlangill/MicrobeDB
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

MicrobeDB

ABOUT

  • MicrobeDB provides centralized local storage and access to completed archaeal and bacterial genomes.
  • MicrobeDB contains three main features.
  1. All "flat" files associated with the each genome are downloaded from NCBI (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi) and stored locally in a directory of your choosing.

  2. For each genome, information about the organism, chromosomes within the organism, and genes within each chromosome are parsed and stored in a MySQL database including sequences and annotations.

  3. A Perl API is provided to interface with the MySQL database and allow easy use of the data.

  • By default all RefSeq genomes are downloaded

    • Incomplete/draft genomes can also be obtained.
    • A subset of genomes from a particular genera can be obtained instead based on a search term.
    • Unpublished/in-house genomes can be added easily.
  • A presentation providing an overview of MicrobeDB is at information/MicrobeDB_Overview.pdf

REQUIREMENTS

  • MySQL

  • Perl

  • Perl Modules (available from CPAN)

    • BioPerl
    • DBI
    • DBD::mysql
    • Parallel::ForkManager
    • Log::Log4perl
    • Sys::CPU
  • ~20GB of hard drive space (this is based on default settings). Much less can be used by downloading a subset of genomes. Much more can be used if incomplete genomes or additional file types are downloaded.

Installing the MicrobeDB software

Updating the MicrobeDB software

Testing the MicrobeDB installation

  • To test if MicrobeDB is installed properly we will download a single genome and load it into the MySQL database.

  • All programs are run from the command line and are located in "scripts" directory.

      cd MicrobeDB/scripts
    
  • Run MicrobeDB specifying a single genome (note this will create a dated directory in your home directory):

      ./download_load_and_delete_old_version.pl -d ~/ -s '-s Pseudomonas_aeruginosa_LES'
    
  • The previous command should give you information about what it is doing without any errors.

  • You can test the MicrobeDB API and the ability to get information out of the database by running the example scripts:

      cd MicrobeDB/information/example_scripts
      ./get_genome_sizes_of_aquatic_genomes.pl
      ./search_for_pathogen_recA_genes.pl
    
  • Running the example scripts should print information to the terminal without any errors.

  • To remove this test "version". Run the command below and answer the interactive prompts:

      MicrobeDB/scripts/delete_version.pl
      
    
  • Great MicrobeDB is installed correctly and you can now proceed with downloading a complete version of genomes (see next section).

Downloading a new "version" of genomes with MicrobeDB

  • To use MicrobeDB you will have to download and load a new version of genome files.

  • All programs must be run from the command line and are located in the directory "scripts". Open up a console and change into the scripts directory:

      cd MicrobeDB/scripts
    
  • Now, by default MicrobeDB will download,unpack, parse and load all completed RefSeq genomes. You can do this with the following command (and replace the directory after the -d option with any directory where you want the genome files to be stored.)

      ./download_load_and_delete_old_version.pl -d /your_path/microbedb_genome_storage
    
  • ./download_load_and_delete_old_version.pl is a basically a wrapper script that runs 3 other scripts. It is conveniant because it does everything for you and can be easily set to run on a regular basis (monthly, bimonthly, etc.) as a "cron" job or with other scheduling software.

  • You can run the 3 scripts manually, which gives you more control and maybe useful in case there are any errors in the update.

  1. download_version.pl
  2. unpack_version.pl
  3. load_version.pl
  • You can use the -h option (e.g. ./download_version.pl -h) or 'perldoc download_version.pl' to get help for any of the scripts.

  • For example if you want to download incomplete genomes as well you can specify this with the -i option.

      ./download_version.pl -d /your_path/ -i
    
  • If you wanted to download all E.coli strains (complete or incomplete) you can use the -s option.

      ./download_version.pl -d /your_path/ -i -s Escherichia_coli
    
  • If you wanted to download other file formats for the genome beyond the required .gbk file then you can specify them with the -t option (seperated by commas)

      ./download_version.pl -d /your_path/ -t faa,fna,gff
    
  • You can specify any of the download_version.pl options in the download_load_and_delete_old_version.pl script using the -s with single quotes.

      ./download_load_and_delete_old_version.pl -d /your_path/ -s '-i -s Escherichia_coli -t faa,fna'
    

Using multiple processors

  • If your computer has multiple processors, MicrobeDB can use these to increase the speed of unpacking and loading the genomes into MicrobeDB.

  • This is speficied using the '-p' option. Using it by itself with use all available processors on your computer. You can also limit the number of processors by specifying it after the option.

      ./download_load_and_delete_old_version.pl -d /your_path/ -p 2
    

Overview of MicrobeDB

  • Genome/Flat files are stored in one central location

  • Information at the genome project, chromosome, and gene level are parsed and stored in a MySQL database including sequences and annotations

  • The files and the database can be updated easily via a single script

  • The genome files are stored in consistent structure with many different file types:

    • Bacteria_2009-09-01
      • Acaryochloris_marina_MBIC11017
      • Acholeplasma_laidlawii_PG_8A
      • Acidimicrobium_ferrooxidans_DSM_10331
      • Acidiphilium_cryptum_JF-5
        • NC_009467.asn
        • NC_009467.faa
        • NC_009467.ffn
        • NC_009467.fna
        • NC_009467.gbk
        • etc.
  • The MySQL database contains the following 4 main tables:

    • Version

      • Each monthly download from NCBI is given a new version number
      • Data will not change if you always use the same version number of microbedb
      • Version date can be cited for any method publications
      • Each version contains one or more Genomeprojects (genomes)
    • Genomeproject

      • Contains information about the genome project and the organism that was sequenced
      • E.g. taxon_id, org_name, lineage, gram_stain, genome_gc, patho_status, disease, genome_size, pathogenic_in, temp_range, habitat, shape, arrangement, endospore, motility, salinity, etc.
      • Each genomeproject contains one or more Replicons
    • Replicon

      • Chromosome, plasmids, or contigs (for incomplete genomes)
      • E.g. rep_accnum, definition, rep_type, rep_ginum, cds_num, gene_num, protein_num, genome_id, rep_size, rna_num, rep_seq (complete nucleotide sequence)
      • Each replicon contains one or more genes
    • Gene

      • Contains gene annotations and also the DNA and protein sequences (if protein coding gene)
      • E.g. gid, pid, protein_accnum, gene_type, gene_start, gene_end, gene_length, gene_strand, gene_name, locus_tag, gene_product, gene_seq, protein_seq

Using MicrobeDB

  • Once MicrobeDB is installed and you have downloaded your first version of genomes you are ready to start using MicrobeDB.
  • Since MicrobeDB parses genomes in a MySQL database you can search and retrieve information in various ways.

Searching with MySQL

  • If you are familiar with MySQL syntax and are comfortable with a commandline then you can use the traditional MySQL client:
  1. Connecting directly to MySQL database via command line client

     mysql -u microbedb -p
    
  2. Then use MySQL syntax to do your queries. For example:

     #get a list of all genomes that are described as pathogens
     select * from genomeproject where patho_status = 'pathogen'
     
     #get all genes with name "dnaA"
     select * from gene where gene_name='dnaA'
     
    

Installing 3rd party MySQL programs

  • If you are not as familiar with MySQL syntax and would like a more pretty interface, then you can use other software to query MicrobeDB.
  1. Using a client desktop application such as MySQL Workbench

    • This is a simple to install and free software package provides many features which make querying MySQL databases easier.
  2. Using a web based application such as phpMyAdmin

    • phpMyAdmin is more difficult to install, but once it is it allows a web-based method to search and interact with your database.

Programming with the MicrobeDB API

  • If you know how to program in Perl you can use the MicrobeDB Perl API which allows you to retrieve data without constructing MySQL queries.

  • Example of a simple perl script using the MicrobeDB API that searches for all 'recA' genes and prints them in 'Fasta' format:

      #Import the MicrobeDB API
      use lib '/your/path/to/MicrobeDB';
      use MicrobeDB::Search;
    
      #intialize the search object
      $search_obj= MicrobeDB::Search();
    
      #create the object that has properties that mus
    
View on GitHub
GitHub Stars24
CategoryData
Updated2y ago
Forks13

Languages

Perl

Security Score

75/100

Audited on Nov 2, 2023

No findings