MicrobiomeBestPracticeReview
Current Challenges and Best Practice Protocols for Microbiome Analysis using Amplicon and Metagenomic Sequencing
<p style='text-align: justify;'> This review paper (https://doi.org/10.1093/bib/bbz155) aims to provide a comprehensive workflow for amplicon and shotgun metagenomics analysis. Two workflows are provided. The first, for amplicon data, uses the standard tools mothur and dada2 and includes some standard visualizations of the processed data. The second, for metagenomics, stitches together a variety of openly available tools into a usable pipeline.</p>Both workflows are controlled by bash scripts: amplicon_analysis.sh and metagenomics_analysis.sh. The bash scripts contain functions that call the respective underlying tools. Of course, the tools have to exist on the system before they can be used; hence each script includes a function called check_and_install, which checks whether the tools exist in a certain path.</br>
Since the workflows use so many different tools, the download and installation process requires quite a bit of patience. Please go through the steps below before you begin to use the workflows.

I. Metagenomic Sequencing Analysis Workflow
Prerequisites
Although the check_and_install function is designed to install and set up the required software on the fly, some basic prerequisites need to be satisfied:
OS
Any Linux-based distro should work. We tested the scripts on the following system (output of 'lsb_release -a' on an Ubuntu-based system):
Distributor ID: Ubuntu <br/> Description: Ubuntu 18.04.2 LTS <br/> Release: 18.04 <br/> Codename: bionic <br/>
Hardware:
<p style='text-align: justify;'> It is no secret that the hardware strongly influences the speed of the workflows. The most time-consuming tasks are those that involve assembly and reference-based alignment. A modest configuration consists of 16+ cores, 100 GB of RAM and 1 TB of disk space. Most of the disk space is occupied by reference databases such as the nr database, the kraken database, etc. Our hardware configuration consists of a 20-core CPU with 128 GB of RAM.</p>
Software and packages
<p style='text-align: justify;'> Some software must be installed directly by the user, as the workflow depends on a lot of external software. Without it, the workflow will fail to run. </p>

- gcc, g++
- java
- python2.7 and python3: pip, libraries (sys, os, shutil, stat, re, time, tarfile, operator, math, Bio, argparse)
- perl libraries (Bio)
- R
- git
- metabat: <p style='text-align: justify;'> Install instructions can be found under https://bitbucket.org/berkeleylab/metabat/src/master/README.md. Metabat should be visible in the system PATH.</p>
- checkM (checkm-genome): <p style='text-align: justify;'> Install instructions can be found at https://github.com/Ecogenomics/CheckM/wiki/Installation.</br>
After installation, the checkM database needs to be downloaded from https://data.ace.uq.edu.au/public/CheckM_databases/ and set up using
checkm data setRoot PATH_TO_DOWNLOADED_DATABASE</p>
NOTE: Make sure checkM ends up under /usr/local/bin.
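Before starting a long run, it can help to confirm that the prerequisites listed above are visible in the system PATH. The following is a minimal sketch, not part of the workflow scripts. The helper `missing_commands` is hypothetical, and the command names in `REQUIRED_CMDS` are assumptions that may differ on your system (for example, MetaBAT may be installed under a different executable name).

```bash
#!/bin/sh
# Sketch: report which prerequisite commands are missing from PATH.
# The command names are assumptions; adjust them to your installation.
REQUIRED_CMDS="gcc g++ java python2.7 python3 perl R git metabat checkm"

# Print every command from the argument list that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Word splitting of REQUIRED_CMDS is intended here.
missing=$(missing_commands $REQUIRED_CMDS)
if [ -n "$missing" ]; then
    echo "Missing prerequisites:"
    echo "$missing"
else
    echo "All prerequisites found."
fi
```

This only checks that the executables can be found; it does not verify versions or the Python and Perl libraries.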
Example data:
The example data for the metagenomics workflow is taken from the MetaHIT gut survey and can be found at ftp.sra.ebi.ac.uk/vol1/fastq/ERR011/. You can download one or more samples for testing purposes.
Steps to run the Metagenomics workflow (metagenomics_analysis.sh)
1. Preparing databases:
sh prepare_databases.sh
Insert the following line into 'metagenomics_analysis.sh':
LINKPATH_DB=/xxx/.../references/
<p style='text-align: justify;'> The databases form the core of the workflows. Unfortunately, the databases are huge and take a long time to download and index. If these databases already exist on your system, please modify the scripts with the correct paths. Otherwise, choose the missing databases and run </p> `prepare_databases.sh` <p style='text-align: justify;'> which installs the databases under `references` in the current directory. When the database preparation finishes, a path is printed to stdout; this path needs to be plugged into the `metagenomics_analysis.sh` script (as the LINKPATH_DB variable). </p>
The following databases are installed:
- Human and Mouse reference genomes: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/ ftp://ftp.ncbi.nih.gov/genomes/Mus_musculus/Assembled_chromosomes/seq/
- Kraken database: http://github.com/DerrickWood/kraken2/archive/v2.0.8-beta.tar.gz (needs to be indexed with kraken2)
- Megan database: http://ab.inf.uni-tuebingen.de/data/software/megan6/download/
- NR database: The non-redundant database can be found at ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz (needs to be indexed with diamond)
- Metaphlan database: https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_v20_m200.tar (needs to be built with bowtie2)
- checkM database: Needs to be installed manually (please check the prerequisites).
- Humann2 database: Downloaded using humann2_databases.
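Because a wrong database path only shows up after the workflow has already been running for a while, a quick sanity check of LINKPATH_DB may be worthwhile. This is a minimal sketch under the assumption that the databases live in a single non-empty directory. The helper `check_db_path` is hypothetical and the default path simply mirrors the `references` folder mentioned above.

```bash
#!/bin/sh
# Sketch: confirm that the database path is usable before a long run.
# The default below is an assumption; use the path printed by prepare_databases.sh.
LINKPATH_DB="${LINKPATH_DB:-$PWD/references/}"

check_db_path() {
    db_path=$1
    if [ ! -d "$db_path" ]; then
        echo "Database directory not found: $db_path" >&2
        return 1
    fi
    # An empty directory usually means the preparation step has not finished.
    if [ -z "$(ls -A "$db_path")" ]; then
        echo "Database directory is empty: $db_path" >&2
        return 1
    fi
    echo "Using databases under: $db_path"
}

check_db_path "$LINKPATH_DB" || echo "Run prepare_databases.sh or fix LINKPATH_DB first."
```

The check does not verify that each individual database has been indexed; it only catches a missing or empty directory.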
2. Add source path for raw data
<p style='text-align: justify;'> In `metagenomics_analysis.sh`, add the path to the raw data. Please note that the workflow will make a local copy of the raw data before proceeding further.</p>
SRC_RAWDATA=/path_to_my_rawdata_samples/.../.../
(or use your own unzipped, demultiplexed, paired-end Illumina reads)
NOTE: <p style='text-align: justify;'> The sample reads must always be paired-end, demultiplexed and compressed in the *.fastq.gz format. Also, the file names of each pair must end with *_1.fastq.gz and *_2.fastq.gz. Example: "Sample_1.fastq.gz" and "Sample_2.fastq.gz". </p>
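The naming rule above can be checked before the workflow copies the raw data. The following sketch assumes all reads sit directly in one directory. The function `check_pairs` is a hypothetical helper and not part of the workflow scripts.

```bash
#!/bin/sh
# Sketch: verify that every *_1.fastq.gz in a directory has a matching *_2.fastq.gz.
check_pairs() {
    raw_dir=$1
    status=0
    for r1 in "$raw_dir"/*_1.fastq.gz; do
        # If the glob matched nothing, the unexpanded pattern is returned.
        if [ ! -e "$r1" ]; then
            echo "No *_1.fastq.gz files found in $raw_dir" >&2
            return 1
        fi
        # Derive the expected mate name by swapping the suffix.
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "Missing mate for $r1 (expected $r2)" >&2
            status=1
        fi
    done
    return $status
}

# Example call (SRC_RAWDATA as set in metagenomics_analysis.sh):
#   check_pairs "$SRC_RAWDATA" && echo "All samples are properly paired."
```

The function returns a non-zero status if any forward read lacks its reverse mate, so it can be used as a guard at the top of a script.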
3. Set name of workflow
<p style='text-align: justify;'> Next, choose an appropriate name for the analysis in the `metagenomics_analysis.sh` script. All sub-folders (tools, analysis, raw data copy, etc.) will be created under this folder name. </p>
NAME=MY_METAGENOMIC_ANALYSIS_EXP
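To illustrate how NAME and SRC_RAWDATA fit together, here is a minimal sketch of creating the analysis folder and taking a local copy of the reads. The sub-folder names (`tools`, `analysis`, `rawdata`) and the helper `prepare_workdir` are assumptions made for illustration; the actual layout created by the workflow may differ.

```bash
#!/bin/sh
# Sketch: create a working directory and copy the raw reads into it.
# Sub-folder names are illustrative assumptions, not the script's actual layout.
prepare_workdir() {
    name=$1   # analysis name, e.g. MY_METAGENOMIC_ANALYSIS_EXP
    src=$2    # directory holding the *.fastq.gz files
    mkdir -p "$name/tools" "$name/analysis" "$name/rawdata"
    # Work on a local copy so the original reads stay untouched.
    cp "$src"/*.fastq.gz "$name/rawdata/"
}

# Example call:
#   prepare_workdir "$NAME" "$SRC_RAWDATA"
```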
4. Run the workflow
Finally, the workflow is ready to be run:
sh metagenomics_analysis.sh
There are messages on the stdout showing the status and progress of the analysis.
<p style='text-align: justify;'> The script consists of several sub-scripts and functions. Each sub-script has its own "check_and_install", which checks for the tools required to run the respective script and installs them if they are missing.</p>
NOTE:<p style='text-align: justify;'>The installation of Megan is interactive and requires the user to input Y/N and memory options (between ~3 GB and 16 GB, depending on sample size). We recommend using the default options. Megan will be installed in the user's home directory.</p>
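The check-then-install behaviour described above can be sketched as follows. This is not the code from the workflow scripts; it is a simplified illustration in which `TOOLS_DIR`, the two-argument signature and the placeholder installer `install_demo_tool` are all assumptions.

```bash
#!/bin/sh
# Sketch of a check-and-install pattern (not the workflow's actual function).
# Hypothetical directory where locally installed tools are kept.
TOOLS_DIR="${TOOLS_DIR:-$PWD/tools}"

# check_and_install TOOL INSTALLER
# Runs INSTALLER only when TOOL is neither on PATH nor under TOOLS_DIR.
check_and_install() {
    tool=$1
    installer=$2
    if command -v "$tool" >/dev/null 2>&1 || [ -x "$TOOLS_DIR/$tool" ]; then
        echo "$tool: found"
    else
        echo "$tool: missing, installing"
        "$installer"
    fi
}

# Placeholder installer; a real one would download and build the tool.
install_demo_tool() {
    mkdir -p "$TOOLS_DIR"
    printf '#!/bin/sh\necho demo\n' > "$TOOLS_DIR/demo_tool_xyz"
    chmod +x "$TOOLS_DIR/demo_tool_xyz"
}

# Example call:
#   check_and_install demo_tool_xyz install_demo_tool
```

Calling the function a second time for the same tool skips the installer, which is what makes reruns of the workflow cheap once everything is in place.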
Step-by-Step Analysis
<p style='text-align: justify;'> The metagenomics workflow is time-consuming. Hence, the bash scripts are kept as simple as possible. To perform only one type of analysis, you can always comment out the remaining functions.</br>For example, the quality control function (run_qc) needs to be run only once; it can then be commented out for any further analysis or reruns. </br>
Likewise, steps that have already been run can be commented out so that only the remaining steps are executed. This is, of course, not possible for steps whose inputs have not yet been produced by earlier steps. </p>
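Commenting out steps could look like the sketch below. Only `run_qc` is named in the text above; the other function names are assumptions derived from the sub-script names, and the function bodies are stand-ins that merely print a message.

```bash
#!/bin/sh
# Stand-ins for the real step functions; each would call its sub-script.
run_qc()         { echo "running quality control"; }
run_assembly()   { echo "running single-sample assembly"; }
run_coassembly() { echo "running co-assembly"; }

main() {
    # run_qc          # already completed in an earlier run
    run_assembly
    # run_coassembly  # not needed for this rerun
}

main
```

Here the assembly step reuses the QC output of the earlier run, so skipping `run_qc` is safe; commenting out a step whose output has never been produced would break the steps that depend on it.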
Brief Description of Each Step
1. run_qc.sh</br> 2. run_assembly.sh</br> 3. run_coassembly.sh</br> 4. run_reference_analysis.sh</br> 5. run_comparative_analysis.sh</br> 6. run_coverage_bining.sh</br> 7. run_binrefinement.sh</br> 8. run_bin_taxonomic_classification.sh</br> 9. run_bin_functional_classification.sh</br>
1. Quality control (run_qc.sh): <p style='text-align: justify;'> This script runs a series of steps with different tools to perform quality control. FastQC is used to generate a comprehensive report of data quality on the raw data. This is followed by a series of steps that remove adapters, low-quality reads, sequencing artifacts, phix adapters and host contamination using trimmomatic, sickle and bbmap.
NOTE: It is very important to review the QC results and adjust the parameters in the script based on, e.g., read length and read quality. </p>
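To make the parameter tuning mentioned in the note more concrete, the sketch below prints (rather than executes) example FastQC and Trimmomatic commands for one paired-end sample. It is not taken from run_qc.sh. The helper `plan_qc`, the adapter file name `adapters.fa`, the output file names and the thresholds are all assumptions, and the `trimmomatic` wrapper name depends on how the tool was installed.

```bash
#!/bin/sh
# Sketch: print example QC commands for one paired-end sample (dry run only).
# Thresholds and file names are placeholders to tune after reviewing the FastQC report.
plan_qc() {
    sample=$1    # path prefix, e.g. rawdata/Sample
    out_dir=$2   # directory for reports and trimmed reads
    echo "fastqc -o $out_dir ${sample}_1.fastq.gz ${sample}_2.fastq.gz"
    echo "trimmomatic PE ${sample}_1.fastq.gz ${sample}_2.fastq.gz" \
         "$out_dir/trimmed_1P.fastq.gz $out_dir/trimmed_1U.fastq.gz" \
         "$out_dir/trimmed_2P.fastq.gz $out_dir/trimmed_2U.fastq.gz" \
         "ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50"
}

# Example call:
#   plan_qc rawdata/Sample qc_out
```

In this example, `SLIDINGWINDOW:4:20` and `MINLEN:50` are the kind of values that should be revisited once the read length and quality profile of the data are known.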
2. Metagenomic Single Sample Assembly (run_assembly.sh): <p style='text-align: justify;'>In this step, genomes from more than one species with nonuniform coverage are assembled de novo in order to characterize the metagenome. Three assemblers (Megahit, SPAdes and IDBA) are integrated in this step. After the assembly, assembly statistics are generated so that the user can decide which assembler worked best on their data. Finally, the assemblies are filtered to keep only contigs with a minimum length of 1000 bp. </p>
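The contig length filter at the end of this step can be illustrated with a small awk-based sketch. It is an assumption about how such a filter might be written, not the code used in run_assembly.sh. The helper `filter_contigs` and the file names in the example call are hypothetical.

```bash
#!/bin/sh
# Sketch: keep only FASTA records whose sequence is at least MIN_LEN bases long.
# Reads FASTA from stdin, writes FASTA to stdout, and joins wrapped sequence lines.
filter_contigs() {
    min_len=$1
    awk -v min="$min_len" '
        # Print the buffered record if it is long enough.
        function emit() {
            if (header != "" && length(seq) >= min) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
             { seq = seq $0 }
        END  { emit() }
    '
}

# Example call:
#   filter_contigs 1000 < contigs.fa > contigs.min1000.fa
```

Sequences are written on a single line per record, which also makes it easy to compute simple statistics afterwards.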
3. Metagenomic Coassembly (run_coassembly.sh): <p style='text-align: justify;'> This step is similar to step 2, except that here the samples are assembled as a group with Megahit and SPAdes. </p>
4. Reference based analysis (run_reference_analysis.sh): <p style='text-align: justify;'>Reference-based analysis is a bit complicated because here we are dealing not with a single genome but with an unknown number and distribution of genomes. One way to deal with this is to align the reads against all available prokaryotic genomes; another is to use a marker gene approach. In this step, different state-of-the-art tools like kraken2 and metaphlan2
