Qiime16sTutorial
A tutorial on methods of 16S analysis with QIIME 1
A Tutorial for up-to-date, simple, and robust 16S analysis with QIIME
Note 2: This tutorial is 4 years old. Much more up-to-date resources exist today, and it is preserved only for legacy purposes.
Note: In the 2+ years since I made this tutorial, the field has continued to change at the fast pace it always has. If you are doing 16S analysis, you should use QIIME2 instead of this QIIME1 tutorial, and exact sequence variants instead of OTUs. I am preserving this tutorial because I think it is a good introduction to the general pipeline for analysis of 16S data, but I do not recommend following it exactly anymore because it is outdated. I also cannot seriously recommend the Greengenes database for environmental microbiology work: while the core structure of that database hasn't changed since 2013, an enormous number of novel bacterial phyla have been discovered and tens of thousands of bacterial genomes have been added to the databases since then. SILVA seems significantly better and should be the initial choice when doing 16S work, but it is also missing a large number of novel bacterial genomes that are often highly abundant in many ecosystems (e.g., the extremely abundant Rokubacteria from soil and the Melainabacteria from soil and the human gut). It is difficult to understand why this is the case, but for the time being there isn't a 16S dataset that accurately reflects what we know about microbial diversity.
Methods of 16S sequencing data analysis have changed rapidly over the past few years, leaving most available online tutorials for QIIME out of date with respect to sequencing technology, QIIME syntax and scripts, or best practices in statistical analysis. On top of that, the sheer number of scripts and methods packaged with QIIME (or other microbial analysis pipelines!) can be overwhelming to someone new to the field. This page is intended as an accessible and straightforward introduction to analyzing 16S sequencing data using statistically robust and current methods. I have tried to organize it around a question/hypothesis framework, so that each section is primarily concerned with how to answer a particular question or test a particular hypothesis about the microbiota you are studying.
Created in the DiRuggiero lab at Johns Hopkins.
I will cover how to approach and answer the following hypotheses and questions:
1. Proportionally, what microbes are found in each sample community?
2. How many species are in each sample?
3. Are there species significantly more abundant in one set of samples than in another?
4. How much does diversity change between samples?
5. Do different sample groupings significantly differ in their microbial composition?
6. Which species abundances are significantly correlated with an environmental variable?
7. Do environmental differences between samples correlate with microbial composition?
8. How often are species found together, and in which samples are they found?
Links and Tools
- QIIME Scripts
- Phyloseq
- Stability of operational taxonomic units: an important but neglected property for analyzing microbial diversity. He Y, Caporaso JG, Jiang XT, Sheng HF, Huse SM, Rideout JR, Edgar RC, Kopylova E et al. Microbiome 2015, 3:20 (20 May 2015)
- McMurdie, P. J., & Holmes, S. (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol, 10(4), e1003531.
- Robust methods for differential abundance analysis in marker gene surveys. JN Paulson, 2013.
- PICRUSt
A note before we begin
There's a lot of mostly well-mannered disagreement over which methods are "best" for analyzing 16S sequencing data, especially when it comes to picking OTUs. I have found that the methods below are definitely not "noise-free", especially if open reference OTU picking is run on deeply sequenced samples that contain many novel OTUs. Likely the best thing to do, particularly if you have the time, is to re-analyze the same data using different pipelines and compare the quality of the results. In particular, you may also be interested in the mothur MiSeq SOP and the DADA2 pipeline tutorial.
Generating Your Data
For the purpose of this tutorial, we'll be using a small dataset of 10 samples of 16S Illumina sequences from microbial communities inhabiting 3 different rock/soil environments: Luna (Calcite rock), Ignimbrite rock, and Soil (SAT). All sequences from these samples have been combined into a single seqs.fna FASTA file for our analysis. If you want to follow along with the tutorial, you can download this git repo using the download button on GitHub or by running git clone. Example output of all of the functions run in this tutorial is included in the tutorial_output folder.
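Before going further it can help to confirm how many reads each sample contributes to the combined FASTA file. The following is a minimal Python sketch, not part of QIIME. It assumes the headers follow the `>SampleID_N` pattern (sample ID, underscore, running number), which is an assumption about seqs.fna rather than something stated here; the function name and the example records are made up for illustration.

```python
from collections import Counter
import io


def count_reads_per_sample(fasta_handle):
    """Count FASTA records per sample, assuming headers like '>SampleID_123 ...'."""
    counts = Counter()
    for line in fasta_handle:
        if not line.startswith(">"):
            continue
        tokens = line[1:].split()
        if not tokens:
            continue  # skip malformed, empty headers
        # The first token is the sequence ID; everything before its last
        # underscore is taken as the sample ID (assumed header layout).
        sample_id = tokens[0].rsplit("_", 1)[0]
        counts[sample_id] += 1
    return counts


# Made-up records for illustration only.
example = io.StringIO(
    ">Luna.1_0 read-a\nACGT\n"
    ">Luna.1_1 read-b\nACGA\n"
    ">Soil.1_0 read-c\nTTGA\n"
)
for sample, n_reads in sorted(count_reads_per_sample(example).items()):
    print(sample, n_reads)
```

For the real file, pass `open("seqs.fna")` instead of the in-memory example.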
Creating a mapping file
Before analyzing a set of samples, creating a mapping file is useful for the purpose of thinking about experimental design and hypothesis testing. The mapping file for QIIME includes information about your sequencing files and their associated metadata. It should be a tab-delimited text file - you can make it in Excel.
The columns SampleID, BarcodeSequence, LinkerPrimerSequence, and Description are required for each sample. SampleIDs should match the sample names used in the sequence headers of your FASTA files. You can add other columns of metadata as needed, but Description should always be the last column. The mapping file for this example is saved as example_map.txt in this repository as a reference.
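The layout rules above can be checked with a few lines of Python before running anything in QIIME. This is a rough sketch and not a replacement for QIIME's own validation. The exact header spellings in `REQUIRED_COLUMNS` (including the leading `#`) are my assumption about the QIIME 1 format, and the function name, the Environment column, and the sample rows are made up for illustration.

```python
import csv
import io

# Assumed QIIME 1 header names; adjust them to match your own mapping file.
REQUIRED_COLUMNS = ("#SampleID", "BarcodeSequence", "LinkerPrimerSequence", "Description")


def check_mapping_file(handle, required=REQUIRED_COLUMNS):
    """Return a list of problems found in a tab-delimited mapping file."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        return ["file is empty"]
    header, samples = rows[0], rows[1:]
    problems = []
    for name in required:
        if name not in header:
            problems.append(f"missing required column: {name}")
    if header[-1] != "Description":
        problems.append("Description is not the last column")
    sample_ids = [row[0] for row in samples]
    if len(sample_ids) != len(set(sample_ids)):
        problems.append("duplicate sample IDs")
    for row in samples:
        if len(row) != len(header):
            problems.append(
                f"row for {row[0]} has {len(row)} fields, expected {len(header)}"
            )
    return problems


# Made-up mapping file with one extra metadata column (Environment).
example = io.StringIO(
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tEnvironment\tDescription\n"
    "Luna.1\tACGTACGT\tGTGCCAGC\tCalcite\tfirst calcite sample\n"
    "Soil.1\tTGCATGCA\tGTGCCAGC\tSoil\tfirst soil sample\n"
)
print(check_mapping_file(example) or "no problems found")
```

An empty list means the file passed these particular checks; it says nothing about whether the barcodes or primers themselves are correct.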
Picking Open Reference OTUs
The first step in our analysis is OTU picking. QIIME offers several OTU picking strategies, and for almost all single-experiment analyses it is best to use open reference OTU picking. OTUs are Operational Taxonomic Units: clusters of sequences that are at least X% identical, where X is generally 97. OTUs are not a perfect way to describe the data, but they are very widely used. "Open reference" picking uses a database of known 16S genes to create OTU clusters while also allowing new OTUs to form from sequences that are sufficiently different from the references.
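To make the idea concrete, here is a toy Python sketch of open reference picking. It is not the algorithm QIIME or usearch actually runs: it compares equal-length strings position by position instead of aligning them, assigns each read to the first centroid that passes the threshold rather than the best one, and handles reference matching and new-OTU formation in a single pass. All names and sequences are invented for illustration.

```python
def percent_identity(a, b):
    """Percent of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy identity measure needs equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def open_reference_pick(reads, references, threshold=97.0):
    """Assign each read to a reference OTU if possible, else to a new OTU.

    reads: dict of read_id -> sequence
    references: dict of reference_id -> sequence
    Returns a dict of otu_id -> list of read_ids.
    """
    centroids = dict(references)  # OTU id -> representative sequence
    otus = {}
    new_count = 0
    for read_id, seq in reads.items():
        hit = None
        for otu_id, centroid in centroids.items():
            if percent_identity(seq, centroid) >= threshold:
                hit = otu_id
                break
        if hit is None:
            # No existing OTU is close enough, so this read seeds a new one.
            hit = f"New.{new_count}"
            new_count += 1
            centroids[hit] = seq
        otus.setdefault(hit, []).append(read_id)
    return otus


references = {"ref1": "A" * 100}
reads = {
    "r1": "A" * 98 + "CC",            # 98% identical to ref1
    "r2": "A" * 90 + "C" * 10,        # 90% identical to ref1
    "r3": "A" * 90 + "C" * 9 + "G",   # 99% identical to r2
}
print(open_reference_pick(reads, references))
```

Here r1 joins the reference OTU, r2 is too different from the reference and founds a new OTU, and r3 then clusters with r2.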
Note: Why do we use open-reference picking here? See Stability of operational taxonomic units: an important but neglected property for analyzing microbial diversity. Also check out the QIIME page on OTU picking for cases in which you'd want to use closed reference OTU picking. I highly doubt that you'll be in a situation where you have to, as it is mainly restricted to comparisons between different sequencing regions.
This pipeline and its implementation in QIIME are definitely not controversy-free. Consider reading De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units, and running a comparison of your data with the mothur average neighbor clustering algorithm.
You'll need to download the Greengenes database for this step, which is the database of reference 16S sequences we'll use to assign taxonomy. You'll need the file 97_otus.fasta, a FASTA file of all reference sequences with known taxonomy. Because the latest Greengenes release as of September 2015 is from 2013 (there may be an update soon), if you are interested in specific rare taxa discovered since 2013, you may want to add 16S sequences from those organisms to the 97_otus.fasta file manually.
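If you do add sequences by hand, one cautious approach is to write an extended copy of the reference file rather than editing 97_otus.fasta in place, and to refuse IDs that already exist. The sketch below is only an illustration of that idea in Python; the function names, file names, and records are hypothetical.

```python
import os
import tempfile


def read_fasta_ids(path):
    """Collect the IDs (first token of each header line) in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def extend_reference(reference_path, extra_records, output_path):
    """Write the reference plus extra (id, sequence) records to output_path.

    extra_records must be a list of (id, sequence) tuples. Raises ValueError
    before writing anything if an extra ID collides with an existing one.
    """
    seen = read_fasta_ids(reference_path)
    for seq_id, _ in extra_records:
        if seq_id in seen:
            raise ValueError(f"ID already in reference: {seq_id}")
        seen.add(seq_id)
    last_line = "\n"
    with open(reference_path) as src, open(output_path, "w") as dst:
        for line in src:
            dst.write(line)
            last_line = line
        if not last_line.endswith("\n"):
            dst.write("\n")  # keep the new header on its own line
        for seq_id, sequence in extra_records:
            dst.write(f">{seq_id}\n{sequence}\n")


# Demonstration on a tiny made-up reference in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    ref = os.path.join(workdir, "toy_reference.fasta")
    out = os.path.join(workdir, "toy_reference_extended.fasta")
    with open(ref, "w") as handle:
        handle.write(">1001\nACGTACGT")
    extend_reference(ref, [("custom_taxon_1", "TTGACCGA")], out)
    print(sorted(read_fasta_ids(out)))
```

Writing to a new file leaves the downloaded database untouched, so the change is easy to undo.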
To actually perform the OTU picking, you will need either usearch or vsearch. usearch is widely used for OTU picking but is closed source; vsearch is an open source stand-in replacement for it.
It is slightly easier to use the original usearch than to install vsearch, so that is what we will do here. Make sure usearch version 6.1.554 is installed on your path: request a download from the link above, move the executable to /usr/bin, and rename it usearch61. Another version may cause an error. You know you have installed usearch correctly when the command "usearch61" runs. If using vsearch, rename the vsearch executable to usearch61 in /usr/bin.
Note: A common mistake is to forget to run
sudo chmod 777 /usr/bin/usearch
and/or
sudo chmod 777 /usr/bin/usearch61
before trying to run usearch for the first time.
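A quick way to tell a missing executable apart from one that is present but lacks execute permission is sketched below in Python. It is a small diagnostic written for this page, not part of QIIME or usearch, and the function name `diagnose` is made up.

```python
import os
import shutil


def diagnose(name, path=None):
    """Report whether `name` is runnable, present but not executable, or missing.

    `path` is an os.pathsep-separated search path; it defaults to $PATH.
    """
    found = shutil.which(name, path=path)  # only returns executable files
    if found:
        return f"ok: {found}"
    search_path = path if path is not None else os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return f"found but not executable: {candidate}"
    return f"not found on PATH: {name}"


print(diagnose("usearch61"))
```

A "found but not executable" result points to the permissions problem described in the note above.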
Next, we need to create a basic params file for the OTU picking pipeline. A number of params can be included (described here and here). In our working directory, create the file params.txt and edit it to look like:
pick_otus:enable_rev_strand_match True
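Each line of the params file has the form script name, colon, option name, space, value. If you end up with several settings, it can be convenient to generate and re-read the file programmatically. The Python sketch below shows one way to do that; the helper names are hypothetical, and treating lines that start with `#` as comments is my assumption about the format.

```python
def format_params(params):
    """Render {(script, option): value} pairs as parameter-file lines."""
    return "".join(
        f"{script}:{option} {value}\n" for (script, option), value in params.items()
    )


def parse_params(text):
    """Parse parameter-file text into {script: {option: value}}."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or (assumed) comment
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        script, _, option = key.partition(":")
        parsed.setdefault(script, {})[option] = value
    return parsed


text = format_params({("pick_otus", "enable_rev_strand_match"): "True"})
print(text, end="")
print(parse_params(text))
```

To save the result, write `text` to params.txt in your working directory.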
We are now ready to run the open reference OTU picking pipeline. The command to run it is below.