Songbird
Vanilla regression methods for microbiome differential abundance analysis
Install / Use
/learn @biocore/SongbirdREADME
Songbird
What does Songbird produce?
The primary output from Songbird is a file containing differentials. These describe the log-fold change of features (microbes, metabolites, ...) with respect to certain field(s) in your sample metadata. The most important aspect of these differentials are rankings, which are obtained by sorting a column of differentials from lowest to highest. These rankings give information on the relative associations of features with a given covariate.
For more details, please see the paper introducing Songbird.
What do I need to run Songbird?
Songbird has a lot of parameters you can specify, but the three required parameters are:
- A BIOM table containing feature counts for the samples in your dataset
- A sample metadata file containing the covariates you're interested in studying (this should be a tab-separated file)
- A "formula" specifying the covariates to be included in the model Songbird produces, and their interactions (see <a href="#specifying-a-formula">this section</a> for details)
How do I run Songbird?
You can run Songbird as a standalone tool from the command-line or as a QIIME 2 plugin. Either works!
The "Red Sea" dataset
This README's tutorials use a subset of the Red Sea metagenome dataset as an example dataset. This data is stored in the data/redsea/ directory within this repository. Throughout the rest of this README, we'll just refer to this dataset as "the Red Sea dataset", "Red Sea", or something like that.
Features in this dataset are KEGG orthologs (KOs) -- see the dataset's paper, in particular this section, for more details.
What will this README cover?
This README is divided into a few main sections:
- Using Songbird "standalone"
- Using Songbird through QIIME 2
- Specifying a formula
- Interpreting model fitting information
- Adjusting parameters to get reasonable fitting
- Validating by comparing to null/baseline models
- FAQs
- Visualizing Songbird's differentials
- More information
1. Using Songbird "standalone"
Installation
First off, make sure that conda is installed.
Songbird can be installed from the conda-forge channel as follows:
conda create -n songbird_env songbird "pandas>=0.18.0,<1" -c conda-forge
source activate songbird_env
Sidenote
If you get an error about "redirecting" from that first command, you probably
need to add a \ (a backslash character) before the > and < characters.
This can be a problem if you're using fish as your
shell.
Running Songbird
Let's try running Songbird on the "Red Sea" example data included with this
repository. You can just copy-and-paste the command below into your terminal,
assuming you've activated your songbird_env.
songbird multinomial \
--input-biom data/redsea/redsea.biom \
--metadata-file data/redsea/redsea_metadata.txt \
--formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
--epochs 10000 \
--differential-prior 0.5 \
--training-column Testing \
--summary-interval 1 \
--summary-dir results
The output differentials will be stored in results/differentials.tsv.
REQUIRED: Checking the fit of your model
You need to diagnose the fit of the model Songbird creates (e.g. to make sure it isn't "overfitting" to the data). When running Songbird standalone, you can do this using Tensorboard:
tensorboard --logdir .
When you open up Tensorboard in a web browser, it will show plots of the cross validation score and loss. See <a href="#interpreting-model-fitting">this section on interpreting model fitting</a> for details on how to understand these plots, and see <a href="#faqs-standalone">the section of the FAQs on running Songbird standalone</a> for details on how to use Tensorboard.
Once you have a good fit for your model, you need to run a null model for comparison to make sure that your model is more predictive than the null model. See <a href="#validating-null-model">this section on how to run and compare a null model</a>.
A note about metadata and running Songbird standalone
If you have any "comment rows" in your metadata -- for example, a #q2:types
row -- you will need to delete these before running Songbird standalone, since
otherwise standalone Songbird will interpret these rows as actual samples.
(We're planning on addressing this in the future.)
2. Using Songbird through QIIME 2
Installation
First, you'll need to make sure that QIIME 2 is in between (version 2019.7 and 2020.6) is installed before installing Songbird. In your QIIME 2 conda environment, you can install Songbird by running:
pip install songbird
Next, run the following command to get QIIME 2 to "recognize" Songbird:
qiime dev refresh-cache
Importing data into QIIME 2
If your data is not already in QIIME2 format, you can import your BIOM tables into QIIME 2 "Artifacts". Starting from the Songbird root folder, let's import the Red Sea example data into QIIME 2 by running:
qiime tools import \
--input-path data/redsea/redsea.biom \
--output-path redsea.biom.qza \
--type FeatureTable[Frequency]
Running Songbird
With your QIIME 2-formatted feature table, you can run Songbird through QIIME 2 as follows:
qiime songbird multinomial \
--i-table redsea.biom.qza \
--m-metadata-file data/redsea/redsea_metadata.txt \
--p-formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
--p-epochs 10000 \
--p-differential-prior 0.5 \
--p-training-column Testing \
--p-summary-interval 1 \
--o-differentials differentials.qza \
--o-regression-stats regression-stats.qza \
--o-regression-biplot regression-biplot.qza
You can add the --verbose option to see a progress bar while this is running.
REQUIRED: Checking the fit of your model
You need to diagnose the fit of the model Songbird creates (e.g. to
make sure it isn't "overfitting" to the data). When you run Songbird
through QIIME 2, it will generate a regression-stats.qza Artifact. This can be
visualized by running:
qiime songbird summarize-single \
--i-regression-stats regression-stats.qza \
--o-visualization regression-summary.qzv
The resulting visualization (viewable using qiime tools view or at
view.qiime2.org) contains two plots.
These plots are analogous to the two
plots shown in Tensorboard's interface (the top plot shows cross-validation
results, and the bottom plot shows loss information). The interpretation of
these plots is the same as with the Tensorboard plots: see <a href="#interpreting-model-fitting">this section on interpreting model fitting</a> for details on how to understand these plots.
Once you have a good fit for your model, you need to run a null model for comparison to make sure that your model is more predictive than the null model. See <a href="#validating-null-model">this section on how to run and compare a null model</a>.
3. Specifying a formula <span id="specifying-a-formula"></span>
Hang on, what is a formula?
A formula specifies the statistical model to be built based on the columns in the metadata file. For example, if a user wanted to build a statistical model testing for differences between disease states while controlling for sex, the formula would look something like:
--formula "diseased+sex"
where diseased and sex are columns in the sample metadata file.
This is similar to the statistical formulas used in R, but the order in which you
specify the variables (a.k.a. column names) is not important. The backend we use here for processing formulas is
called patsy.
The metadata columns used in the --formula can be either numeric or categorical. As you can imagine, there are many possible ways to specify metadata columns in
your formula -- in particular, encoding categorical variables can get tricky.
(For example, should a
nominal
variable like sex be encoded the
same way as an
ordinal
variable like disease_progression? Probably not!)
In this section we'll go over a few common cases, and link to some more general
documentation at the end in case these examples aren't sufficient.
The implicit "reference": how categorical variables are handled <span id="implicit-reference"></span>
Let's say your formula just includes one categorical variable:
--formula "healthy_or_sick"
...where the only possible values of healthy_or_sick are healthy and
sick. The output differentials produced by Songbird will only have two
columns:
Intercepthealthy_or_sick[T.healthy]ORhealthy_or_sick[T.sick].
The second differential column indicates association with the given value:
healthy_or_sick[T.healthy] differentials indicate association with
healthy-valued samples using sick-valued samples as a reference,
and healthy_or_sick[T.sick] differentials indicate association with
sick-valued samples using healthy-valued samples as a reference. You'll
only get one of these columns; the choice of reference value, if left
unspecified, is arbitrary.
The reference value is used as the denominator in the log-fold change
computation of differentials. So, for healthy_or_sick[T.healthy] -- a
differential column where sick-valued samples are implicitly set as the
reference -- the
