Songbird

What does Songbird produce?

The primary output from Songbird is a file containing differentials. These describe the log-fold change of features (microbes, metabolites, ...) with respect to certain field(s) in your sample metadata. The most important aspect of these differentials are rankings, which are obtained by sorting a column of differentials from lowest to highest. These rankings give information on the relative associations of features with a given covariate.

For more details, please see the paper introducing Songbird.

What do I need to run Songbird?

Songbird has a lot of parameters you can specify, but the three required parameters are:

A BIOM table containing feature counts for the samples in your dataset
A sample metadata file containing the covariates you're interested in studying (this should be a tab-separated file)
A "formula" specifying the covariates to be included in the model Songbird produces, and their interactions (see <a href="#specifying-a-formula">this section</a> for details)

How do I run Songbird?

You can run Songbird as a standalone tool from the command-line or as a QIIME 2 plugin. Either works!

The "Red Sea" dataset

This README's tutorials use a subset of the Red Sea metagenome dataset as an example dataset. This data is stored in the data/redsea/ directory within this repository. Throughout the rest of this README, we'll just refer to this dataset as "the Red Sea dataset", "Red Sea", or something like that.

Features in this dataset are KEGG orthologs (KOs) -- see the dataset's paper, in particular this section, for more details.

What will this README cover?

This README is divided into a few main sections:

Using Songbird "standalone"
Using Songbird through QIIME 2
Specifying a formula
Interpreting model fitting information
Adjusting parameters to get reasonable fitting
Validating by comparing to null/baseline models
FAQs
Visualizing Songbird's differentials
More information

1. Using Songbird "standalone"

Installation

First off, make sure that conda is installed.

Songbird can be installed from the conda-forge channel as follows:

conda create -n songbird_env songbird "pandas>=0.18.0,<1" -c conda-forge
source activate songbird_env

Sidenote

If you get an error about "redirecting" from that first command, you probably need to add a \ (a backslash character) before the > and < characters. This can be a problem if you're using fish as your shell.

Running Songbird

Let's try running Songbird on the "Red Sea" example data included with this repository. You can just copy-and-paste the command below into your terminal, assuming you've activated your songbird_env.

songbird multinomial \
	--input-biom data/redsea/redsea.biom \
	--metadata-file data/redsea/redsea_metadata.txt \
	--formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
	--epochs 10000 \
	--differential-prior 0.5 \
	--training-column Testing \
	--summary-interval 1 \
	--summary-dir results

The output differentials will be stored in results/differentials.tsv.

REQUIRED: Checking the fit of your model

You need to diagnose the fit of the model Songbird creates (e.g. to make sure it isn't "overfitting" to the data). When running Songbird standalone, you can do this using Tensorboard:

tensorboard --logdir .

When you open up Tensorboard in a web browser, it will show plots of the cross validation score and loss. See <a href="#interpreting-model-fitting">this section on interpreting model fitting</a> for details on how to understand these plots, and see <a href="#faqs-standalone">the section of the FAQs on running Songbird standalone</a> for details on how to use Tensorboard.

Once you have a good fit for your model, you need to run a null model for comparison to make sure that your model is more predictive than the null model. See <a href="#validating-null-model">this section on how to run and compare a null model</a>.

A note about metadata and running Songbird standalone

If you have any "comment rows" in your metadata -- for example, a #q2:types row -- you will need to delete these before running Songbird standalone, since otherwise standalone Songbird will interpret these rows as actual samples. (We're planning on addressing this in the future.)

2. Using Songbird through QIIME 2

Installation

First, you'll need to make sure that QIIME 2 is in between (version 2019.7 and 2020.6) is installed before installing Songbird. In your QIIME 2 conda environment, you can install Songbird by running:

pip install songbird

Next, run the following command to get QIIME 2 to "recognize" Songbird:

qiime dev refresh-cache

Importing data into QIIME 2

If your data is not already in QIIME2 format, you can import your BIOM tables into QIIME 2 "Artifacts". Starting from the Songbird root folder, let's import the Red Sea example data into QIIME 2 by running:

qiime tools import \
	--input-path data/redsea/redsea.biom \
	--output-path redsea.biom.qza \
	--type FeatureTable[Frequency]

Running Songbird

With your QIIME 2-formatted feature table, you can run Songbird through QIIME 2 as follows:

qiime songbird multinomial \
	--i-table redsea.biom.qza \
	--m-metadata-file data/redsea/redsea_metadata.txt \
	--p-formula "Depth+Temperature+Salinity+Oxygen+Fluorescence+Nitrate" \
	--p-epochs 10000 \
	--p-differential-prior 0.5 \
	--p-training-column Testing \
	--p-summary-interval 1 \
	--o-differentials differentials.qza \
	--o-regression-stats regression-stats.qza \
	--o-regression-biplot regression-biplot.qza

You can add the --verbose option to see a progress bar while this is running.

REQUIRED: Checking the fit of your model

You need to diagnose the fit of the model Songbird creates (e.g. to make sure it isn't "overfitting" to the data). When you run Songbird through QIIME 2, it will generate a regression-stats.qza Artifact. This can be visualized by running:

qiime songbird summarize-single \
	--i-regression-stats regression-stats.qza \
	--o-visualization regression-summary.qzv

The resulting visualization (viewable using qiime tools view or at view.qiime2.org) contains two plots. These plots are analogous to the two plots shown in Tensorboard's interface (the top plot shows cross-validation results, and the bottom plot shows loss information). The interpretation of these plots is the same as with the Tensorboard plots: see <a href="#interpreting-model-fitting">this section on interpreting model fitting</a> for details on how to understand these plots.

3. Specifying a formula <span id="specifying-a-formula"></span>

Hang on, what is a formula?

A formula specifies the statistical model to be built based on the columns in the metadata file. For example, if a user wanted to build a statistical model testing for differences between disease states while controlling for sex, the formula would look something like:

--formula "diseased+sex"

where diseased and sex are columns in the sample metadata file. This is similar to the statistical formulas used in R, but the order in which you specify the variables (a.k.a. column names) is not important. The backend we use here for processing formulas is called patsy.

The metadata columns used in the --formula can be either numeric or categorical. As you can imagine, there are many possible ways to specify metadata columns in your formula -- in particular, encoding categorical variables can get tricky. (For example, should a nominal variable like sex be encoded the same way as an ordinal variable like disease_progression? Probably not!) In this section we'll go over a few common cases, and link to some more general documentation at the end in case these examples aren't sufficient.

The implicit "reference": how categorical variables are handled <span id="implicit-reference"></span>

Let's say your formula just includes one categorical variable:

--formula "healthy_or_sick"

...where the only possible values of healthy_or_sick are healthy and sick. The output differentials produced by Songbird will only have two columns:

Intercept
healthy_or_sick[T.healthy] OR healthy_or_sick[T.sick].

The second differential column indicates association with the given value: healthy_or_sick[T.healthy] differentials indicate association with healthy-valued samples using sick-valued samples as a reference, and healthy_or_sick[T.sick] differentials indicate association with sick-valued samples using healthy-valued samples as a reference. You'll only get one of these columns; the choice of reference value, if left unspecified, is arbitrary.

The reference value is used as the denominator in the log-fold change computation of differentials. So, for healthy_or_sick[T.healthy] -- a differential column where sick-valued samples are implicitly set as the reference -- the

Songbird

Install / Use

README

Songbird

What does Songbird produce?

What do I need to run Songbird?

How do I run Songbird?

The "Red Sea" dataset

What will this README cover?

1. Using Songbird "standalone"

Installation

Sidenote

Running Songbird

REQUIRED: Checking the fit of your model

A note about metadata and running Songbird standalone

2. Using Songbird through QIIME 2

Installation

Importing data into QIIME 2

Running Songbird

REQUIRED: Checking the fit of your model

3. Specifying a formula <span id="specifying-a-formula"></span>

Hang on, what is a formula?

The implicit "reference": how categorical variables are handled <span id="implicit-reference"></span>