MaAsLin User Manual

MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and potentially high-dimensional experimental data.

If you use MaAsLin, please cite the paper the MaAsLin methodology was initially presented in: Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB, Bousvaros A, Korzenik J, Sands BE, Xavier RJ, Huttenhower C. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012 Apr 16;13(9):R79.

If you have questions, please post them to the MaAsLin bioBakery Support Forum.

For more information, see the MaAsLin Tutorial.

Description
Requirements
Installation
Run a Demo
How to Run
Troubleshooting
How to Run in Galaxy
Related Projects and Scripts

Description

MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and potentially high-dimensional experimental data. MaAsLin performs boosted additive general linear models between one group of data (metadata/the predictors) and another group (in our case relative taxonomic abundances/the response). In our context we use it to discover associations between clinical metadata and microbial community relative abundance or function; however, it is applicable to other data types.

Metagenomic data are sparse, and boosting is used to select metadata that show some potential to be useful in a linear model between the metadata and abundances. In the context of metadata and community abundance, a sample's metadata is boosted for one Operational Taxonomic Unit (OTU) (Yi). The metadata that are selected by boosting are then used in a general linear model, with each combination of metadata (as predictors) and OTU abundance (as response variables). This occurs for every OTU and metadata combination. Given we work with proportional data, the Yi (abundances) are arcsin(sqrt(Yi)) transformed. A final formula is as follows:
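Read together with the description above, the per-OTU model has the general form arcsin(sqrt(Yi)) = b0 + b1*X1 + ... + bp*Xp + e, where X1..Xp are the metadata retained by boosting. The coefficient symbols are our own notation, not taken from the MaAsLin source. The sketch below illustrates only the transform and the linear-model step on made-up data; the boosting step is omitted, and the variable names (`abundance`, `Age`, `Smoker`) are hypothetical.

```r
# Toy data: relative abundances (proportions in [0, 1]) of one OTU across
# 8 samples, plus two hypothetical metadata columns.
abundance <- c(0.00, 0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.49)
metadata <- data.frame(
  Age    = c(23, 31, 38, 44, 52, 57, 63, 70),
  Smoker = factor(c("no", "no", "yes", "no", "yes", "yes", "no", "yes"))
)

# Transform for proportional data: arcsin(sqrt(Yi)).
transformed <- asin(sqrt(abundance))

# Linear model of the transformed abundance on the (assumed selected) metadata.
fit <- lm(transformed ~ Age + Smoker, data = metadata)
print(coef(fit))
```

In MaAsLin itself this fit is repeated for every OTU, using only the metadata that boosting selected for that OTU.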

Requirements

MaAsLin requires the following R packages: agricolae, gam (version 1.14), gamlss, gbm, glmnet, inlinedocs, logging (version 0.7-103), MASS, nlme (version 3.1-127), optparse, outliers, penalized, pscl, robustbase

Please install these packages before installing MaAsLin. Here is an example script for installing dependencies (execute from within a fresh R session):

# Load all required packages at once
if(! require("pacman")) install.packages("pacman", repos='http://cran.us.r-project.org')
suppressPackageStartupMessages(library("pacman"))
pacman::p_load('devtools', 'agricolae', 'gamlss', 'gbm', 'nlme', 'gam', 'glmnet', 'inlinedocs', 'logging', 'MASS', 'optparse', 'outliers', 'penalized', 'pscl', 'robustbase')

# Install older version of gam
if (packageVersion('gam') != '1.14') {
  remove.packages('gam')
  devtools::install_version("gam", version = "1.14", repos = "http://cran.us.r-project.org")
}

# Install older version of logging
if (packageVersion('logging') != '0.7-103') {
  remove.packages('logging')
  devtools::install_version("logging", version = "0.7-103", repos = "http://cran.us.r-project.org")
}

# Install older version of nlme
if (packageVersion('nlme') != '3.1-127') {
  remove.packages('nlme')
  devtools::install_version("nlme", version = "3.1-127", repos = "http://cran.us.r-project.org")
}

Installation

Download the latest version of MaAsLin.
Install MaAsLin (where X.Y.Z is the version number) from the command line: $ R CMD INSTALL Maaslin_X.Y.Z.tar.gz. Alternatively, install from within a fresh R session as:

install.packages("https://github.com/biobakery/maaslin/downloads/Maaslin_X.Y.Z.tar.gz", repos = NULL, type = "source")

Run a Demo

Run the demo included in the MaAsLin install.

$ R
> library(Maaslin)
> example(Maaslin)

How to Run

Input Files

There are two essential input files, the "input data" file and the ".read.config" file, plus an optional ".R" script. Details of each file follow:

**1. Input Data File**

Required input file, which we call the PCL file. This file contains all the data and metadata. It is formatted so that metadata and data (OTUs or bugs) are rows and samples are columns. All metadata rows should come before any abundance data. The file should be a tab-delimited text file. A demo PCL file is found in the MaAsLin download in maaslin/inst/extdata/.

A PCL file is a delimited text file, similar to an Excel spreadsheet, with the following characteristics:

Rows represent metadata and features (bugs), columns represent samples.
The first row by default should be the sample IDs.
Metadata rows should be next.
Lastly, rows containing feature (bug) measurements (such as abundance) should come after the metadata rows.
The first column should contain the ID describing the row. For metadata this may be, for example, "Age" for a row containing the age of the patients donating the samples. For measurements, this should be the feature name (bug name).
By default the file is expected to be TAB delimited.
If a consensus lineage or hierarchy of taxonomy is in the feature name, the default delimiter between clades is the pipe ("|").
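The layout rules above can be sketched in base R as follows. This is a minimal illustration on made-up data, not MaAsLin's own reader: the sample IDs, metadata values, and feature names are hypothetical, and `last_metadata_row` is a local variable of this sketch (an index into the data frame after the sample-ID row has been consumed as the header).

```r
# Write a tiny, made-up PCL file: first row = sample IDs, then metadata rows,
# then feature (bug) rows; first column = row ID; tab-delimited.
pcl_lines <- c(
  "sample\tS1\tS2\tS3",
  "Age\t34\t51\tNA",
  "Diagnosis\tCD\tHealthy\tUC",
  "Bacteria|Firmicutes\t0.60\t0.45\t0.70",
  "Bacteria|Bacteroidetes\t0.40\t0.55\t0.30"
)
pcl_path <- tempfile(fileext = ".pcl")
writeLines(pcl_lines, pcl_path)

# Read everything as character first, because rows mix numeric and text values.
pcl <- read.delim(pcl_path, header = TRUE, row.names = 1,
                  colClasses = "character", check.names = FALSE,
                  na.strings = "NA", quote = "")

# "Age" and "Diagnosis" are the metadata rows in this toy file.
last_metadata_row <- 2
metadata <- pcl[seq_len(last_metadata_row), , drop = FALSE]
features <- pcl[-seq_len(last_metadata_row), , drop = FALSE]

# Convert the feature rows to a numeric matrix (features x samples).
abundance <- matrix(as.numeric(as.matrix(features)), nrow = nrow(features),
                    dimnames = dimnames(features))

# Split a feature name into clades on the default pipe delimiter.
clades <- strsplit(rownames(abundance)[1], "|", fixed = TRUE)[[1]]
```

Reading as character and converting afterwards avoids R guessing a single type for a column that holds an age, a diagnosis, and an abundance for the same sample.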

**2. Read Config File**

Required input file *.read.config. A read config file lets you indicate which data are read from a PCL file without having to change the PCL file or the code. This means you can have a PCL file that is a superset of metadata and abundances, including data you are not interested in for the run. This file is a text file with ".read.config" as its extension. It is described in detail in the **Process Flow** section, subsection 4, "Create your read.config file".

**3. R Script File (Optional)**

Optional input file *.R. The R script file uses a callback programming pattern that lets you add or modify specific code to customize the analysis without touching the main MaAsLin engine. A generic R script, "maaslin_demo2.R", is provided and can be renamed and used for any study. The R script can be modified to add quality control or formatting of data, add ecological measurements, or make other changes to the underlying data before MaAsLin runs on it. This file is not required to run MaAsLin.

Updates in the new release

The program can now take either a PCL file or a TSV file as input. It detects the file type from the file suffix: *.tsv or *.pcl.

**Example of running with a TSV file:**

./R/Maaslin.R inst/extdata/maaslin_demo2.tsv output1 -i inst/extdata/maaslin_demo2.read.config

**Example of running with a PCL file:**

./R/Maaslin.R inst/extdata/maaslin_demo2.pcl output2 -i inst/extdata/maaslin_demo2.read.config

**The config file is optional and can be generated dynamically**

If you pass the row number of the last metadata entry (for a PCL file) or the column number of the last metadata entry (for a TSV file) using the parameter --lastMetadata=, the config file is generated for you.

Example of syntax to generate the config file automatically for a pcl file:

./R/Maaslin.R inst/extdata/maaslin_demo2.pcl output3 --lastMetadata=9

Example of syntax to generate the config file automatically for a tsv file:

./R/Maaslin.R inst/extdata/maaslin_demo2.tsv output4 --lastMetadata=9
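The row-versus-column wording above suggests that the TSV layout is the PCL layout transposed, with samples as rows. That is an assumption of the sketch below, as are all names and values in it. It shows how the position of the last metadata entry can be located in either orientation. Whether the value MaAsLin expects also counts the sample-ID row or column is not stated in this section, so the indices computed here count data rows and columns only and should be checked against the demo files before being passed to --lastMetadata.

```r
# Hypothetical PCL-style table: rows are metadata/features, columns are samples.
pcl <- matrix(
  c("34",   "51",      "47",
    "CD",   "Healthy", "UC",
    "0.60", "0.45",    "0.70"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Age", "Diagnosis", "Bacteria|Firmicutes"),
                  c("S1", "S2", "S3"))
)

# Transposing gives the samples-as-rows orientation assumed here for TSV input.
tsv <- t(pcl)

# Position of the last metadata entry: a row index in the PCL orientation,
# a column index in the TSV orientation (ID row/column not counted).
metadata_names <- c("Age", "Diagnosis")
last_in_pcl <- max(match(metadata_names, rownames(pcl)))
last_in_tsv <- max(match(metadata_names, colnames(tsv)))
```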

Process Flow

**1. Obtain your abundance or relative function table.**

Abundance tables are normally derived from sequence data using mothur, QIIME, HUMAnN, or MetaPhlAn. Please refer to their documentation for further details.

**2. Obtain your metadata.**

Metadata are information about the samples in the study. For instance, in a case/control study you may have a disease group and a healthy group (disease state), the sex of the patients (patient demographics), medication use (chemical treatment), smoking (patient lifestyle), or other types of data. All of the aforementioned data are study metadata. This section can hold any type of data (factor, ordered factor, continuous, integer, or logical variables). If a metadata value is missing for a sample, please write NA. Writing NA makes it clear, when looking at the data, that the value is missing and that its absence is intentional and not a mistake. Investigators are often interested in genetic measurements, which may also be placed in the metadata section to associate with bugs.
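As an illustration of the variable types listed above, the following sketch builds a small metadata table in R with one column of each type and an NA in every column. The column names and values are made up, and the sketch makes no claim about how MaAsLin assigns types internally.

```r
# One hypothetical column per metadata type, each with a missing value (NA).
metadata <- data.frame(
  Diagnosis = factor(c("CD", "Healthy", NA, "UC")),              # factor
  Severity  = factor(c("mild", "severe", "mild", NA),
                     levels = c("mild", "moderate", "severe"),
                     ordered = TRUE),                            # ordered factor
  BMI       = c(22.5, NA, 27.1, 30.0),                           # continuous
  Age       = c(34L, 51L, NA, 47L),                              # integer
  Smoker    = c(TRUE, FALSE, NA, TRUE)                           # logical
)

# Count the missing values per metadata variable.
missing_per_variable <- colSums(is.na(metadata))
print(missing_per_variable)
```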

Please note that no special characters are allowed in the metadata header names. These names should contain only alphanumeric characters, periods, and underscores: [a-zA-Z0-9._].
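A quick way to check header names before a run is sketched below. It follows the rule as stated in prose (alphanumeric characters, periods, and underscores); the example names and the underscore-replacement cleanup are illustrative choices of this sketch, not something MaAsLin does for you.

```r
# Hypothetical metadata header names to check.
header_names <- c("Age", "Disease_State", "BMI.kg", "Smoking status", "Dose(mg)")

# Accept only names made entirely of letters, digits, periods, and underscores.
is_valid <- grepl("^[A-Za-z0-9._]+$", header_names)
invalid <- header_names[!is_valid]

# One possible cleanup: replace each disallowed character with an underscore.
cleaned <- gsub("[^A-Za-z0-9._]", "_", header_names)
```

Any cleaned names should be reviewed by hand, since two different original names can collapse to the same cleaned name.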

If you do not want to add metadata to your abundance table manually, you may be interested in associated tools or scripts that help combine your abundance table and metadata to create your PCL file. Both require a specific format for your metadata file. Please
