MOFA: Multi-Omics Factor Analysis
Important notice: MOFA v1 (this repository) is officially deprecated; please switch to MOFA v2.
MOFA is a factor analysis model that provides a general framework for the integration of multi-omic data sets in a completely unsupervised fashion.
Intuitively, MOFA can be viewed as a versatile and statistically rigorous generalization of principal component analysis (PCA) to multi-omics data. Given several data matrices with measurements of multiple -omics data types on the same or on overlapping sets of samples, MOFA infers an interpretable low-dimensional data representation in terms of (hidden) factors. These learnt factors represent the driving sources of variation across data modalities, thus facilitating the identification of cellular states or disease subgroups.
Once trained, the model output can be used for a range of downstream analyses, including the visualisation of samples in factor space, the automatic annotation of factors using (gene set) enrichment analysis, the identification of outliers (e.g. due to sample swaps) and the imputation of missing values.
For more details you can read our paper: http://msb.embopress.org/cgi/doi/10.15252/msb.20178124
<p align="center"> <img src="images/logo.png" style="width: 50%; height: 50%"/> </p>
News
- 01/01/2020 Beta version of the MOFA+ software is available. We recommend that all users switch to MOFA+.
- 01/12/2019 The manuscript for the new version of MOFA (MOFA+) is published on bioRxiv
- 03/05/2019 MOFA is available in Bioconductor (only for R>=3.6).
- 10/01/2019 Python package uploaded to PyPI (https://pypi.org/project/mofapy/)
- 21/06/2018 Beta version released
- 20/06/2018 Paper published: http://msb.embopress.org/content/14/6/e8124
- 10/11/2017 We created a Slack group to provide personalised help on running and interpreting MOFA, this is the link
Installation
MOFA is run exclusively from R, but it requires some Python dependencies.
Python dependencies
Python dependencies can be installed using pip (from the Unix terminal):
pip install mofapy
Alternatively, they can be installed from R itself using the reticulate package:
library(reticulate)
py_install("mofapy", envname = "r-reticulate", method="auto")
MOFAdata R data package
For illustration purposes we provide several data sets that are used in the vignettes of the MOFA package. They can be installed from R:
devtools::install_github("bioFAM/MOFAdata", build_opts = c("--no-resave-data"))
MOFA R package
This is the core software itself. It can be installed from R:
devtools::install_github("bioFAM/MOFA", build_opts = c("--no-resave-data"))
Reticulate configuration
Before running MOFA, you need to make sure that reticulate is pointing to the correct python binary (or conda environment).
This can become tricky when you have multiple conda environments and versions of Python installed:
library(reticulate)
# Using a specific python binary
use_python("/home/user/python", required = TRUE)
# Using a conda environment called "r-reticulate"
use_condaenv("r-reticulate", required = TRUE)
For more details on how to set up the reticulate connection, see: https://rstudio.github.io/reticulate/
Tutorials/Vignettes
We currently provide three example workflows:
- Integration of multi-omics cancer data: a cohort of 200 chronic lymphocytic leukaemia patients. This is the main data set analysed in the paper. Load it using vignette("MOFA_example_CLL").
- Integration of single-cell multi-omics data: single-cell profiling of DNA methylation and RNA expression in roughly 100 pluripotent stem cells. This is the secondary data set analysed in the paper. Load it using vignette("MOFA_example_scMT").
- Model selection and robustness with simulated data: this tutorial is focused only on how to perform model selection and assess robustness. Load it using vignette("MOFA_example_simulated").
If you have problems loading the vignettes, you can find the html files here
Cheatsheet
A list with all relevant functions, together with a short description, can be found at the end of the introductory vignette (vignette("MOFA")).
MOFA workflow
<p align="center"> <img src="images/workflow.png"> </p>
Step 1: Fitting the model
First you need to create the MOFA object with your input data, and subsequently train the model. If everything is successful, you should observe an output analogous to the following:
#############################################
## Running trial number 1 with seed 642034 ##
#############################################
Trial 1, Iteration 1: time=0.08 ELBO=-345954.96, Factors=10
Trial 1, Iteration 2: time=0.10 ELBO=-283729.31, deltaELBO=62225.6421, Factors=10
Trial 1, Iteration 3: time=0.10 ELBO=-257427.42, deltaELBO=26301.8893, Factors=10
...
Trial 1, Iteration 100: time=0.07 ELBO=-221171.01, deltaELBO=0.0998, Factors=10
Converged!
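If you want to follow convergence programmatically, the log lines above can be parsed. The sketch below assumes the exact line format shown in the example output (plain decimal numbers, deltaELBO absent on the first iteration). The helper name parse_elbo_log is illustrative and is not part of MOFA or mofapy.

```python
import re

# Pattern for lines such as:
#   Trial 1, Iteration 2: time=0.10 ELBO=-283729.31, deltaELBO=62225.6421, Factors=10
# The deltaELBO field is optional because it is missing on the first iteration.
LOG_LINE = re.compile(
    r"Trial (?P<trial>\d+), Iteration (?P<iteration>\d+): "
    r"time=(?P<time>[\d.]+) ELBO=(?P<elbo>-?[\d.]+)"
    r"(?:, deltaELBO=(?P<delta>-?[\d.]+))?"
    r", Factors=(?P<factors>\d+)"
)


def parse_elbo_log(text):
    """Return one dict per iteration line found in a training log (hypothetical helper)."""
    records = []
    for line in text.splitlines():
        match = LOG_LINE.search(line)
        if match is None:
            continue  # banner lines, "...", "Converged!" and similar
        delta = match.group("delta")
        records.append({
            "trial": int(match.group("trial")),
            "iteration": int(match.group("iteration")),
            "elbo": float(match.group("elbo")),
            "delta_elbo": float(delta) if delta is not None else None,
            "factors": int(match.group("factors")),
        })
    return records


example_log = """\
Trial 1, Iteration 1: time=0.08 ELBO=-345954.96, Factors=10
Trial 1, Iteration 2: time=0.10 ELBO=-283729.31, deltaELBO=62225.6421, Factors=10
Trial 1, Iteration 100: time=0.07 ELBO=-221171.01, deltaELBO=0.0998, Factors=10
Converged!
"""

records = parse_elbo_log(example_log)
for rec in records:
    print(rec["iteration"], rec["elbo"], rec["delta_elbo"])
```

A steadily shrinking deltaELBO, as in the example output, is what you would expect to see before the "Converged!" message.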
Step 2: Downstream analysis: disentangle the variability between omics
MOFA disentangles the heterogeneity of a high-dimensional multi-omics data set in terms of a small number of latent factors that capture the global sources of variation. Importantly, MOFA quantifies the variance explained by each factor in each of the omics. The plot below shows an example.
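To make the quantity concrete, here is a minimal sketch of a per-factor, per-view variance explained calculation. It assumes the usual coefficient of determination (one minus residual over total sum of squares) on feature-centred data; the function names and the toy matrices are illustrative, and the exact computation used by the MOFA package is documented in the paper and the vignettes.

```python
def reconstruct(Z, W):
    """Noise-free data implied by a linear factor model: Y = Z W^T."""
    return [[sum(z * w for z, w in zip(z_row, w_row)) for w_row in W]
            for z_row in Z]


def variance_explained(Y, Z, W, k):
    """R^2 of factor k for one view.

    Y: samples x features data, assumed centred per feature (None = missing).
    Z: samples x factors factor values.
    W: features x factors loadings.
    """
    ss_res = 0.0
    ss_tot = 0.0
    for n, row in enumerate(Y):
        for d, y in enumerate(row):
            if y is None:
                continue  # missing values do not contribute
            prediction = Z[n][k] * W[d][k]
            ss_res += (y - prediction) ** 2
            ss_tot += y ** 2
    return 1.0 - ss_res / ss_tot


# Toy example: 4 samples, 2 factors, 2 views (all values are made up).
Z = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
views = {
    "mRNA": [[2, 1], [1, 0]],   # 2 features, driven mostly by factor 0
    "methylation": [[0, 3]],    # 1 feature, driven only by factor 1
}

r2 = {}
for name, W in views.items():
    Y = reconstruct(Z, W)
    r2[name] = [variance_explained(Y, Z, W, k) for k in range(2)]
    print(name, [round(value, 3) for value in r2[name]])
```

In this toy setting factor 0 explains most of the mRNA view and none of the methylation view, which is the kind of pattern the variance explained plot is designed to reveal.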
<p align="center"> <img src="images/varExplained.png" style="width: 50%; height: 50%"/> </p>
Step 3: Annotation of factors
The next step is to try and interpret what the factors are. We have built a semi-automated pipeline to allow the exploration of the latent space:
(1) Visualisation of the samples in the factor space: as in Principal Component Analysis, it is useful to plot the factors against each other and colour the samples using known covariates such as batch, sex, clinical information, etc.
(2) Correlation of factors with (clinical) covariates.
(3) Inspection of the loadings: loadings provide a measure of feature importance for each factor.
(4) Feature set enrichment analysis: the inspection of loadings can sometimes be challenging, particularly when there is a large number of features. Summarising genes in terms of biological pathways can be useful in such cases.
Please refer to the vignettes for details on the different analyses.
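As an illustration of correlating factors with covariates, the sketch below computes a Pearson correlation between one factor and two sample covariates. All values are made up, the helper name pearson is illustrative, and in practice you would use the functions provided by the MOFA R package.

```python
import math


def pearson(x, y):
    """Pearson correlation, ignoring samples where either value is missing (None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical factor values for six samples and two covariates.
factor_1 = [-1.2, -0.8, -1.0, 0.9, 1.1, 1.0]
covariates = {
    "sex": [0, 0, 0, 1, 1, 1],        # coded as 0/1
    "age": [61, 48, 70, None, 55, 66],  # one missing value
}

for name, values in covariates.items():
    print(name, round(pearson(factor_1, values), 3))
```

A factor that correlates strongly with a known covariate (here the made-up sex variable) is a natural first candidate for annotation.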
Step 4: Using the factors in downstream analysis
The latent factors can be used for several purposes, such as:
(1) Non-linear dimensionality reduction: the latent factors can be fed into non-linear dimensionality reduction techniques such as UMAP or t-SNE. This is very powerful because you can detect variability or stratifications beyond the RNA expression!
(2) Imputation: factors can be used to predict missing values, including entire missing assays.
(3) Predicting clinical response: factors can be fed into Cox models to predict patient survival.
(4) Regressing out technical variability: if a factor is capturing an undesired technical effect, its effect can be regressed out from your original data matrix.
(5) Clustering: clustering in the latent space is often more robust than clustering in the original high-dimensional space.
(6) Factor-QTL mapping: factors are a compressed and denoised representation of your samples, which can make them a better proxy for the phenotype than the expression of individual genes. Hence, a very promising area is to perform QTL mapping with the factors themselves! See this paper for an example (https://www.nature.com/articles/ng.3624).
Again, refer to the vignettes for details on the different analyses.
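Two of the uses above, imputation and regressing out a factor, follow directly from the linear structure of a factor model (data approximated by factors times loadings). The sketch below illustrates both on a toy matrix; the helper names are hypothetical and this is not the implementation used by the MOFA package.

```python
def reconstruct(Z, W, factors=None):
    """Z W^T, optionally restricted to a subset of factor indices."""
    use = range(len(Z[0])) if factors is None else factors
    return [[sum(z_row[k] * w_row[k] for k in use) for w_row in W]
            for z_row in Z]


def impute_from_factors(Y, Z, W):
    """Fill missing entries (None) with the model prediction, keep observed values."""
    predicted = reconstruct(Z, W)
    return [[p if y is None else y for y, p in zip(y_row, p_row)]
            for y_row, p_row in zip(Y, predicted)]


def regress_out_factor(Y, Z, W, factor):
    """Subtract the contribution of one (e.g. technical) factor from the data."""
    unwanted = reconstruct(Z, W, factors=[factor])
    return [[None if y is None else y - u for y, u in zip(y_row, u_row)]
            for y_row, u_row in zip(Y, unwanted)]


# Toy example: 4 samples, 2 factors, 2 features (all values are made up).
Z = [[1, 1], [1, -1], [-1, 1], [-1, -1]]      # samples x factors
W = [[2, 1], [1, 0]]                          # features x factors
Y = [[3, 1], [1, None], [-1, -1], [-3, -1]]   # one missing measurement

imputed = impute_from_factors(Y, Z, W)
cleaned = regress_out_factor(Y, Z, W, factor=1)
print(imputed)
print(cleaned)
```

The same reconstruction can in principle predict a whole assay for a sample, as long as that sample has factor values inferred from its other assays.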
Frequently asked questions
(Q) How do I normalise the data?
Always try to remove any technical source of variability before fitting the model.
For example, for count-based data such as RNA-seq or ATAC-seq we recommend size factor normalisation + variance stabilisation. For microarray DNA methylation data, make sure that samples have no differences in the average intensity.
If this is not done correctly, the model will learn a very strong Factor 1 that will capture this variability, and more subtle sources of variation will be harder to identify.
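As a rough illustration of what size factor normalisation does, the sketch below scales each sample by its library size relative to the average and then applies a log transform. This is a deliberately simple stand-in: the log transform is only a crude form of variance stabilisation, dedicated normalisation tools use more refined estimators, and the function names are illustrative.

```python
import math


def size_factors(counts):
    """One factor per sample: library size divided by the mean library size."""
    library_sizes = [sum(row) for row in counts]
    mean_size = sum(library_sizes) / len(library_sizes)
    return [size / mean_size for size in library_sizes]


def normalise_counts(counts):
    """Divide by size factors, then log-transform (log(1 + x))."""
    factors = size_factors(counts)
    return [[math.log1p(c / f) for c in row] for row, f in zip(counts, factors)]


# Two samples with identical composition but a twofold difference in depth.
counts = [[10, 20, 70], [20, 40, 140]]  # samples x features (made-up counts)
normalised = normalise_counts(counts)
for row in normalised:
    print([round(value, 3) for value in row])
```

After normalisation the two samples look the same, so sequencing depth can no longer dominate the first factor.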
We have implemented a function called regressCovariates that allows the user to regress out a covariate using linear models. See the documentation and the CLL vignette for examples.
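The idea behind regressing out a covariate with a linear model can be sketched as follows: fit a straight line of each feature against the covariate and keep the residuals. This is only an illustration of the principle with a single numeric covariate, not the implementation of regressCovariates; the helper name and the data are made up.

```python
def regress_out_covariate(values, covariate):
    """Residuals of a simple least-squares fit: values ~ intercept + slope * covariate."""
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


# One feature measured in four samples from two batches (coded 0 and 1).
batch = [0, 0, 1, 1]
expression = [1.0, 2.0, 4.0, 5.0]

residuals = regress_out_covariate(expression, batch)
print(residuals)
```

The residuals retain the within-batch differences between samples while the shift between batches is removed.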
(Q) I get the following error when installing the R package:
ERROR: dependencies 'pcaMethods', 'MultiAssayExperiment' are not available for package 'MOFA'
You probably tried to install them using install.packages(). These packages should be installed from Bioconductor instead, for example with BiocManager::install().
(Q) I get one of the following errors when running MOFA:
AttributeError: 'module' object has no attribute 'core.entry_point'
Error in py_module_import(module, convert = convert) :
ModuleNotFoundError: No module named 'mofapy'
First, restart R and try again. If the error persists, then either:
(1) you did not install the mofapy Python package (see the instructions above), or
(2) you have multiple Python installations and R is not detecting the one where mofapy is installed. You need to find the right Python interpreter, which will usually be the one you get when running which python in the terminal. You can test whether the mofapy package is installed by trying to import it INSIDE Python.
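One way to run this check without triggering an import error is sketched below. It uses only the Python standard library, reports which interpreter is running, and tells you whether mofapy can be found from it. The helper name is_installed is illustrative.

```python
import importlib.util
import sys


def is_installed(module_name):
    """True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Compare this path with the one reticulate is configured to use in R.
print("Python interpreter:", sys.executable)
print("mofapy installed:", is_installed("mofapy"))
```

If this reports that mofapy is missing, install it for that interpreter, or point reticulate at the interpreter where it is installed.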
