Mendelianrandomization
MendelianRandomization is an R package for assessing causal relationships using genetic variants as instrumental variables
Mendelian Randomization
Mendelian Randomization (MR) takes its name from the random segregation and assortment of genes from parents to offspring during gamete formation. It is a method that uses genetic variants to make causal inferences about the relationship between an exposure and an outcome. The basic principle of the MR pipeline is that if a genetic variant either alters the level of, or mimics the biological effects of, an exposure that itself alters disease risk, then that genetic variant should also be related to disease risk. The goal of MR studies is to provide evidence for or against a causal relationship between an exposure and a disease. Genetic variants are used because they are less susceptible to confounding: they are subject to Mendel’s first law, the law of segregation. Genetic variants segregate randomly and independently of environmental factors, and it can be assumed that they also segregate independently of other traits.
The MR approach is similar to a Randomized Controlled Trial (RCT). In an RCT, a population sample is divided randomly into two arms so that potential confounders are evenly distributed; an intervention is applied to one arm (the cases) while the other arm serves as the control. The intervention has both on-target and off-target effects, which are compared against the control group. Because of these on- and off-target effects, the direction of the relationship is difficult to measure. The random segregation of genes under Mendelian inheritance is a natural way of dividing the population into two arms: the case group carries a given genotype, and the control group carries a different genotype. These genotypes result in different products, which can be measured and compared. This comparison is free from confounding and reverse causality, because the germline precedes the disease of interest and because genetic variants segregate randomly and independently. Mendelian Randomization therefore makes it possible to use genetic variants in observational settings to make causal inferences about the relationship between an exposure and an outcome, and the summary statistics from Genome-Wide Association Studies (GWAS) can be used for this. See figure 1.
The Mendelian Randomization framework is straightforward. To test whether there is a causal relationship between an exposure (for example cholesterol) and an outcome (for example celiac disease), a GWAS is needed for both variables. Before interpreting the results of a Mendelian Randomization analysis, the genetic variants must be tested against three criteria:
- the genetic variant must be associated with the exposure of interest
- the genetic variant must not be associated with confounders
- the genetic variant can only be associated with the outcome through the exposure
The first assumption can be verified by examining the strength of the association between the genetic variant and the exposure, for example by selecting only genetic variants that reach genome-wide significance. The second assumption can be examined by testing for a relationship between the genetic variant and measured confounders. The third assumption is the most problematic, because it is very difficult to prove that the genetic variant is associated with the outcome only through the exposure and not through some other biological pathway. See figure 2.
Figure 1: An overview of the Mendelian Randomization approach
Figure 2: An overview of the Mendelian Randomization framework
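The first criterion is usually checked by keeping only variants that are strongly associated with the exposure. The following is a minimal base R sketch of that idea, not the package's own implementation. It assumes a made-up data frame with columns SNP, beta, se and pval, and the conventional genome-wide significance threshold of 5e-8; the approximate F-statistic is a common rule of thumb that this README does not describe.

```r
# Toy exposure summary statistics (made-up values, for illustration only)
exposure_toy <- data.frame(
  SNP  = c("rs1", "rs2", "rs3", "rs4"),
  beta = c(0.12, 0.05, -0.09, 0.01),
  se   = c(0.015, 0.020, 0.012, 0.030),
  pval = c(1e-15, 1.2e-2, 6e-14, 0.74),
  stringsAsFactors = FALSE
)

# Conventional genome-wide significance threshold (an assumption here)
gw_threshold <- 5e-8

# Keep only variants associated with the exposure at genome-wide significance
instruments <- exposure_toy[exposure_toy$pval < gw_threshold, ]

# Approximate per-variant F-statistic as a rough measure of instrument strength
instruments$F_stat <- (instruments$beta / instruments$se)^2

print(instruments)
```

The second and third criteria cannot be checked this mechanically; they need measured confounders or sensitivity analyses.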
This package provides functionality for the following operations:
- mr.calculate.se(): Calculate the standard error from the effect size or log odds ratio when it is not present.
- mr.clump(): Clumping, for pruning SNPs that are in linkage disequilibrium (LD). Also provides a method for finding proxy SNPs to replace SNPs that are in LD.
- mr.cochran.Q.test(): Cochran's Q test, for identifying SNPs that overfit the model or that introduce pleiotropic effects.
- mr.egger.method(): Mendelian Randomization Egger method (MR-Egger) for estimating causality, testing causality, and testing the overall pleiotropy within the data set.
- mr.find.missing.allelic.information(): When your data set lacks allelic information, it can be queried from a reference file.
- mr.forest.plot(): Forest plot of the MR analysis, showing the weight each SNP brings to the study.
- mr.funnel.plot(): Funnel plot of the methods used in the MR analysis, to detect study bias.
- mr.get.chr.pos(): Get the chromosome number and position of SNPs.
- mr.harmonize(): Harmonization of the data set. Align SNPs and remove problematic SNPs, for example palindromic SNPs, mismatched SNPs, and SNPs that have a wrong reference.
- mr.inverse.variance.weighted.method(): Inverse-variance weighted method for averaging the ratio estimates.
- mr.plot(): A highly fashionable way of plotting MR results.
- mr.pre.process(): Pre-process the data: test the data set for missing alleles and missing betas, select for genome-wide significance, remove duplicates, and remove alleles for which the direction cannot be determined.
- mr.qq.p.distribution(): QQ-plot for the p-value distribution of a chosen method.
- mr.qq.plot(): A normal QQ-plot, plotting the sample quantiles against the theoretical normal quantiles.
- mr.remove.region(): Remove a certain region within a chromosome.
- mr.wald.ratio(): Wald ratio, for obtaining a causal estimate from the regression of the exposure on genotype and the regression of the outcome on genotype.
- Test data for trying the package.
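To make the estimators named in this list concrete, here is a minimal base R sketch of the textbook Wald ratio, inverse-variance weighted (IVW) estimate, Cochran's Q statistic and MR-Egger regression. It uses made-up, already harmonized summary statistics and does not call the package; the package functions may differ in arguments, weighting and output. All variable names are illustrative.

```r
# Toy harmonized summary statistics (made-up values, for illustration only)
beta_x <- c(0.10, 0.08, 0.12, 0.05)      # SNP-exposure effects
beta_y <- c(0.050, 0.036, 0.066, 0.020)  # SNP-outcome effects
se_y   <- c(0.010, 0.012, 0.011, 0.015)  # standard errors of SNP-outcome effects

# Wald ratio per SNP: outcome effect divided by exposure effect
wald    <- beta_y / beta_x
wald_se <- se_y / abs(beta_x)            # first-order approximation

# Inverse-variance weighted (fixed-effect) average of the Wald ratios
w      <- 1 / wald_se^2
ivw    <- sum(w * wald) / sum(w)
ivw_se <- sqrt(1 / sum(w))

# Cochran's Q: weighted squared deviations of each ratio from the IVW estimate
Q    <- sum(w * (wald - ivw)^2)
Q_df <- length(wald) - 1
Q_p  <- pchisq(Q, df = Q_df, lower.tail = FALSE)

# MR-Egger: weighted regression of outcome effects on exposure effects with an
# intercept; assumes every beta_x has been oriented to be positive
egger <- lm(beta_y ~ beta_x, weights = 1 / se_y^2)

print(c(ivw = ivw, ivw_se = ivw_se, Q = Q, Q_p = Q_p))
print(coef(egger))
```

In this formulation the IVW estimate equals the slope of the same weighted regression forced through the origin. The MR-Egger intercept is what is tested for overall directional pleiotropy.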
Installing Mendelian Randomization
The package is hosted on Bitbucket, which makes installation and updates straightforward. Before installing Mendelian Randomization, make sure devtools is installed:
install.packages("devtools")
And then you are ready to install the mendelianRandomization package:
devtools::install_bitbucket("matthijsknigge/mendelianRandomization")
Other libraries that are needed in this package:
install.packages("stringr")
install.packages("readr")
install.packages("ggplot2")
install.packages("ggExtra")
install.packages("gridExtra")
install.packages("latex2exp")
This package needs R version 3.2.0 or greater.
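As a convenience, the requirements above can be checked before installing. This is a small sketch using only base R functions; the variable names are illustrative, and the install call is left commented out so that nothing is installed by accident.

```r
# Packages listed above as dependencies
required <- c("stringr", "readr", "ggplot2", "ggExtra", "gridExtra", "latex2exp")

# TRUE if the running R version meets the stated minimum
r_ok <- getRversion() >= "3.2.0"

# Which of the required packages are not yet installed?
installed    <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (!r_ok) warning("R >= 3.2.0 is required")
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
}
```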
Tutorial
The package also contains test data for doing a basic Mendelian Randomization analysis. The first step is to read the data. For this analysis we want to infer causality between an exposure and outcome. In this setup the exposure is Inflammatory bowel disease, and the outcome is Celiac Disease.
# the outcome
data("celiac")
outcome <- celiac
# the exposure
data("Inflammatory.bowel.disease")
exposure <- Inflammatory.bowel.disease
Let's check out the data.
head(outcome)
Here we see a column with SNP identifiers, the effect allele, the effect size, the p-value, and the standard error.
|SNP |effect_allele |Z_OR |P |se |
|-----------|--------------|-----------|----------|----------|
|rs61733845 |T | 0.0353671| 0.4249000| 0.0443226|
|rs1320571 |A | 0.0188218| 0.6590000| 0.0426513|
|rs9729550 |A | 0.1004835| 0.0000025| 0.0213295|
|rs1815606 |G | 0.0677437| 0.0007151| 0.0200204|
|rs7515488 |T | -0.1028082| 0.0001195| 0.0267232|
|rs11260562 |A | -0.0393647| 0.3344000| 0.0407802|
head(exposure)
Here there is a column with the SNP identifiers, the effect allele, the effect size (beta), the standard error of the genetic effect, and the p-value.
| |SNP |effect_allele | beta| se| pval|
|------|-----------|--------------|----------|----------|------|
|13665 |rs1003342 |NA | NA| NA| 0e+00|
|13666 |rs10051722 |A | 0.0616269| 0.0107204| 0e+00|
|13667 |rs10061469 |A | 0.0518248| 0.0105946| 1e-06|
|13668 |rs10065637 |G | 0.0686809| 0.0128937| 1e-07|
|13669 |rs10142466 |NA | NA| NA| 0e+00|
|13671 |rs10486483 |A | 0.0602257| 0.0102041| 2e-07|
Both files need to contain at least the SNP id, beta, se, pval, and effect allele. The exposure must also contain both alleles in order to infer the direction of the effect. Since the other allele is missing for the exposure, we have to query it. The alleles are queried with mr.find.missing.allelic.information().
exposure <- mr.find.missing.allelic.information(data = exposure,
thousand.G = "/path/to/reference.bim")
head(exposure)
| |SNP |effect_allele | beta| se| pval|other_allele |
|------|-----------|--------------|----------|----------|------|-------------|
|13665 |rs1003342 |NA | NA| NA| 0e+00| |
|13666 |rs10051722 |A | 0.0616269| 0.0107204| 0e+00|C |
|13667 |rs10061469 |A | 0.0518248| 0.0105946| 1e-06| |
|13668 |rs10065637 |G | 0.0686809| 0.0128937| 1e-07| |
|13669 |rs10142466 |NA | NA| NA| 0e+00| |
|13671 |rs10486483 |A | 0.0602257| 0.0102041| 2e-07|G |
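A quick way to confirm that a data set carries the columns listed above is to compare its column names against the required set. The helper below is hypothetical and not part of the package; the toy row reuses the values shown for rs10051722.

```r
# Hypothetical helper: report which required columns a data set lacks
missing_columns <- function(data,
                            required = c("SNP", "effect_allele", "other_allele",
                                         "beta", "se", "pval")) {
  setdiff(required, colnames(data))
}

# Toy data mimicking the exposure before and after the allele lookup
before <- data.frame(SNP = "rs10051722", effect_allele = "A",
                     beta = 0.0616269, se = 0.0107204, pval = 0,
                     stringsAsFactors = FALSE)
after  <- cbind(before, other_allele = "C", stringsAsFactors = FALSE)

print(missing_columns(before))
print(missing_columns(after))
```

An empty result means every required column is present. It says nothing about whether individual rows still have missing values, as several rows in the table above do.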
Now that we have what we need, we can start pre-processing the data. But first let's check out how many SNPs we have for exposure and outcome.
# exposure amount of SNPs
length(exposure$SNP)
> 196
# outcome amount of SNPs
length(outcome$SNP)
> 97434
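The two data sets differ greatly in size, and only SNPs present in both can contribute to the analysis. A small sketch of how the overlap could be counted, using made-up identifiers rather than the package data:

```r
# Toy SNP identifiers (made up); with real data these would be the SNP
# identifier columns of the exposure and the outcome
exposure_snps <- c("rs1", "rs2", "rs3", "rs4")
outcome_snps  <- c("rs2", "rs4", "rs5", "rs6", "rs7")

# SNPs present in both data sets
shared <- intersect(exposure_snps, outcome_snps)
print(length(shared))

# Number of duplicated identifiers in the exposure
n_duplicates <- sum(duplicated(exposure_snps))
print(n_duplicates)
```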
The next step is to pre-process the exposure, and
