SMTpred
SMTpred is a program which combines SNP effects or individual scores from multiple traits according to their sample size, SNP-heritability (h<sup>2</sup>) and genetic correlation (r<sub>G</sub>), in order to create more accurate polygenic risk scores.
Table of Contents
- Introduction
- Installation
- Simple Example
- Input formats
- Output formats
- Additional options
- LDSC wrapper
- Converting OLS effects to SBLUP effects
- Further examples
Introduction
Summary statistics from multiple genetically correlated traits can be combined to obtain more accurate estimates of SNP effects for each trait. More accurate SNP effects lead to higher prediction accuracy. This program combines SNP effects from multiple traits in a way that maximizes the expected prediction accuracy. To do so, it requires estimates of sample size and SNP-heritability (h<sup>2</sup>) for each trait, and genetic correlation (r<sub>G</sub>) for all pairs of traits.
It is also possible to first calculate polygenic risk scores for each trait and then combine those, rather than first combining SNP effects across all traits and then using the combined SNP effects to calculate polygenic risk scores. This can be computationally faster if polygenic risk scores for each trait already exist, and will result in the same multi-trait predictor if there are no missing SNPs.
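The equivalence of the two routes follows from the fact that a profile score is a linear function of the SNP effects. A minimal sketch in plain Python, using made-up genotypes, SNP effects and trait weights (all names and numbers are illustrative assumptions, not SMTpred output):

```python
# Toy check: weighting individual scores gives the same result as
# scoring with weighted SNP effects. All values are invented.

def profile_score(genotypes, betas):
    """Sum of allele counts times SNP effects for one individual."""
    return sum(g * b for g, b in zip(genotypes, betas))

# Allele counts (0/1/2) for 3 individuals at 4 SNPs.
genotypes = [
    [0, 1, 2, 1],
    [2, 0, 1, 0],
    [1, 1, 0, 2],
]

# Single-trait SNP effects for two traits, and hypothetical trait weights.
betas = {
    "traitA": [0.10, -0.20, 0.05, 0.30],
    "traitB": [0.00, 0.15, -0.10, 0.20],
}
weights = {"traitA": 0.7, "traitB": 0.3}

# Route 1: combine SNP effects first, then score.
combined_betas = [
    sum(weights[t] * betas[t][j] for t in betas) for j in range(4)
]
scores_from_betas = [profile_score(g, combined_betas) for g in genotypes]

# Route 2: score each trait first, then combine the scores.
scores_from_scores = [
    sum(weights[t] * profile_score(g, betas[t]) for t in betas)
    for g in genotypes
]

# Both routes agree up to floating point rounding.
for a, b in zip(scores_from_betas, scores_from_scores):
    assert abs(a - b) < 1e-12
```

When a SNP is missing for some traits, the per-trait sums no longer run over the same set of SNPs, so the two routes are not guaranteed to agree.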
By default, single-trait SNP effects are assumed to be OLS estimates (GWAS profile scores). If they are BLUP or SBLUP estimates instead, the --blup option can be used to calculate the appropriate weights. Although the weights will be different under this option, the resulting weighted sum will be very similar, because the expected variances of the SNP effects or individual scores change as well.
The examples below can be recreated using the files in the data directory. However, since this data is based on traits with low r<sub>G</sub>, it will not necessarily increase prediction accuracy.
Installation
Change into your directory of choice and type git clone https://github.com/uqrmaie1/smtpred.git, or click on the green download button to download the zip file. This will take up around 78 MB. Change into the directory smtpred. With a bit of luck, the example in the next section should run without problems. If it doesn't, make sure python refers to version 2.7 and not 3.x, and that all the necessary libraries are installed.
For example, if the pandas library is not installed, you can try to install it via pip install pandas. If the pip package manager is not installed, you could try to install it via easy_install pip. If that doesn't work due to lacking permission, try adding the option --user.
This has been tested under OS X 10.11.6 and under CentOS release 6.8.
Simple example
Let's say we want to combine traitA, traitB and traitC to create a more accurate predictor for traitA. It is assumed that single-trait predictors for traitA, traitB and traitC already exist, and that N, h<sup>2</sup> and r<sub>G</sub> are known and equal 1e5, 0.5 and 0.5, respectively, for all traits and trait pairs.
python smtpred.py \
--h2 0.5 0.5 0.5 \
--rg 0.5 0.5 0.5 \
--n 1e5 1e5 1e5 \
--scorefiles data/individual_scores/OLS/traitA.profile \
data/individual_scores/OLS/traitB.profile \
data/individual_scores/OLS/traitC.profile \
--out data/individual_scores/wMT-OLS/
This will create a file "multi_trait.score" with columns FID, IID and the multi-trait profile score.
Input formats
SNP effect files
Weighting is performed on SNP effects if the option --betafiles or --betapath is specified. The SNP effect files for all traits have to be in the same format, and each needs a header line with three required fields: SNP ID (called snp, snpid, rs, rsid; case insensitive), effect allele (called a1; case insensitive) and SNP effect (called beta or b; case insensitive). SNPs will be matched on their ID and effect allele a1, and optionally on a2 if it exists. a1 (and a2) have to match exactly among traits, otherwise the SNP will not be used.
If the trait is a disease trait and has odds ratios, beta values can be calculated as log(odds ratio).
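As a small illustration of this conversion, the sketch below takes the natural logarithm of a few hypothetical odds ratios (the SNP IDs and values are invented):

```python
import math

# Hypothetical odds ratios for the effect allele a1 of three SNPs.
odds_ratios = {"rs1": 1.25, "rs2": 0.80, "rs3": 1.00}

# beta = natural log of the odds ratio
betas = {snp: math.log(odds_ratio) for snp, odds_ratio in odds_ratios.items()}
```

An odds ratio above 1 gives a positive beta, an odds ratio below 1 gives a negative beta, and an odds ratio of exactly 1 gives a beta of 0.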
--betapath assumes that all files in this directory are SNP effect files.
--betafiles should be followed by space-separated file names (--betafiles trait1.txt trait2.txt).
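The matching rule for SNP effect files can be sketched as follows. This is an illustration only, not SMTpred's own code: it assumes whitespace-delimited files with the header names snp, a1 and beta, and it uses hypothetical in-memory file contents instead of real files.

```python
def read_betas(text):
    """Parse a SNP effect table into {(snp, a1): beta}."""
    lines = text.strip().splitlines()
    # Header names are matched case insensitively.
    header = [name.lower() for name in lines[0].split()]
    i_snp = header.index("snp")
    i_a1 = header.index("a1")
    i_beta = header.index("beta")
    effects = {}
    for line in lines[1:]:
        fields = line.split()
        effects[(fields[i_snp], fields[i_a1])] = float(fields[i_beta])
    return effects

# Hypothetical file contents for two traits.
trait_a = """SNP A1 beta
rs1 A 0.10
rs2 C -0.05
rs3 G 0.02
"""
trait_b = """snp a1 BETA
rs1 A 0.07
rs2 T 0.01
rs4 G 0.03
"""

effects_a = read_betas(trait_a)
effects_b = read_betas(trait_b)

# Keep only SNPs whose ID and effect allele agree in both traits.
shared = sorted(set(effects_a) & set(effects_b))
print(shared)  # [('rs1', 'A')]
```

Here rs2 is dropped because its effect allele differs between the two files, and rs3 and rs4 are dropped because each appears in only one file.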
Score files
Weighting is performed on individual scores if the option --scorefiles or --scorepath is specified. Score files have to be in the format of the output of PLINK --score (.profile files).
--scorepath assumes that all files in this directory are PLINK score files.
--scorefiles should be followed by space-separated file names (--scorefiles trait1.profile trait2.profile).
Sample size file
A file that contains sample size of each trait (option --nfile). This file has no header and two columns: Trait and sample size. Alternatively sample size input can be provided directly using the option --n.
h<sup>2</sup> file
A file that contains SNP-heritability estimates of each trait (option --h2file). This file has no header and two columns: Trait and SNP-heritability. Alternatively SNP-heritability input can be provided directly using the option --h2 (See examples below).
For disease traits, use heritability estimates on the observed scale. Don't convert them to the liability scale.
r<sub>G</sub> file
A file that contains genetic correlation (r<sub>G</sub>) estimates for each pair of traits (option --rgfile). This file has no header and three columns: Trait 1, Trait 2 and r<sub>G</sub>. Alternatively genetic correlation input can be provided directly using the option --rg.
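To make the three file layouts concrete, the sketch below parses hypothetical contents for a sample size file, an h<sup>2</sup> file and an r<sub>G</sub> file, and checks that every pair of traits has an r<sub>G</sub> estimate. The trait names, the values and the use of whitespace as delimiter are assumptions made for illustration; this is not how SMTpred itself reads the files.

```python
from itertools import combinations

# Hypothetical file contents: no header, whitespace-delimited (assumed).
n_text = """traitA 100000
traitB 100000
traitC 100000
"""
h2_text = """traitA 0.5
traitB 0.5
traitC 0.5
"""
rg_text = """traitA traitB 0.5
traitA traitC 0.5
traitB traitC 0.5
"""

def read_trait_values(text):
    """Two columns: trait name and value."""
    values = {}
    for line in text.splitlines():
        if line.strip():
            trait, value = line.split()
            values[trait] = float(value)
    return values

def read_rg(text):
    """Three columns: trait 1, trait 2 and rG. Pairs are stored unordered."""
    rg = {}
    for line in text.splitlines():
        if line.strip():
            trait1, trait2, value = line.split()
            rg[frozenset((trait1, trait2))] = float(value)
    return rg

n = read_trait_values(n_text)
h2 = read_trait_values(h2_text)
rg = read_rg(rg_text)

# Every pair of traits should have an rG estimate.
missing_pairs = [
    pair for pair in combinations(sorted(h2), 2) if frozenset(pair) not in rg
]
```

A check like `missing_pairs` can catch an incomplete r<sub>G</sub> file before running the program, since r<sub>G</sub> is required for all pairs of traits.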
Order of traits
The order of traits is important, because by default (without the --alltraits option) the program will create a multi-trait predictor for the first trait.
The order of traits is taken from the order in which the score files or beta files are listed, or, if these options are not specified, from the h<sup>2</sup> file or otherwise from the sample size file. If none of these have been provided as input, the traits will be sorted alphabetically, provided the --scorepath or --betapath option is specified.
Output formats
SNP effect file
If SNP effects have been provided as input, the file multi_trait.beta contains the multi-trait SNP effects. It has columns for SNP ID, effect allele and multi-trait beta for the trait of interest, which is assumed to be the first trait provided. If multi-trait SNP effects for all traits are of interest, the option --alltraits will result in one column for each trait in the input files.
Score file
If individual scores have been provided as input, the file multi_trait.score contains the multi-trait individual scores. It has columns for FID, IID and multi-trait scores for the trait of interest, which is assumed to be the first trait provided. If multi-trait individual scores for all traits are of interest, the option --alltraits will result in one column for each trait in the input files.
Weights
multi_trait.weights will contain the weights that are used to combine traits. The header line of the file contains the traits that are used to create a multi-trait predictor. Each line contains the weights for creating a multi-trait predictor for one trait, with the first column containing the trait name and the other columns the weights for each trait. If --alltraits is specified, the file will have one line for each trait.
Variances
multi_trait.variances will contain expected variances for each trait. This is necessary because the weights assume that the variances of the SNP effects are exactly identical to their expectations. Since that is not always the case, each trait is scaled to its expected variance before weighting. For OLS effects the expected variance for each trait is h<sup>2</sup>/mtot + 1/n. For BLUP effects the expected variance for each trait is R<sup>2</sup>/meff, where R<sup>2</sup> = h<sup>2</sup>/(1+meff*(1-R<sup>2</sup>)/(n*h<sup>2</sup>)). Despite the differences in weights and expected variances between OLS and BLUP effects, the combined effect of both will mostly cancel out and the specification of the --blup option will not change the weighted output substantially.
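The two expected-variance formulas can be evaluated with a short sketch. Because R<sup>2</sup> appears on both sides of its definition, the code below rearranges it into a quadratic in R<sup>2</sup> and takes the smaller root, which tends to h<sup>2</sup> as n grows. The function names and the values used for mtot and meff are hypothetical, chosen only to make the arithmetic easy to follow, and mtot and meff are read here as a total and an effective number of SNPs, which is an interpretation of the symbols.

```python
import math

def expected_variance_ols(h2, n, mtot):
    """Expected variance of OLS SNP effects: h2/mtot + 1/n."""
    return h2 / mtot + 1.0 / n

def blup_r2(h2, n, meff):
    """Solve R2 = h2 / (1 + meff*(1 - R2)/(n*h2)) for R2.

    With c = meff/(n*h2) this is c*R2^2 - (1 + c)*R2 + h2 = 0;
    the smaller root is the one that lies between 0 and h2.
    """
    c = meff / (n * h2)
    return ((1 + c) - math.sqrt((1 + c) ** 2 - 4 * c * h2)) / (2 * c)

def expected_variance_blup(h2, n, meff):
    """Expected variance of BLUP SNP effects: R2/meff."""
    return blup_r2(h2, n, meff) / meff

# Hypothetical inputs: h2 and n as in the simple example, mtot and meff assumed.
h2, n, mtot, meff = 0.5, 1e5, 1e6, 5e4

var_ols = expected_variance_ols(h2, n, mtot)
r2 = blup_r2(h2, n, meff)
var_blup = expected_variance_blup(h2, n, meff)

# The solution satisfies the implicit definition of R2.
assert abs(r2 - h2 / (1 + meff * (1 - r2) / (n * h2))) < 1e-12
```

With these inputs c equals 1, so R<sup>2</sup> = 1 - sqrt(2)/2, roughly 0.29, and the OLS expected variance is 0.5/1e6 + 1/1e5 = 1.05e-5.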
Additional options
--alltraits
This option specifies that multi-trait weighting should be performed for all traits, rather than just for the first trait.
--blup
This option specifies that the input SNP effect or individual scores are estimated using BLUP, rather than OLS (GWAS estimates). This will aff
