RASQUAL

RASQUAL (Robust Allele Specific QUAntification and quality controL) maps QTLs for sequenced based cellular traits by combining population and allele-specific signals.

Updates

09-08-2018 : Introduced assay specific QC filters in createASVCF.sh.

How to build & install

Please make sure CLAPACK and GSL (GNU Scientific Library) are installed in your environment (if you don't have them, then see below for installation tips). GSL is usually installed in /usr directory. Please check /usr/include/gsl and /usr/lib/libgsl.a are existing.

To build and install RASQUAL, firstly go to the source directory (src), then set environment variables appropriately to point to the CLAPACK library and GSL. Finally use "make" to build and install RASQUAL which will be installed in "$RASQUALDIR/bin".

RASQUALDIR=/path/to/rasqualdir/
cd $RASQUALDIR/src
# Not run!  Please export your environment.
export CFLAGS="-I/path/to/your/CLAPACK-*.*.*.*/INCLUDE -I/path/to/your/CLAPACK-*.*.*.*/F2CLIBS -I/path/to/your/gsl-*.*"
export LDFLAGS="-L/path/to/your/CLAPACK-*.*.*.* -L/path/to/your/CLAPACK-*.*.*.*/F2CLIBS -I/path/to/your/gsl-*.*/lib"
make
make install

QTL mapping with RASQUAL

This section describes how to prepare the input files for RASQUAL. If you prefer, Kaur Alasoo has now developed a set of tools in Python that hopefully simplify the process of creating RASQUAL input files <a href="https://github.com/kauralasoo/rasqual/tree/master/rasqualTools">here</a>. RASQUAL needs two input data files:

A fragment (read) count table, with sample specific offsets (such as library size)
A VCF file with *phased* SNP genotypes and allele-specific counts.

An example of each of these files can be found in the data directory. In the usage example below, RASQUAL takes SNP information from a tabix-indexed VCF file as standard input, while the count table and sample specific offsets are binary files (Y.bin and K.bin, respectively). Tabix-indexing is not strictly necessary but allows for genotype and allelic count information to be accessed quickly from the command line. Using the example data files, you can use the following commands to map expression QTLs for two genes (C11orf21 and TSPAN32) using RASQUAL:

# make sure tabix is installed in your environment
cd $RASQUALDIR
tabix data/chr11.gz 11:2315000-2340000 | bin/rasqual -y data/Y.bin -k data/K.bin -n 24 -j 1 -l 378 -m 62 \
    -s 2316875,2320655,2321750,2321914,2324112 -e 2319151,2320937,2321843,2323290,2324279 \
    -t -f C11orf21 -z
tabix data/chr11.gz 11:2315000-2340000 | bin/rasqual -y data/Y.bin -k data/K.bin -n 24 -j 2 -l 378 -m 60 \
    -s 2323227,2323938,2324640,2325337,2328175,2329966,2330551,2331219,2334884,2335715,2338574,2339093 \
    -e 2323452,2324188,2324711,2325434,2328220,2330040,2330740,2331248,2334985,2337897,2338755,2339430 \
    -t -f TSPAN32 -z

Sample size (in this example, N=24) is given by -n option and the feature ID is given by -j option. Here only two genes exist in this example, thereby j=1, 2. In reality, you may have e.g. >10,000 features in your data, of which you may want to map QTL e.g. for the 12,345th feature, you must set -j 12345. You need to provide the number of testing SNPs and feature SNPs in the cis-window a priori (-l and -m, respectively). RASQUAL also requires the feature start and end positions (as comma separated values without space in ascending order for more than one positions, e.g. such as for a union of exons in this example) as inputs (-s and -e, respectively). By default, RASQUAL outputs QTL mapping results for all tested SNPs, but you can also specify only the lead QTL SNP (-t option). In the output, you can also specify the feature name by -f option. To take account of genotype uncertainty, imputation quality score (R square value) are used in this example (-z option; see the section below).

Output

On output, RASQUAL provides the following values for each tested SNP:

Feature ID*
rs ID*
Chromosome*
SNP position*
Ref allele
Alt allele
Allele frequency (not MAF!)*
HWE Chi-square statistic
Imputation quality score (IA)
Log_10 Benjamini-Hochberg Q-value
Chi square statistic (2 x log Likelihood ratio)*
Effect size (Pi)
Sequencing/mapping error rate (Delta)
Reference allele mapping bias (Phi)
Overdispersion
SNP ID within the region
No. of feature SNPs
No. of tested SNPs*
No. of iterations for null hypothesis
No. of iterations for alternative hypothesis
Random location of ties (tie lead SNP; only useful with -t option)
Log likelihood of the null hypothesis
Convergence status (0=success)
Squared correlation between prior and posterior genotypes (fSNPs)
Squared correlation between prior and posterior genotypes (rSNP)*

You may need columns with (*) for the downstream analysis.

Data preparation

You can find an example expression data for C11orf21 and TSPAN32 genes in the data directory. There are two files: a read/fragment count table (Y.txt) and sample specific offset data (K.txt), both have to be organised in feature by sample format (i.e., row: gene; col: sample). We have included an R script to allow you to convert the count and offset files (text) into binary format for us by RASQUAL. The script converts a table data into a vector of double precision values.

cd $RASQUALDIR
RHOME=/software/R-3.0.0/bin/
$RHOME/R --vanilla --quiet --args data/Y.txt data/K.txt < R/txt2bin.R > log

Note here that, the row number of the fragment count table corresponds to the feature ID given by -j option previously explained. For example, if you set -j 12345 in the RASQUAL command, you would map QTL for the 12,345th feature (row) from the top of the fragment count table.

You will also need to prepare custom VCF files containing the allele specific counts of your target cellular trait at all SNPs. The files have to contain an additional subfield, "AS", located within the genotype field consisting of two integers, the reference and alternative allele counts, separated by a comma. For example, sample i is heterozygous at a SNP and has 1 and 10 reads overlapping the reference and alternative alleles respectively, the genotype field for the sample becomes

... FORMAT ... Sample_i ...
... GT:AS  ... 0|1:1,10 ...

An example VCF file (chr11.gz) can be found in the data directory. Note that, phased genotypes are required for QTL mapping with RASQUAL. Currently, SNP genotypes are only used to map QTLs, but short INDELs and some form of structural variations will be able to use shortly.

Genotype uncertainty

Note on QC for genotype error correction: We have found that, in rare cases, RASQUAL may aggressively overcorrect genotyping errors to obtain a higher likelihood ratio. To detect this, we recommend that you always check the squared correlation between prior and posterior rSNP genotypes on the 25th column of the output. Cases where there is a very large change in genotypes between the prior and posterior should be treated with caution. Another approach that you can use to detect these instances, is to stop updating posterior genotypes (--no-posterior-update) or use the nominal genotype 0, 1 and 2 (--fix-genotype option) and compare the Chi-square statistics with and without genotype correction. Cases where weakly significant QTLs become highly significant with genotype error correction should be treated with caution.

There are 4 options to incorporate genotype uncertainty in RASQUAL:

Allelic probability (AP)

Allelic probability can be obtained from the standard 2-step imputation scheme, where genotypes are first phased then imputed on haplotype-by-haplotype basis. The custom AP field consists of the allelic probabilities (in Log10 scale) of two haplotypes from one individual, separated by a comma:
```
 ... FORMAT ... Sample_i      ...
 ... AP:AS  ... 0.0,-5.0:1,10 ...
```
Genotype likelihood (GL & GP)

You may also use genotype likelihood (in Log10 scale) from conventional genotype imputation in conjunction with phased genotype data:
```
 ... FORMAT    ... Sample_i                ...
 ... GT:GL:AS  ... 0|1:-4.0,-0.1,-0.6:1,10 ...
```
Likewise, you can also provide the genotype likelihood in nominal scale [0-1] with GP FORMAT:
```
 ... FORMAT    ... Sample_i                ...
 ... GT:GP:AS  ... 0|1:0.01,0.99,0.0:1,10 ...
```
Dosage (DS)

Genotype dosage can also be utilised to take account of genotype uncertainty:
```
 ... FORMAT    ... Sample_i     ...
 ... GT:DS:AS  ... 0|1:1.1:1,10 ...
```
Imputation quality score (R square value; RSQ)

Imputation methods often provide a quality score for each SNP locus that approximates squared correlation coefficient between true and observed genotypes (e.g., R^2 from MaCH or Beagle; I^2 from IMPUTE2 ). RASQUAL can convert the score into genotyping error rate to handle uncertainly:
```
 ... INFO            FORMAT ... Sample_i ...
 ... ...;RSQ=0.9;... GT:AS  ... 0|1:1,10 ...
```
If you want to prioritise RSQ, you need to specify -z option for RASQUAL (see above example).
Population allele frequency

If genotype information is not available, RASQUAL can run in SNP free mode (--population-allele-frequency option). In this case, you just need to provide the population allele frequency (not minor allele frequency!) for the SNP in VCF files. Note that, this mode is only feasible when the causal SNP is a feature SNP (e.g., ChIP-seq QTL).
```
 ... INFO           FORMAT ... Sample_i ...
 ... ...;AF=0.4;... AS     ... 1,10     ...
```

We strongly recommend using the AP field for QTL mapping. If there are multiple subfields of AP, GL and DS, AP is prioritised over both GL and DS, and GL is prioritised over DS. If nei

Rasqual

Install / Use

README