MotifRaptor

Overview

Motivation:

Genome-wide association studies (GWAS) have identified thousands of common trait-associated genetic variants, but interpreting their function remains challenging. These genetic variants can overlap the binding sites of transcription factors (TFs) and could therefore alter gene expression. However, we currently lack a systematic understanding of how this mechanism contributes to phenotype.

Results:

We present Motif-Raptor, a TF-centric computational tool that integrates sequence-based predictive models, chromatin accessibility, gene expression datasets and GWAS summary statistics to systematically investigate how TF function is affected by genetic variants. Given trait-associated non-coding variants, Motif-Raptor can recover relevant cell types and critical TFs to drive hypotheses regarding their mechanism of action. We tested Motif-Raptor on complex traits such as rheumatoid arthritis and red blood cell count and demonstrated its ability to prioritize relevant cell types, potential regulatory TFs and non-coding SNPs which have been previously characterized and validated.

Installation

Set channels

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Install from Bioconda

conda create -n motifraptor_env python=3.6
conda activate motifraptor_env
conda install -c bioconda motifraptor

Simple test

Activate the conda environment before running the program.
```
source activate motifraptor_env
```
Motif-Raptor can be run in two ways.

(1) Run from the command line (recommended)
```
MotifRaptor --version
```
(2) Load as a module
```
python
>>>import MotifRaptor
>>>MotifRaptor.__version__
```
If you see the version number, congratulations!

Motif-Raptor Modules Overview

MotifRaptor --help

usage: MotifRaptor [-h] [--version]
                
                {preprocess,preprocess_ukbb_v3,celltype,snpmotif,snpfeature,motiffilter,motifspecific,snpspecific,snpmotifradar,snpindex,snpscan,set,info}
                ...

Analyze motifs and SNPs in the dataset.

positional arguments:
{preprocess,preprocess_ukbb_v3,celltype,snpmotif,snpfeature,motiffilter,motifspecific,snpspecific,snpmotifradar,snpindex,snpscan,set,info}
                     help for subcommand: celltype, snpmotif, snpfeature,
                     motiffilter, motifspecific, snpspecific
 preprocess          Pre-process the summary statistics
 preprocess_ukbb_v3  Pre-process the summary statistics from UKBB version 3
                     TSV files
 celltype            cell type or tissue type analysis help
 snpmotif            snp motif test help
 snpfeature          snp feature help
 motiffilter         motifs filtering help
 motifspecific       motifs specific analysis help
 snpspecific         SNP specific analysis help
 snpmotifradar       SNP motif radar plot help
 snpindex            index the SNPs (with flanking sequences) help
 snpscan             scan SNP database (already indexed) help
 set                 Set Path and Global Values
 info                Get Informationa and Print Global Values

optional arguments:
-h, --help            show this help message and exit
--version             show program's version number and exit

Configure Essential Databases

The database contains essential data for general analysis, including DHS tracks, TF RNA-seq expression, TF motifs, and pre-calculated TF scores.

Please download Database.zip from one of the links shown below. You can choose either a testing database or a complete database. The testing database contains all necessary files but includes only a handful of TFs, so that you can run the tutorial example on a regular machine. The complete database contains the full list of TFs used in our real-world study; however, you need at least 220 GB of disk space, and we recommend running the programs on a cluster.

Alternatively, you can download the testing database and then follow the section Build a Complete Motif Database to compute the full TF list on your own, rather than downloading a large file. In this case, we also recommend running the programs on a cluster.

# A testing database can be downloaded here.
wget https://www.dropbox.com/s/gxeyzgl5m0u55w8/Database.zip
unzip Database.zip

# A complete database can be downloaded here. Make sure you have 220 GB of disk space.
wget https://www.dropbox.com/s/kp5r82x55tfgawf/Database.zip
unzip Database.zip
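
Before fetching the complete database, it can help to confirm that the target filesystem has enough room. The following is a minimal sketch using only the Python standard library. It reads "220 G" as 220 decimal gigabytes, which is an assumption (if GiB was meant, the requirement is slightly larger), and the helper name `has_enough_space` is illustrative rather than part of Motif-Raptor.

```python
import shutil

# Assumption: the 220 G requirement is interpreted as 220 * 10^9 bytes.
REQUIRED_BYTES = 220 * 1000**3


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


# Report the free space in the current directory, where Database.zip would be saved.
free_gb = shutil.disk_usage(".").free / 1000**3
verdict = "enough" if has_enough_space(".") else "not enough"
print(f"{free_gb:.0f} GB free here: {verdict} for the complete database")
```

Run it from the directory where you plan to download and unzip the file, since `shutil.disk_usage` reports on the filesystem that contains the given path.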

To configure Motif-Raptor, set the absolute paths to the general database and the motif database.

MotifRaptor set -pn databasedir -pv $PWD/Database/hg19/
MotifRaptor set -pn motifdatabasedir -pv $PWD/Database/hg19/motifdatabase/

Double-check that the paths are set correctly.

MotifRaptor info

*For the current database, we only support the human genome hg19. The genomic sequence was originally downloaded from UCSC (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.2bit).*

Skip the next section if you are only doing an example tutorial without changing the TF motif set.

Build a Complete Motif Database

You need this section only if you have not downloaded the full version of the database, with its large amount of data in motifdatabase, in the previous steps and would rather spend computing hours to calculate it from scratch.

(1) Download the PFM files and immediately point the motif database directory at them. (The motif database is a folder that contains a set of TF motif files in JASPAR PFM format.)

wget https://www.dropbox.com/s/hx9av7o16efxmus/motifdatabase.zip
unzip motifdatabase.zip
MotifRaptor set -pn motifdatabasedir -pv $PWD/motifdatabase

Motif-Raptor expects TF motif matrices in JASPAR PFM format as input. TF matrices in this format are stored as simple flat text files, e.g.:

A [13 13 3 1 54 1 1 1 0 3 2 5 ]
C [13 39 5 53 0 1 50 1 0 37 0 17 ]
G [17 2 37 0 0 52 3 0 53 8 37 12 ]
T [11 0 9 0 0 0 0 52 1 6 15 20 ]

However, TF matrices in other formats (e.g. MEME, TRANSFAC) can be easily converted to this format with a simple text editor or in batch using the excellent Biopython library. This library offers several well-documented parsers that can be used for this task; see https://biopython.org/docs/1.75/api/Bio.motifs.html for more information.
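
As a rough illustration of the target layout, the sketch below writes a count matrix in the four-row form shown above, starting from per-position counts (the orientation used in the body of a TRANSFAC matrix). It uses only the standard library. The function name `positions_to_jaspar_pfm` and the input layout are illustrative assumptions, and no header line is emitted because the example above has none. For real MEME or TRANSFAC files, the Biopython parsers are likely the safer route.

```python
def positions_to_jaspar_pfm(position_counts):
    """Turn per-position (A, C, G, T) counts into four JASPAR-style PFM rows.

    `position_counts` is a list with one (A, C, G, T) tuple per motif position.
    Each output row mirrors the example layout, e.g. 'A [13 13 3 ]'.
    """
    rows = []
    for index, base in enumerate("ACGT"):
        counts = " ".join(str(position[index]) for position in position_counts)
        rows.append(f"{base} [{counts} ]")
    return "\n".join(rows)


# First three positions of the example matrix above, one tuple per position.
example_positions = [(13, 13, 17, 11), (13, 39, 2, 0), (3, 5, 37, 9)]
pfm_text = positions_to_jaspar_pfm(example_positions)
print(pfm_text)
```

The printed rows should reproduce the first three columns of the example matrix, which is a quick way to confirm that the transposition is oriented correctly before converting a whole motif collection.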

(2) Download the SNP list from the 1000 Genomes Project

wget https://www.dropbox.com/s/9gztf4mdblc44jo/1000G.EUR.QC.plink.simple.vcf

The SNP list VCF file needs to have the first five columns ('CHR','POS','ID','REF','ALT') as follows:

CHR | POS | ID | REF | ALT
------------ | ------------- | ------------- | ------------- | -------------
1 | 2483961 | rs2258734 | A | G
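
If you supply your own SNP list, a quick sanity check of the first five columns can catch formatting problems before indexing. The sketch below is not Motif-Raptor's own validation. It assumes whitespace-separated columns and skips only header lines that start with `#`, so any other header row would be reported as an error. The function names are hypothetical.

```python
def parse_snp_line(line):
    """Split one data line and return its first five fields as a dict.

    Raises ValueError if fewer than five columns are present
    or if POS is not an integer.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    chrom, pos, snp_id, ref, alt = fields[:5]
    return {"CHR": chrom, "POS": int(pos), "ID": snp_id, "REF": ref, "ALT": alt}


def check_snp_file(path, max_lines=1000):
    """Parse up to `max_lines` data lines, skipping '#' header lines and blanks."""
    checked = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            parse_snp_line(line)
            checked += 1
            if checked >= max_lines:
                break
    return checked


# The example row from the table above.
record = parse_snp_line("1\t2483961\trs2258734\tA\tG")
print(record)
```

Calling `check_snp_file("1000G.EUR.QC.plink.simple.vcf")` would then either return the number of lines checked or raise a `ValueError` at the first malformed line.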

(3) Index the SNP list. This step retrieves the flanking sequence centered on each SNP from the hg19 genome in the database and writes the index files to genome_index, the output folder that is also used in the next step. Because only hg19 is currently supported, as described above, the VCF file must also be consistent with hg19.

MotifRaptor snpindex -vcf 1000G.EUR.QC.plink.simple.vcf -gi genome_index -p 4

(4) Scan motifs using the PFM files on this SNP list

MotifRaptor snpscan -gi genome_index -pfm ./motifdatabase/pfmfiles -mo ./motifdatabase/motifscanfiles -p 4

In the output folder, only the '.scale' and '.score' files are needed. You may delete the intermediate results stored in its subfolders.

cd motifdatabase/motifscanfiles/
find . -mindepth 1 -maxdepth 1 -type d -exec rm -rf '{}' +

Tutorial

step 0. prepare input data (pre-processing) from GWAS summary statistics

Input: GWAS summary statistics. Motif-Raptor starts from GWAS summary statistics, which you can obtain from UK Biobank, published papers, or other resources. These files may provide different information, so please make sure your file contains the following columns. For the score column, Motif-Raptor currently supports pvalue, zscore, or chisquare.

ID | CHR | POS | REF | ALT | SCORE (pvalue, zscore, or chisquare)
------------ | ------------- | ------------- | ------------- | ------------- | -------------
rs2258734 | 1 | 2483961 | A | G | 0.003

Example: Download the original data file from (Okada et al. 2010 Nature) and apply your own p-value cutoff to define hits and non-hits. By default, the p-value cutoff is 5E-8. The data file is ~450 MB. If your bandwidth is limited, download the zip file (~100 MB) and unzip it instead.

wget https://www.dropbox.com/s/jnmpu63vqnlc0ig/RA_GWASmeta_TransEthnic_v2.txt

#alternative zip file
wget https://www.dropbox.com/s/c194x1z0bhntfbs/RA_GWASmeta_TransEthnic_v2.zip
unzip RA_GWASmeta_TransEthnic_v2.zip

In this file, columns 1, 2, 3, 4, 5 and 9 are ID, CHR, POS, REF, ALT and SCORE as defined above. Here the score is a p-value.

MotifRaptor preprocess -gwas RA_GWASmeta_TransEthnic_v2.txt -cn 1,2,3,4,5,9 -st pvalue -th 5E-8
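
To make the meaning of the column list and threshold concrete, here is a small sketch of how 1-based column numbers such as `1,2,3,4,5,9` map onto a row and how a p-value cutoff separates hits from non-hits. It only illustrates the idea and is not Motif-Raptor's implementation. It assumes whitespace-separated columns and that a hit is a SNP whose p-value is strictly below the threshold; both assumptions, and the function names, are mine.

```python
def select_columns(fields, column_numbers):
    """Pick fields by 1-based column number, in the order given."""
    return [fields[number - 1] for number in column_numbers]


def classify_row(line, column_numbers, threshold=5e-8):
    """Return (snp_id, is_hit) for one whitespace-separated summary-statistics row.

    Assumption: a SNP counts as a hit when its p-value is below the threshold.
    """
    fields = line.split()
    snp_id, chrom, pos, ref, alt, score = select_columns(fields, column_numbers)
    return snp_id, float(score) < threshold


columns = [int(number) for number in "1,2,3,4,5,9".split(",")]
# Values from the example table above; columns 6-8 hold placeholders
# because their real contents are not described here.
row = "rs2258734 1 2483961 A G x y z 0.003"
print(classify_row(row, columns))
```

With a p-value of 0.003, the example SNP falls on the non-hit side of the default 5E-8 cutoff.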

Output: information for SNP hits and non-hits, written to hitSNP_list.txt, nonhitSNP_list.txt and hitSNP_list.vcf. For the example from (Okada et al. 2010 Nature), you can also download our processed results if you have not run this step yourself.

wget https://www.dropbox.com/s/gpnudp1ba4d2gq3/hitSNP_list.txt
wget https://www.dropbox.com/s/
