MotifRaptor
Explore the effect of genetic variants on transcription factor binding sites
Overview
Motivation:
Genome-wide association studies (GWAS) have identified thousands of common trait-associated genetic variants, but interpretation of their function remains challenging. These genetic variants can overlap the binding sites of transcription factors (TFs) and therefore could alter gene expression. However, we currently lack a systematic understanding of how this mechanism contributes to phenotype.
Results:
We present Motif-Raptor, a TF-centric computational tool that integrates sequence-based predictive models, chromatin accessibility, gene expression datasets and GWAS summary statistics to systematically investigate how TF function is affected by genetic variants. Given trait-associated non-coding variants, Motif-Raptor can recover relevant cell types and critical TFs to drive hypotheses regarding their mechanism of action. We tested Motif-Raptor on complex traits such as rheumatoid arthritis and red blood cell count and demonstrated its ability to prioritize relevant cell types, potential regulatory TFs and non-coding SNPs which have been previously characterized and validated.
Installation
- Set channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
- Install from Bioconda
conda create -n motifraptor_env python=3.6
conda activate motifraptor_env
conda install -c bioconda motifraptor
- Simple test
Activate the conda environment before running the program.
source activate motifraptor_env
Motif-Raptor can be run in two different ways.
(1) Run from the command line (recommended)
MotifRaptor --version
(2) Load as a module
python
>>> import MotifRaptor
>>> MotifRaptor.__version__
If you see the version number, congratulations!
Motif-Raptor Modules Overview
MotifRaptor --help
usage: MotifRaptor [-h] [--version]
{preprocess,preprocess_ukbb_v3,celltype,snpmotif,snpfeature,motiffilter,motifspecific,snpspecific,snpmotifradar,snpindex,snpscan,set,info}
...
Analyze motifs and SNPs in the dataset.
positional arguments:
{preprocess,preprocess_ukbb_v3,celltype,snpmotif,snpfeature,motiffilter,motifspecific,snpspecific,snpmotifradar,snpindex,snpscan,set,info}
help for subcommand: celltype, snpmotif, snpfeature,
motiffilter, motifspecific, snpspecific
preprocess Pre-process the summary statistics
preprocess_ukbb_v3 Pre-process the summary statistics from UKBB version 3
TSV files
celltype cell type or tissue type analysis help
snpmotif snp motif test help
snpfeature snp feature help
motiffilter motifs filtering help
motifspecific motifs specific analysis help
snpspecific SNP specific analysis help
snpmotifradar SNP motif radar plot help
snpindex index the SNPs (with flanking sequences) help
snpscan scan SNP database (already indexed) help
set Set Path and Global Values
info Get Informationa and Print Global Values
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
Configure Essential Databases
The database contains essential data for general analysis, including DHS tracks, TF RNA-seq expression, TF motifs, and pre-calculated TF scores.
Please download Database.zip from the links shown below. You can choose either a testing database or a complete database. The testing database contains all necessary files but includes only a handful of TFs, so that you can run the tutorial example on a regular machine. The complete database contains the full list of TFs used in our real-world study; however, it requires at least 220 G of disk space, and we recommend running the programs on a cluster.
Alternatively, you may download the testing database and then follow the section Build a Complete Motif Database to compute the full TF list on your own rather than downloading a big file. In this case, we also recommend running the programs on a cluster.
# A testing database can be downloaded here.
wget https://www.dropbox.com/s/gxeyzgl5m0u55w8/Database.zip
unzip Database.zip
# A complete database can be downloaded here. Make sure you have 220 G disk space.
wget https://www.dropbox.com/s/kp5r82x55tfgawf/Database.zip
unzip Database.zip
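Before fetching the complete database, it can help to confirm that the target filesystem has enough room. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and treating "220 G" as 220 × 10^9 bytes is an assumption, not something Motif-Raptor specifies.

```python
import shutil


def free_space_gb(path="."):
    """Return the free space on the filesystem holding `path`, in units of 10**9 bytes."""
    return shutil.disk_usage(path).free / 10**9


def has_room_for(path=".", required_gb=220):
    """Return True if at least `required_gb` (assumed 10**9-byte units) are free at `path`."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    if has_room_for("."):
        print("Enough free space for the complete database.")
    else:
        print("Less than 220 G free here; consider the testing database instead.")
```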
To configure Motif-Raptor, set the absolute paths to the general database and the motif database.
MotifRaptor set -pn databasedir -pv $PWD/Database/hg19/
MotifRaptor set -pn motifdatabasedir -pv $PWD/Database/hg19/motifdatabase/
Double-check that the paths are set correctly.
MotifRaptor info
*For the current database, we only support human genome hg19. The genomic sequence was originally downloaded from UCSC (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.2bit).*
Skip the next section if you are only doing an example tutorial without changing the TF motif set.
Build a Complete Motif Database
You need this section only if you have not downloaded the full version of the database, which includes a large amount of data in motifdatabase (see the previous steps), and would rather spend computing hours to calculate it from scratch.
(1) Download the PFM files and change the motif database directory immediately. (The motif database is a folder that contains a set of TF motif files in JASPAR PFM format.)
wget https://www.dropbox.com/s/hx9av7o16efxmus/motifdatabase.zip
unzip motifdatabase.zip
MotifRaptor set -pn motifdatabasedir -pv $PWD/motifdatabase
Motif-Raptor expects TF motif matrices as input in JASPAR PFM format. TF matrices in this format are stored as simple flat text files, e.g.:
A [13 13 3 1 54 1 1 1 0 3 2 5 ]
C [13 39 5 53 0 1 50 1 0 37 0 17 ]
G [17 2 37 0 0 52 3 0 53 8 37 12 ]
T [11 0 9 0 0 0 0 52 1 6 15 20 ]
TF matrices in other formats (e.g. MEME, TRANSFAC) can be converted to this format with a simple text editor or in batch using the Biopython library, which offers several well-documented parsers that can be used for this task; see https://biopython.org/docs/1.75/api/Bio.motifs.html for more information.
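After converting matrices by hand or in batch, a quick sanity check can catch malformed rows. The sketch below parses the `A [ ... ]` row layout shown above with plain Python and verifies that all four rows have the same length. The function name `parse_pfm` is hypothetical, and the exact whitespace and header rules Motif-Raptor accepts are not documented here, so treat the tolerance for blank lines and `>` header lines as an assumption.

```python
import re

# One row looks like: "A [13 13 3 1 54 1 1 1 0 3 2 5 ]"
ROW = re.compile(r"^\s*([ACGT])\s*\[([^\]]*)\]\s*$")

EXAMPLE_PFM = """\
A [13 13 3 1 54 1 1 1 0 3 2 5 ]
C [13 39 5 53 0 1 50 1 0 37 0 17 ]
G [17 2 37 0 0 52 3 0 53 8 37 12 ]
T [11 0 9 0 0 0 0 52 1 6 15 20 ]
"""


def parse_pfm(text):
    """Parse bracketed PFM rows into {base: [counts]} and check row lengths."""
    counts = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(">"):
            continue  # assumption: blank lines and '>' header lines may be skipped
        match = ROW.match(line)
        if match is None:
            raise ValueError(f"unrecognized PFM row: {line!r}")
        counts[match.group(1)] = [float(x) for x in match.group(2).split()]
    if set(counts) != set("ACGT"):
        raise ValueError("expected one row each for A, C, G and T")
    if len({len(row) for row in counts.values()}) != 1:
        raise ValueError("PFM rows have different lengths")
    return counts


if __name__ == "__main__":
    pfm = parse_pfm(EXAMPLE_PFM)
    print("motif width:", len(pfm["A"]))
```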
(2) Download the SNP list from the 1000 Genomes Project
wget https://www.dropbox.com/s/9gztf4mdblc44jo/1000G.EUR.QC.plink.simple.vcf
The SNP list VCF file needs to have the first five columns ('CHR','POS','ID','REF','ALT') as follows:
CHR | POS | ID | REF | ALT
------------ | ------------- | ------------- | ------------- | -------------
1 | 2483961 | rs2258734 | A | G
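To check a custom SNP list against the column layout above before indexing, a small reader such as the following can be used. This is a sketch under stated assumptions: fields are whitespace-separated, header lines start with `#`, and the function name `read_snp_columns` is hypothetical rather than part of Motif-Raptor.

```python
def read_snp_columns(lines):
    """Yield (chrom, pos, snp_id, ref, alt) from the first five columns of VCF-like lines."""
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # assumption: header and comment lines start with '#'
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(
                f"line {number}: expected at least 5 columns, got {len(fields)}"
            )
        chrom, pos, snp_id, ref, alt = fields[:5]
        if not pos.isdigit():
            raise ValueError(f"line {number}: POS is not an integer: {pos!r}")
        yield chrom, int(pos), snp_id, ref, alt


if __name__ == "__main__":
    sample = ["#CHROM\tPOS\tID\tREF\tALT", "1\t2483961\trs2258734\tA\tG"]
    for record in read_snp_columns(sample):
        print(record)
```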
(3) Index the SNP list (genome_index is the output folder name, which will also be used in the next step). This step retrieves the flanking sequences centered on each SNP and generates index files using the hg19 genome from the database. (Currently we only support hg19, as described above, so the VCF file must also be consistent with hg19.)
MotifRaptor snpindex -vcf 1000G.EUR.QC.plink.simple.vcf -gi genome_index -p 4
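The idea of retrieving flanking sequences centered on a SNP can be illustrated with a toy example. This sketch is not Motif-Raptor's implementation: it works on an in-memory string rather than hg19.2bit, and the function name and the default flank length of 30 are assumptions made for illustration only.

```python
def flanking_alleles(sequence, pos, ref, alt, flank=30):
    """Return (ref_window, alt_window) around a SNP at 1-based position `pos`."""
    index = pos - 1  # convert the 1-based VCF position to a 0-based string index
    observed = sequence[index:index + len(ref)]
    if observed.upper() != ref.upper():
        raise ValueError(f"REF {ref!r} does not match sequence {observed!r} at {pos}")
    left = sequence[max(0, index - flank):index]
    right = sequence[index + len(ref):index + len(ref) + flank]
    # The two windows differ only at the variant itself.
    return left + ref + right, left + alt + right


if __name__ == "__main__":
    toy_chromosome = "ACGTACGTAC"
    print(flanking_alleles(toy_chromosome, pos=5, ref="A", alt="G", flank=2))
```

Building both windows from the same left and right flanks keeps the reference and alternate sequences aligned, so any difference in downstream motif scores comes from the allele alone.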
(4) Scan motifs using PFM files on this SNP list
MotifRaptor snpscan -gi genome_index -pfm ./motifdatabase/pfmfiles -mo ./motifdatabase/motifscanfiles -p 4
In the output folder, only '.scale' and '.score' files are useful. You may delete intermediate results in those folders.
cd motifdatabase/motifscanfiles/
find . -mindepth 1 -maxdepth 1 -type d -exec rm -rf '{}' +
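For readers unfamiliar with motif scanning in step (4), the general technique can be sketched as a log-odds position weight matrix scored against the reference and alternate windows of a SNP. This is a generic illustration, not the scoring used by `MotifRaptor snpscan`: the pseudocount, the uniform background, the forward-strand-only scan, and all function names are assumptions.

```python
import math


def pfm_to_pwm(pfm, pseudocount=0.5, background=0.25):
    """Convert {base: [counts]} into a list of per-position log2-odds dictionaries."""
    width = len(pfm["A"])
    pwm = []
    for j in range(width):
        total = sum(pfm[base][j] for base in "ACGT") + 4 * pseudocount
        pwm.append({
            base: math.log2((pfm[base][j] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def best_score(pwm, seq):
    """Return the highest score over all forward-strand windows of `seq` (A/C/G/T only)."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(column[base] for column, base in zip(pwm, seq[k:k + width]))
        for k in range(len(seq) - width + 1)
    )


if __name__ == "__main__":
    # Toy motif that strongly prefers "ACG"; the counts are made up for illustration.
    toy_pfm = {"A": [10, 0, 0], "C": [0, 10, 0], "G": [0, 0, 10], "T": [0, 0, 0]}
    pwm = pfm_to_pwm(toy_pfm)
    ref_score = best_score(pwm, "TTACGTT")
    alt_score = best_score(pwm, "TTAAGTT")
    print("score change (alt - ref):", alt_score - ref_score)
```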
Tutorial
step 0. prepare input data (pre-processing) from GWAS summary statistics
Input: GWAS summary statistics. Motif-Raptor starts from GWAS summary statistics, which you can obtain from UK Biobank, published papers, or other resources. These files may provide different information, so please make sure your file contains the following columns. For the score column, Motif-Raptor currently supports pvalue, zscore, or chisquare.
ID | CHR | POS | REF | ALT | SCORE(pvalue, zscore, or chisquare)
------------ | ------------- | ------------- | ------------- | ------------- | -------------
rs2258734 | 1 | 2483961 | A | G | 0.003
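The three supported score types are related by standard formulas, which can be useful when comparing files that report different statistics. The sketch below assumes a two-sided test for z-scores and one degree of freedom for chi-square statistics; how Motif-Raptor handles these scores internally is not stated here, and the function names are hypothetical.

```python
import math


def z_to_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


def chisq_to_p(chisq):
    """p-value for a chi-square statistic with 1 degree of freedom (assumed)."""
    if chisq < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(chisq / 2))


if __name__ == "__main__":
    print("p for z = 1.96:", z_to_p(1.96))
    print("p for chisq = 1.96**2:", chisq_to_p(1.96 ** 2))
```

With one degree of freedom, a chi-square statistic is the square of a z-score, so both functions return the same p-value for matching inputs.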
Example: Download the original data file from (Okada et al. 2010 Nature) and apply your own p-value cutoff to define hits and non-hits. By default, the p-value cutoff is 5E-8. This data file is ~450M. If your internet connection is limited, please download the zip file (~100M) and unzip it instead.
wget https://www.dropbox.com/s/jnmpu63vqnlc0ig/RA_GWASmeta_TransEthnic_v2.txt
#alternative zip file
wget https://www.dropbox.com/s/c194x1z0bhntfbs/RA_GWASmeta_TransEthnic_v2.zip
unzip RA_GWASmeta_TransEthnic_v2.zip
<!--
```
#original link of this data file (you don't need to download it again):
#wget https://grasp.nhlbi.nih.gov/downloads/ResultsOctober2016/Okada/RA_GWASmeta_TransEthnic_v2.txt.gz
#gunzip RA_GWASmeta_TransEthnic_v2.txt.gz
```
-->
In this file, columns 1,2,3,4,5,9 are ID,CHR,POS,REF,ALT,SCORE as defined above. Here the score is pvalue.
MotifRaptor preprocess -gwas RA_GWASmeta_TransEthnic_v2.txt -cn 1,2,3,4,5,9 -st pvalue -th 5E-8
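Conceptually, the command above selects six columns and splits SNPs at a p-value threshold. The following sketch mirrors that idea in plain Python; it is not the `preprocess` implementation. The strict `<` comparison, whitespace-separated fields, silent skipping of header or malformed rows, and the function name `split_hits` are all assumptions, and the sample rows are made up.

```python
def split_hits(lines, columns=(1, 2, 3, 4, 5, 9), threshold=5e-8):
    """Split rows into (hits, nonhits) using the p-value in the last selected column.

    `columns` holds 1-based indices for ID, CHR, POS, REF, ALT and SCORE.
    """
    hits, nonhits = [], []
    for line in lines:
        fields = line.split()
        try:
            picked = [fields[c - 1] for c in columns]
            pvalue = float(picked[5])
        except (IndexError, ValueError):
            continue  # assumption: skip header lines and malformed rows
        record = tuple(picked[:5]) + (pvalue,)
        if pvalue < threshold:
            hits.append(record)
        else:
            nonhits.append(record)
    return hits, nonhits


if __name__ == "__main__":
    toy_rows = [
        "ID CHR POS REF ALT c6 c7 c8 P",  # made-up header
        "rs1 1 100 A G 1.2 1.1 1.3 1e-9",
        "rs2 1 200 C T 1.0 0.9 1.1 0.5",
    ]
    hits, nonhits = split_hits(toy_rows)
    print(len(hits), "hit(s),", len(nonhits), "non-hit(s)")
```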
Output: information for SNP hits and non-hits, written to hitSNP_list.txt, nonhitSNP_list.txt, and hitSNP_list.vcf. For the example from (Okada et al. 2010 Nature), you can also download our processed results if you have not run the step yourself.
wget https://www.dropbox.com/s/gpnudp1ba4d2gq3/hitSNP_list.txt
wget https://www.dropbox.com/s/
