Pwas
Proteome-Wide Association Study
What is PWAS?
Proteome-Wide Association Study (PWAS) is a protein-centric, gene-based method for conducting genetic association studies. PWAS detects protein-coding genes whose functional variability is correlated with given phenotypes across a cohort. It employs a machine-learning model to assess the functional damage caused to each protein within each sample (given the sample's genotype). These assessments are summarized as effect score matrices, where each combination of sample (row) and gene (column) is assigned a number between 0 (complete loss of function) and 1 (no effect). PWAS creates two such matrices, one for dominant and one for recessive inheritance. Once the matrices are created, PWAS can test various phenotypes, looking for associations between the matrix columns (describing the functional variability of specific proteins) and phenotype values. In the case of a binary phenotype, a significant association would mean that the protein coded by the gene appears more damaged in cases than in controls (or vice versa).
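To make the matrix layout concrete, here is a minimal sketch with made-up numbers. The sample IDs, gene names and scores are all hypothetical, and the comparison of group means is only an illustration of the idea, not the statistical test that PWAS itself performs:

```python
from statistics import mean

# Hypothetical effect score matrix: one row per sample, one column per gene.
# Scores range from 0 (complete loss of function) to 1 (no effect).
genes = ["GENE_A", "GENE_B"]
effect_scores = {
    "sample1": [0.20, 0.95],
    "sample2": [0.35, 1.00],
    "sample3": [0.90, 0.97],
    "sample4": [1.00, 0.99],
}
# Hypothetical binary phenotype: 1 = case, 0 = control.
phenotype = {"sample1": 1, "sample2": 1, "sample3": 0, "sample4": 0}


def mean_score_by_group(gene_index):
    """Return (mean score in cases, mean score in controls) for one gene column."""
    cases = [row[gene_index] for s, row in effect_scores.items() if phenotype[s] == 1]
    controls = [row[gene_index] for s, row in effect_scores.items() if phenotype[s] == 0]
    return mean(cases), mean(controls)


for i, gene in enumerate(genes):
    case_mean, control_mean = mean_score_by_group(i)
    print(f"{gene}: cases={case_mean:.3f}, controls={control_mean:.3f}")
```

In this toy data the first gene has clearly lower scores (more damage) in cases than in controls, which is the kind of pattern a significant association for a binary phenotype would reflect.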
For more details on PWAS you can refer to our paper `PWAS: proteome-wide association study—linking genes and phenotypes by functional variation in proteins <https://doi.org/10.1186/s13059-020-02089-x>`_, published in Genome Biology (2020).
Or, if you are more of a video person, you can watch this talk on YouTube (originally given at RECOMB 2020):

.. image:: https://img.youtube.com/vi/TcgE_xb8ecw/0.jpg
   :target: https://www.youtube.com/watch?v=TcgE_xb8ecw

Installation
Dependencies
PWAS requires Python 3.
Depending on the format of your genetic data, you will have to manually install a relevant Python parser before using PWAS. If you use the PLINK/BED format, you will have to install the `pandas-plink <https://pypi.org/project/pandas-plink/>`_ Python package. If you use the BGEN format, install `bgen_parser <https://github.com/nadavbra/bgen_parser>`_.
Parts of the PWAS pipeline also require other tools (see details below). Specifically, step 2.3 requires a variant assessment tool such as `FIRM <https://github.com/nadavbra/firm>`_.
Upon installation, PWAS will automatically add the following Python packages:
- numpy
- scipy
- pandas
- matplotlib
- biopython
- statsmodels
Install PWAS
Simply run:

.. code-block:: sh

    pip install pwas

Important: Make sure that the pip command refers to Python 3. If you are uncertain, consider using pip3 instead.
Alternatively, to install PWAS directly from this GitHub repository, clone the project into a local directory and run from it:

.. code-block:: sh

    git submodule update --init
    python3 setup.py install

Obtaining the reference genome files
If you need to determine the reference allele of each variant (step 2.2 described below), you will need to download the relevant version of the reference genome. Note that if you use FIRM you will have to download these files anyway.
The reference genome sequences of all human chromosomes (chrXXX.fa.gz files) can be downloaded from UCSC's FTP site at:
- ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ (for version hg19)
- ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ (for version hg38/GRCh38)
The chrXXX.fa.gz files need to be uncompressed to obtain chrXXX.fa files.
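If you prefer to do the decompression from Python rather than with a shell tool, the following is a minimal sketch using only the standard library. The function name and the directory layout are assumptions made for illustration; they are not part of PWAS:

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta_files(directory):
    """Decompress every chr*.fa.gz file in `directory` to a chr*.fa file next to it."""
    written = []
    for gz_path in sorted(Path(directory).glob("chr*.fa.gz")):
        fa_path = gz_path.with_suffix("")  # drops the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(fa_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams the data without loading it all into memory
        written.append(fa_path)
    return written
```

For example, ``decompress_fasta_files("./hg19")`` would process a hypothetical ``./hg19`` directory holding the downloaded files. The compressed originals are left in place.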
IMPORTANT: In version hg19 there's an inconsistency in the reference genome of the M chromosome between UCSC and RefSeq/GENCODE, so the file chrM.fa has to be taken from RefSeq (NC_012920.1) instead of UCSC, from: https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=fasta&sort=&id=251831106&from=begin&to=end&maxplex=1. In GRCh38 all downloaded files should remain as they are.
Usage
Overview
PWAS requires the following input files:

- Phenotypes and (optionally) covariates in a CSV file
- Genotype files [Currently only the PLINK/BED and BGEN formats are supported. An effort to also support VCF files is currently underway, and it should be relatively easy to extend the code to support other formats as well.]

Running PWAS consists of the following steps:

1. Obtain the input genotype & phenotype files

2. Determine per-variant effect scores, which consists of:

   2.1. List all the unique variants in the input genotyping files

   2.2. (Optional) Determine the reference allele of each variant

   2.3. Calculate the effect score of each variant (using the variant assessment tool of your choice)

3. Aggregate the per-variant effect scores into per-gene effect scores, which consists of:

   3.1. Collect the variant effect scores per gene

   3.2. Combine the variant effect scores with per-sample genotypes to obtain gene effect scores across the cohort's samples

4. Find gene-phenotype associations, which consists of:

   4.1. Run association tests (between a selected phenotype and the calculated gene effect scores)

   4.2. Collect the results and perform multiple-hypothesis testing correction

To ensure maximal flexibility and allow the integration of PWAS with other tools in a modular way, each of these steps is run as a separate command-line tool with well-defined inputs and outputs. This means that you can skip any step, provided that you supply the inputs required by the following steps by other means.
Step 1: Obtain the input genotype & phenotype files
As stated, PWAS requires a CSV file with the phenotypic fields of your cohort. This CSV file requires a single column designated for unique sample identifiers (which should correspond to the identifiers in your genotype files). The CSV file should also contain one or more columns for the phenotypes you wish to test, and (preferably) covariates you wish to account for when testing the phenotypes (e.g. sex, age, genetic principal components, genetic batch, etc.). All phenotype and covariate fields must be numeric (i.e. 0s and 1s in the case of binary fields, or any number in the case of continuous fields).
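Before running the pipeline, it can be worth checking that every phenotype and covariate column really is numeric. The following is a minimal sketch of such a check using only the standard library. The function and the column names are hypothetical and not part of PWAS. Note that it also flags empty cells, since this section does not state how PWAS treats missing values:

```python
import csv
import io


def find_non_numeric_fields(csv_file, id_column):
    """Return the set of columns (other than the ID column) holding non-numeric values."""
    bad_columns = set()
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column == id_column:
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                bad_columns.add(column)
    return bad_columns


# Hypothetical example: "has_disease" uses "no" instead of 0 in the second row.
example = io.StringIO(
    "sample_id,sex,age,has_disease\n"
    "s1,0,54,1\n"
    "s2,1,61,no\n"
)
print(find_non_numeric_fields(example, id_column="sample_id"))  # {'has_disease'}
```

For a real dataset, pass an open file handle (e.g. ``open("./ukbb_dataset.csv", newline="")``) instead of the in-memory example.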
If you work with the `UK Biobank <https://www.ukbiobank.ac.uk/>`_, you can use the `ukbb_parser package <https://github.com/nadavbra/ukbb_parser>`_ to easily create a CSV dataset with selected phenotype fields (and automatically extracted covariates for genetic association tests) through its `command-line interface <https://github.com/nadavbra/ukbb_parser#command-line-api>`_.
For example, the following command will create a suitable dataset with 49 prominent phenotypes (both binary/categorical and continuous) and 173 covariates extracted from the UK Biobank (assuming that you have access to the relevant UKBB fields).

.. code-block:: sh

    wget https://raw.githubusercontent.com/nadavbra/ukbb_parser/master/examples/phenotype_specs.py
    create_ukbb_phenotype_dataset --phenotype-specs-file=./phenotype_specs.py --output-dataset-file=./ukbb_dataset.csv --output-covariates-columns-file=./ukbb_covariate_columns.json

On top of the CSV of phenotypes, you will also need a CSV file specifying all the relevant genotyping files. This meta file is expected to list all the relevant genotype sources (one per row), having the following headers:
- name: A unique identifier of the genotype source (e.g. the name of the chromosome or genomic segment)
- format: The format of the genotype source (currently supporting only plink and bgen).
Genotype sources of plink format are expected to have three additional columns: bed_file_path, bim_file_path and fam_file_path (for the BED, BIM and FAM files, respectively). Likewise, genotype sources of bgen format are expected to have the following three columns: bgen_file_path, bgi_file_path and sample_file_path (for the .bgen, .bgen.bgi and .sample files, respectively).
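If you are not working with the UK Biobank, you can write this meta file yourself. Below is a minimal sketch for PLINK sources split by chromosome, using the column names listed above. The helper function, the ``chr21``/``chr22`` source names and the ``/data/genotypes/...`` paths are assumptions made purely for illustration:

```python
import csv
import io

PLINK_COLUMNS = ["name", "format", "bed_file_path", "bim_file_path", "fam_file_path"]


def write_plink_spec(out_file, chromosomes, path_template):
    """Write one row per chromosome; `path_template` uses {chrom} and {ext} placeholders."""
    writer = csv.DictWriter(out_file, fieldnames=PLINK_COLUMNS)
    writer.writeheader()
    for chrom in chromosomes:
        row = {"name": f"chr{chrom}", "format": "plink"}
        for ext in ("bed", "bim", "fam"):
            row[f"{ext}_file_path"] = path_template.format(chrom=chrom, ext=ext)
        writer.writerow(row)


# Hypothetical paths; replace the template with the location of your own files.
buffer = io.StringIO()
write_plink_spec(buffer, chromosomes=[21, 22], path_template="/data/genotypes/chr{chrom}.{ext}")
print(buffer.getvalue())
```

To write an actual file, pass ``open("./genotyping_spec.csv", "w", newline="")`` (a hypothetical file name) in place of the in-memory buffer.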
Generating the meta CSV file of the genotype sources for the UK Biobank dataset can be easily achieved with the same ukbb_parser package. For example, the following command would generate the file for the imputed genotypes in BGEN format:

.. code-block:: sh

    create_ukbb_genotype_spec_file --genotyping-type=imputation --output-file=./ukbb_imputation_genotyping_spec.csv

Very important note: There's a good reason for choosing the UK Biobank's imputed genotypes over their raw markers. Unlike vanilla GWAS and other gene-based methods (e.g. SKAT), for which it's sufficient to have some sampling of the variants in each Linkage Disequilibrium block, PWAS requires full knowledge of all the variants present in each sample. The underlying reason is that PWAS tries to figure out what happens to the genes (from a functional perspective), and missing variants (with functional relevance) are likely to diminish its statistical power to uncover true associations. For this reason, PWAS is expected to work best with complete, unbiased genotyping (e.g. provided by whole-exome sequencing). If your genetic data was collected by SNP-array genotyping, then you will at least have to try to complete the missing variants through imputation.
Step 2: Determine per-variant effect scores
Step 2.1: List all the unique variants in the input genotyping files
To combine all the variant descriptions across the input genotype sources into a unified list, simply use the list_all_variants command provided by PWAS.
For example, to list all the unique imputed variants in the UK Biobank, run:

.. code-block:: sh

    list_all_variants --genotyping-spec-file=./ukbb_imputation_genotyping_spec.csv --output-file=./ukbb_imputed_variants.csv --verbose

Step 2.2 (optional): Determine the reference allele of each variant
In most genetic
