Pwas

Proteome-Wide Association Study

What is PWAS?

Proteome-Wide Association Study (PWAS) is a protein-centric, gene-based method for conducting genetic association studies. PWAS detects protein-coding genes whose functional variabilities are correlated with given phenotypes across a cohort. It employs a machine-learning model to assess the functional damage caused to each protein within each sample (given the sample's genotype). These assessments are summarized as effect score matrices, where each combination of sample (row) and gene (column) is assigned a number between 0 (complete loss of function) and 1 (no effect). PWAS creates two such matrices, one for dominant and one for recessive inheritance. Once these matrices have been created, PWAS can test various phenotypes, looking for associations between the matrix columns (describing the functional variabilities of specific proteins) and phenotype values. In the case of a binary phenotype, a significant association would mean that the protein coded by the gene appears more damaged in cases than in controls (or vice versa).
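To make the layout of an effect score matrix concrete, here is a toy sketch that builds a tiny matrix by hand and compares the mean score of each gene between cases and controls. The sample IDs, gene names, scores and the ``mean_score_by_group`` helper are all made up for illustration. PWAS applies its own statistical tests, so this only shows the shape of the data and is not its implementation.

```python
# Toy illustration only: all values and names below are hypothetical.
# Rows are samples, columns are genes; each score lies between
# 0 (complete loss of function) and 1 (no effect).
effect_scores = {
    "sample_1": {"GENE_A": 0.2, "GENE_B": 1.0},
    "sample_2": {"GENE_A": 0.4, "GENE_B": 0.9},
    "sample_3": {"GENE_A": 1.0, "GENE_B": 1.0},
    "sample_4": {"GENE_A": 0.8, "GENE_B": 0.9},
}
# Binary phenotype: 1 = case, 0 = control.
phenotype = {"sample_1": 1, "sample_2": 1, "sample_3": 0, "sample_4": 0}


def mean_score_by_group(scores, phenotype, gene):
    """Return (mean score in cases, mean score in controls) for one gene."""
    cases = [row[gene] for sample, row in scores.items() if phenotype[sample] == 1]
    controls = [row[gene] for sample, row in scores.items() if phenotype[sample] == 0]
    return sum(cases) / len(cases), sum(controls) / len(controls)


for gene in ("GENE_A", "GENE_B"):
    case_mean, control_mean = mean_score_by_group(effect_scores, phenotype, gene)
    print(f"{gene}: cases={case_mean:.2f}, controls={control_mean:.2f}")
```

In this toy data GENE_A looks more damaged in cases than in controls, while GENE_B shows no difference. A real analysis would need a proper statistical test that accounts for covariates.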

For more details on PWAS you can refer to our paper `PWAS: proteome-wide association study—linking genes and phenotypes by functional variation in proteins <https://doi.org/10.1186/s13059-020-02089-x>`_, published in Genome Biology (2020).

Or, if you are more of a video person, you can watch this talk on YouTube (originally given at RECOMB 2020):

.. image:: https://img.youtube.com/vi/TcgE_xb8ecw/0.jpg
   :target: https://www.youtube.com/watch?v=TcgE_xb8ecw

Installation

Dependencies

PWAS requires Python 3.

Depending on the format of your genetic data, you will have to manually install a relevant Python parser before using PWAS. If you use the PLINK/BED format, you will have to install the `pandas-plink <https://pypi.org/project/pandas-plink/>`_ Python package. If you use the BGEN format, install `bgen_parser <https://github.com/nadavbra/bgen_parser>`_.

Parts of the PWAS pipeline also require other tools (see details below). Specifically, step 2.3 requires a variant assessment tool such as `FIRM <https://github.com/nadavbra/firm>`_.

Upon installation, PWAS will automatically add the following Python packages:

* numpy
* scipy
* pandas
* matplotlib
* biopython
* statsmodels

Install PWAS

Simply run:

.. code-block:: sh

    pip install pwas

Important: Make sure that the pip command refers to Python 3. If you are uncertain, consider using pip3 instead.

Alternatively, to install PWAS directly from this GitHub repository, clone the project into a local directory and run from it:

.. code-block:: sh

    git submodule update --init
    python3 setup.py install

Obtaining the reference genome files

If you need to determine the reference allele of each variant (step 2.2 described below), you will need to download the relevant version of the reference genome. Note that if you use FIRM you will have to download these files anyway.

The reference genome sequences of all human chromosomes (chrXXX.fa.gz files) can be downloaded from UCSC's FTP site at:

* ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ (for version hg19)
* ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ (for version hg38/GRCh38)

The chrXXX.fa.gz files need to be uncompressed to obtain chrXXX.fa files.
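If you prefer to do the decompression from Python, the following is a minimal sketch using only the standard library. The ``decompress_fasta_files`` helper is an illustrative name and not part of PWAS; it assumes all the downloaded ``.fa.gz`` files sit in one directory.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta_files(directory):
    """Decompress every *.fa.gz file in `directory` into a sibling *.fa file."""
    written = []
    for gz_path in sorted(Path(directory).glob("*.fa.gz")):
        fa_path = gz_path.with_suffix("")  # chr1.fa.gz -> chr1.fa
        # Stream the data so that large chromosomes are never held in memory.
        with gzip.open(gz_path, "rb") as src, open(fa_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(fa_path)
    return written
```

For example, ``decompress_fasta_files("./hg38")`` would process a (hypothetical) ``./hg38`` download directory and leave the original ``.fa.gz`` files in place.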

IMPORTANT: In version hg19 there's an inconsistency in the reference genome of the M chromosome between UCSC and RefSeq/GENCODE, so the file chrM.fa has to be taken from RefSeq (NC_012920.1) instead of UCSC, from: https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=fasta&sort=&id=251831106&from=begin&to=end&maxplex=1. In GRCh38 all downloaded files should remain as they are.

Usage

Overview

PWAS requires the following input files:

* Phenotypes and (optionally) covariates in a CSV file
* Genotype files [Currently only the PLINK/BED and BGEN formats are supported. An effort to also support VCF files is underway, and it should be relatively easy to extend the code to support other formats as well.]

Running PWAS consists of the following steps:

1. Obtain the input genotype & phenotype files
2. Determine per-variant effect scores, which consists of:

   2.1. List all the unique variants in the input genotyping files

   2.2. (Optional) Determine the reference allele of each variant

   2.3. Calculate the effect score of each variant (using the variant assessment tool of your choice)
3. Aggregate the per-variant effect scores into per-gene effect scores, which consists of:

   3.1. Collect the variant effect scores per gene

   3.2. Combine the variant effect scores with per-sample genotypes to obtain gene effect scores across the cohort's samples
4. Find gene-phenotype associations, which consists of:

   4.1. Run association tests (between a selected phenotype and the calculated gene effect scores)

   4.2. Collect the results and perform multiple-hypothesis testing correction

To ensure maximal flexibility and allow the integration of PWAS with other tools in a modular way, each of these steps is a separate command-line tool with well-defined inputs and outputs. This means that you can skip any step, provided that you can supply the inputs required by the following steps in some other way.

Step 1: Obtain the input genotype & phenotype files

As stated, PWAS requires a CSV file with the phenotypic fields of your cohort. This CSV file must contain a single column of unique sample identifiers (which should correspond to the identifiers in your genotype files). The CSV file should also contain one or more columns for the phenotypes you wish to test, and (preferably) covariates you wish to account for when testing the phenotypes (e.g. sex, age, genetic principal components, genetic batch, etc.). All phenotype and covariate fields must be numeric (i.e. 0s and 1s in the case of binary fields, or any number in the case of continuous fields).
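Before running the pipeline it can help to check a phenotype CSV against the requirements above. The sketch below is not part of PWAS: the ``check_phenotype_csv`` helper and the column names (``sample_id``, ``has_disease``, ``sex``, ``age``) are hypothetical. It also flags empty cells as non-numeric, which may be stricter than PWAS itself requires.

```python
import csv
import io


def check_phenotype_csv(csv_file, id_column, value_columns):
    """Return a list of problems found in a phenotype/covariate CSV."""
    problems = []
    seen_ids = set()
    # Data rows start on line 2 (after the header), assuming one row per line.
    for line_number, row in enumerate(csv.DictReader(csv_file), start=2):
        sample_id = row[id_column]
        if sample_id in seen_ids:
            problems.append(f"line {line_number}: duplicate sample ID {sample_id!r}")
        seen_ids.add(sample_id)
        for column in value_columns:
            try:
                float(row[column])
            except (TypeError, ValueError):
                problems.append(
                    f"line {line_number}: non-numeric {column} value {row[column]!r}"
                )
    return problems


# Hypothetical example data: the last row repeats an ID and has a textual value.
example = io.StringIO(
    "sample_id,has_disease,sex,age\n"
    "1001,1,0,54\n"
    "1002,0,1,61\n"
    "1002,0,female,47\n"
)
for problem in check_phenotype_csv(example, "sample_id", ["has_disease", "sex", "age"]):
    print(problem)
```

To check a file on disk, pass an open file object instead of the ``io.StringIO`` example, e.g. ``open("./ukbb_dataset.csv", newline="")``.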

If you work with the `UK Biobank <https://www.ukbiobank.ac.uk/>`_, you can use the `ukbb_parser package <https://github.com/nadavbra/ukbb_parser>`_ to easily create a CSV dataset with selected phenotype fields (and automatically extracted covariates for genetic association tests) through its `command-line interface <https://github.com/nadavbra/ukbb_parser#command-line-api>`_.

For example, the following command will create a suitable dataset with 49 prominent phenotypes (both binary/categorical and continuous) and 173 covariates extracted from the UK Biobank (assuming that you have access to the relevant UKBB fields).

.. code-block:: sh

    wget https://raw.githubusercontent.com/nadavbra/ukbb_parser/master/examples/phenotype_specs.py
    create_ukbb_phenotype_dataset --phenotype-specs-file=./phenotype_specs.py --output-dataset-file=./ukbb_dataset.csv --output-covariates-columns-file=./ukbb_covariate_columns.json

In addition to the phenotype CSV, you will also need a CSV file specifying all the relevant genotyping files. This meta file is expected to list all the relevant genotype sources (one per row), with the following columns:

* name: A unique identifier of the genotype source (e.g. the name of the chromosome or genomic segment)
* format: The format of the genotype source (currently supporting only plink and bgen).

Genotype sources of plink format are expected to have three additional columns: bed_file_path, bim_file_path and fam_file_path (for the BED, BIM and FAM files, respectively). Likewise, genotype sources of bgen format are expected to have the following three columns: bgen_file_path, bgi_file_path and sample_file_path (for the .bgen, .bgen.bgi and .sample files, respectively).
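If you are not working with the UK Biobank, you can write the meta CSV yourself. Here is a minimal sketch for PLINK sources, using the column names described above. The ``write_plink_spec`` helper, the one-source-per-chromosome layout and the path template are assumptions for illustration; adapt them to how your genotype files are actually organized.

```python
import csv

# Column names follow the description above; everything else is hypothetical.
FIELDNAMES = ["name", "format", "bed_file_path", "bim_file_path", "fam_file_path"]


def write_plink_spec(output_path, chromosomes, path_template):
    """Write one row per chromosome, deriving the three PLINK paths from a template."""
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        for chrom in chromosomes:
            # The template gives the shared path prefix of the BED/BIM/FAM files.
            prefix = path_template.format(chrom=chrom)
            writer.writerow({
                "name": f"chr{chrom}",
                "format": "plink",
                "bed_file_path": prefix + ".bed",
                "bim_file_path": prefix + ".bim",
                "fam_file_path": prefix + ".fam",
            })
```

For example, ``write_plink_spec("./genotyping_spec.csv", range(1, 23), "/data/genotypes/cohort_chr{chrom}")`` would list 22 autosomal sources under a made-up ``/data/genotypes`` directory.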

Generating the meta CSV file of the genotype sources for the UK Biobank dataset can be easily achieved with the same ukbb_parser package. For example, the following command would generate the file for the imputed genotypes in BGEN format:

.. code-block:: sh

    create_ukbb_genotype_spec_file --genotyping-type=imputation --output-file=./ukbb_imputation_genotyping_spec.csv

Very important note: There is a good reason for choosing the UK Biobank's imputed genotypes over their raw markers. Unlike vanilla GWAS and other gene-based methods (e.g. SKAT), for which it's sufficient to have some sampling of the variants in each Linkage Disequilibrium block, PWAS requires full knowledge of all the variants present in each sample. The underlying reason is that PWAS tries to figure out what happens to the genes (from a functional perspective), and missing variants (with functional relevance) are likely to diminish its statistical power to uncover true associations. For this reason, PWAS is expected to work best with complete, unbiased genotyping (e.g. provided by whole-exome sequencing). If your genetic data was collected with SNP arrays, then you will at least have to try to complete the missing variants through imputation.

Step 2: Determine per-variant effect scores

Step 2.1: List all the unique variants in the input genotyping files

To combine all the variant descriptions across the input genotype sources into a unified list, simply use the list_all_variants command provided by PWAS.

For example, to list all the unique imputed variants in the UK Biobank, run:

.. code-block:: sh

    list_all_variants --genotyping-spec-file=./ukbb_imputation_genotyping_spec.csv --output-file=./ukbb_imputed_variants.csv --verbose

Step 2.2 (optional): Determine the reference allele of each variant

In most genetic
