SumStatsRehab
GWAS summary statistics files QC tool
SumStatsRehab is a universal GWAS SumStats pre-processing tool. SumStatsRehab takes care of each of the original data points to maximize the statistical power of downstream calculations. Currently, the only supported operation that may result in a loss of the original data is liftover, which is a common task and is optional.
Examples of what the tool does:
- data validation (e.g. the diagnose command),
- data enrichment (e.g. restoration of missing Chr and BP fields from rsID in the input GWAS SumStats file by rsID lookup in the input dbSNP dataset),
- data correction (restoration of invariably erroneous values),
- data formatting (e.g. sorting),
- data restoration (e.g. calculation of a StdErr field from present p-value and beta fields).
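The data restoration item can be illustrated with a minimal sketch. It assumes the p-value comes from a two-sided Wald test, so that z = |beta| / StdErr and p = 2 * (1 - Phi(z)); the source does not state which formula SumStatsRehab uses internally, and the function name se_from_beta_pval is hypothetical.

```python
from statistics import NormalDist
from typing import Optional


def se_from_beta_pval(beta: float, pval: float) -> Optional[float]:
    """Recover a standard error from beta and a two-sided p-value.

    Assumption: Wald test, i.e. z = |beta| / SE and p = 2 * (1 - Phi(z)).
    Returns None when the inputs do not determine SE.
    """
    half = pval / 2.0
    # p = 1 gives z = 0 (division by zero); p <= 0 has no finite quantile;
    # beta = 0 with p < 1 is inconsistent under this model.
    if not 0.0 < half < 0.5 or beta == 0.0:
        return None
    # Use the lower tail and negate: numerically safer than 1 - p/2 for tiny p.
    z = -NormalDist().inv_cdf(half)
    return abs(beta) / z


# Values taken from the first data row of the example file in step 4
# of the tutorial: beta = -0.1041663, p = 0.5367087, reported SE = 0.1686092.
print(se_from_beta_pval(-0.1041663, 0.5367087))
```

The result should land close to the SE reported in that row, which is what makes the field recoverable when it is missing.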
Example of what the tool does not do:
- data cleaning/reduction
SumStatsRehab aims to be production-grade software, but it may not be a complete data preparation solution for sumstats-ingesting pipelines. Still, the tool streamlines the development of the data preparation stage. This comes from its focus on solving more complex problems that span many different use cases, rather than covering only specific ones.
dependencies:
- python 3.8+
- a GNU/Linux system with bash v4 or v5
- several python packages listed in requirements.txt
- bcftools (only for the prepare_dbSNPs command)
- gz-sort (only for the prepare_dbSNPs command)
Installation and basics
- clone this repo
git clone https://github.com/Kukuster/SumStatsRehab.git && cd SumStatsRehab
- install requirements
pip install -r requirements.txt
- install SumStatsRehab as a package
python3 setup.py build
python3 setup.py install
- run the command using the following syntax:
SumStatsRehab <command> [keys]
Use diagnose to check the validity of entries in the GWAS SS file.
Use fix to restore missing/invalid data in the GWAS SS file.
Use prepare_dbSNPs to preprocess a given dbSNP dataset into 2 datasets, which are used in the fix command.
Use sort to format the input GWAS SS file and sort either by Chr and BP or by rsID.
To use the fix command to its fullest, a user needs:
- SNPs datasets in the target build, preprocessed with the prepare_dbSNPs command.
- a chain file, if the GWAS SS file is provided in a build different from the target build
Deprecated installation method (since March 15, 2022):
pip install git+https://github.com/Kukuster/SumStatsRehab.git
or, for specific version:
pip install git+https://github.com/Kukuster/SumStatsRehab.git@v1.2.0 --upgrade
This installation method doesn't work with the git protocol security changes announced by GitHub:
- https://github.blog/2021-09-01-improving-git-protocol-security-github/
Tutorial
1. Download dbSNP dataset
Download dbSNP datasets from NCBI, in the target build, in vcf, vcf.gz, bcf, or bcf.gz format. The latest versions are recommended. dbSNP datasets are used to restore the following data: Chr, BP, rsID, OA, EA, EAF. Although only builds 37 and 38 are explicitly supported, build 36 may work as well.
For example, the latest datasets for build 38 and build 37 can currently be downloaded here:
https://ftp.ncbi.nih.gov/snp/latest_release/VCF/
2. Download the chain file
A chain file is necessary to perform liftover. If a GWAS SS file is provided in the target build, then a chain file is not used.
3. Preprocess dbSNPs datasets
3.1 Download and install bcftools and gz-sort
See the instructions on their websites and/or GitHub repositories.
Recommended bcftools version: 1.11
NOTE: after preprocessing of the necessary dbSNPs is finished, these tools are no longer needed
3.2 Run preprocessing
Run prepare_dbSNPs using the following syntax:
SumStatsRehab prepare_dbSNPs --dbsnp DBSNP --OUTPUT OUTPUT --gz-sort GZ_SORT --bcftools BCFTOOLS
[--buffer BUFFER]
where:
- DBSNP is the dbSNP dataset in vcf, vcf.gz, bcf, or bcf.gz format referencing build 38 or 37
- OUTPUT is the base name for the two output dbSNPs datasets
- GZ_SORT is a path to the gz-sort executable
- BCFTOOLS is a path to the bcftools executable
- BUFFER is buffer size for sorting (size of presort), supports k/M/G suffix. Defaults to 1G. Recommended: at least 200M, ideally: 4G or more
Depending on the size of the dataset, specified buffer size, and specs of the machine, preprocessing may take somewhere from 30 minutes to 6 hours.
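As a rough sketch of how the invocation might be assembled from a script, the helper below checks the BUFFER value against the 200M recommendation before building the argument list. The function names and all file paths are hypothetical placeholders, and treating k/M/G as binary multiples (1024-based) is an assumption; only the flags themselves come from the syntax shown above.

```python
import re
from typing import List

# Assumption: suffixes are binary multiples; the source does not specify this.
_SUFFIX = {"": 1, "k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_buffer_size(text: str) -> int:
    """Convert a size such as '200M' or '4G' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([kMG]?)", text)
    if match is None:
        raise ValueError(f"unsupported buffer size: {text!r}")
    number, suffix = match.groups()
    return int(number) * _SUFFIX[suffix]


def below_recommended(buffer: str) -> bool:
    """True if the buffer is smaller than the recommended minimum of 200M."""
    return parse_buffer_size(buffer) < parse_buffer_size("200M")


def prepare_dbsnps_argv(dbsnp: str, output: str, gz_sort: str,
                        bcftools: str, buffer: str = "1G") -> List[str]:
    """Build the argument list for the prepare_dbSNPs command."""
    parse_buffer_size(buffer)  # fail early on a malformed size
    return ["SumStatsRehab", "prepare_dbSNPs",
            "--dbsnp", dbsnp, "--OUTPUT", output,
            "--gz-sort", gz_sort, "--bcftools", bcftools,
            "--buffer", buffer]


# Placeholder paths, for illustration only.
argv = prepare_dbsnps_argv("path/to/dbsnp.vcf.gz", "path/to/output_base",
                           "path/to/gz-sort", "path/to/bcftools", buffer="4G")
print(" ".join(argv))
print(below_recommended("100M"))
```

The resulting list can be passed to subprocess.run(argv, check=True) once the paths point at real files.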
After preprocessing, steps 4 and 5 may be repeated as many times as needed.
4. Create a config file for your GWAS SS file
The config file serves as metadata for the GWAS SS file and contains:
- columns' indices (indices start from 0)
- input build slug (such as "GRCh38", "GRCh37", "hg18", "hg19")
This config file has to have the same file name as the GWAS SS file but with an additional .json extension.
For example, if your GWAS SS file is named WojcikG_PMID_htn.gz, and the first 5 lines in the unpacked file are:
Chr Position_hg19 SNP Other-allele Effect-allele Effect-allele-frequency Sample-size Effect-allele-frequency-cases Sample-size-cases Beta SE P-val INFO-score rsid
1 10539 1:10539:C:A C A 0.004378065 49141 0.003603676 27123 -0.1041663 0.1686092 0.5367087 0.46 rs537182016
1 10616 rs376342519:10616:CCGCCGTTGCAAAGGCGCGCCG:C CCGCCGTTGCAAAGGCGCGCCG C 0.9916342 49141 0.9901789 27123 -0.1738814 0.109543 0.1124369 0.604 rs376342519
1 10642 1:10642:G:A G A 0.006042409 49141 0.007277901 27123 0.1794246 0.1482529 0.226179 0.441 rs558604819
1 11008 1:11008:C:G C G 0.1054568 49141 0.1042446 27123 -0.007140072 0.03613677 0.84337 0.5 rs575272151
your config file should have the name WojcikG_PMID_htn.gz.json and the following contents:
{
"Chr": 0,
"BP": 1,
"rsID": 13,
"OA": 3,
"EA": 4,
"EAF": 5,
"beta": 9,
"SE": 10,
"pval": 11,
"INFO": 12,
"N": 6,
"build": "grch37"
}
Notes:
- SumStatsRehab will only consider data from the columns whose indices are specified in the config file. If one of the above columns is present in the SS file but isn't specified in the config file, SumStatsRehab treats the column as missing.
- In this example, all 11 columns from the list of supported columns are present. Yet, none of the columns above are mandatory. If certain columns are missing, the fix command will attempt to restore them.
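Counting 0-based column indices by hand is error-prone, so a config like the one in the example above could be derived from the header line. This is a minimal sketch, not part of SumStatsRehab: build_config and COLUMN_NAMES are hypothetical names, and the header is assumed to be whitespace-delimited with no spaces inside column names.

```python
import json
from typing import Dict

# Header line of the example GWAS SS file from this step.
HEADER = ("Chr Position_hg19 SNP Other-allele Effect-allele "
          "Effect-allele-frequency Sample-size Effect-allele-frequency-cases "
          "Sample-size-cases Beta SE P-val INFO-score rsid")

# Config key -> column name in this particular file (user-supplied mapping).
COLUMN_NAMES = {
    "Chr": "Chr", "BP": "Position_hg19", "rsID": "rsid",
    "OA": "Other-allele", "EA": "Effect-allele",
    "EAF": "Effect-allele-frequency", "beta": "Beta", "SE": "SE",
    "pval": "P-val", "INFO": "INFO-score", "N": "Sample-size",
}


def build_config(header_line: str, column_names: Dict[str, str],
                 build: str) -> Dict[str, object]:
    """Map config keys to 0-based column indices found in the header."""
    header = header_line.split()  # assumption: whitespace-delimited header
    config: Dict[str, object] = {}
    for key, name in column_names.items():
        if name in header:  # columns absent from the file are left out
            config[key] = header.index(name)
    config["build"] = build
    return config


config = build_config(HEADER, COLUMN_NAMES, "grch37")
print(json.dumps(config, indent=4))

# To save it next to the GWAS SS file, under the same name plus ".json":
# with open("WojcikG_PMID_htn.gz.json", "w") as f:
#     json.dump(config, f, indent=4)
```

Columns that are not found in the header are simply omitted, which matches the note above that unspecified columns are treated as missing.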
5. Run the fix command
Once the config file is created, the dbSNP datasets are preprocessed, and the chain file is downloaded (if necessary), the fix command can use all of its features.
Although it is normally a part of the execution of the fix command, a user may choose to manually run diagnose beforehand.
If diagnose is run without additional arguments, it is "read-only", i.e. it doesn't write to the file system.
Run diagnose as follows:
SumStatsRehab diagnose --INPUT INPUT_GWAS_FILE
where INPUT_GWAS_FILE is the path to the GWAS SS file, with the corresponding config file stored under the same name plus the .json extension (see step 4).
As a result, diagnose generates the main plot, a stacked histogram, and an additional bar chart for each bin of the stacked histogram.
These plots will pop up in a new matplotlib window.
The stacked histogram maps the number of invalid SNPs against p-value, allowing assessment of the distribution of invalid SNPs by significance. On the histogram, valid SNPs are shown as blue, and SNPs that have issues are shown as red. The height of the red plot over each bin with the red caption represents the proportion of invalid SNPs in the corresponding bin.
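The quantity behind the stacked histogram can be sketched in a few lines: count valid and invalid SNPs per p-value bin and take the invalid share. This is only an illustration of the idea; the bin edges used here are made up, the function name is hypothetical, and SNPs whose p-value itself is missing or invalid are out of scope of this sketch.

```python
from bisect import bisect_right
from typing import Iterable, List, Sequence, Tuple


def invalid_share_per_bin(
    snps: Iterable[Tuple[float, bool]], edges: Sequence[float]
) -> List[Tuple[int, int, float]]:
    """Return (valid, invalid, invalid share) for each p-value bin.

    Bins are [edges[i], edges[i+1]); the right-most edge joins the last bin.
    """
    n_bins = len(edges) - 1
    counts = [[0, 0] for _ in range(n_bins)]
    for pval, is_valid in snps:
        if not edges[0] <= pval <= edges[-1]:
            continue  # outside the plotted range
        i = min(bisect_right(edges, pval) - 1, n_bins - 1)
        counts[i][0 if is_valid else 1] += 1
    result = []
    for valid, invalid in counts:
        total = valid + invalid
        result.append((valid, invalid, invalid / total if total else 0.0))
    return result


# Hypothetical bin edges and a handful of (p-value, is_valid) records.
EDGES = [0.0, 1e-8, 1e-5, 1e-3, 1.0]
SNPS = [(5e-9, True), (5e-9, False), (0.5, True), (1.0, False), (1e-4, True)]
for row in invalid_share_per_bin(SNPS, EDGES):
    print(row)
```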

A bar chart is generated for each bin of the stacked histogram plot and reports the number of issues that invalidate the SNP entries in a particular bin.

If the Linux system runs without a GUI, the report should be saved to the file system. To do so, run the command as follows:
SumStatsRehab diagnose --INPUT INPUT_GWAS_FILE --REPORT-DIR REPORT_DIR
where REPORT_DIR is a directory, existing or not, in which the generated report will be stored. When saved to disk, the report also includes a small table with the exact numbers of invalid fields and other issues in the GWAS SS file.
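When diagnose is called from a script that may run on both desktop and headless machines, the choice between the two forms above could be automated. The sketch below is one possible approach, not something SumStatsRehab does itself: diagnose_argv is a hypothetical helper, and checking the DISPLAY and WAYLAND_DISPLAY environment variables is only a heuristic for detecting a GUI session.

```python
import os
from typing import List, Mapping, Optional


def diagnose_argv(input_file: str,
                  report_dir: Optional[str] = None,
                  env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the argument list for the diagnose command.

    Without a report directory the plots open in a matplotlib window,
    so a display is required in that case.
    """
    argv = ["SumStatsRehab", "diagnose", "--INPUT", input_file]
    # Heuristic (assumption): a GUI session sets one of these variables.
    has_display = bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
    if report_dir is not None:
        argv += ["--REPORT-DIR", report_dir]
    elif not has_display:
        raise ValueError("no display detected; pass report_dir to save the report")
    return argv


# Placeholder file names, for illustration only.
print(diagnose_argv("path/to/gwas.tsv.gz", report_dir="path/to/report", env={}))
```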
Finally, a user may decide to run the fix command, as follows:
SumStatsRehab fix --INPUT INPUT_GWAS_FILE --OUTPUT OUTPUT_FILE
[--dbsnp-1 DBSNP1_FILE] [--dbsnp-2 DBSNP2_FILE]
[--chain-file CHAIN_FILE]
[--freq-db FREQ_DATABASE_SLUG]
[{--restore,--do-not-restore} {ChrBP,rsID,OA,EA,EAF,beta,SE,pval}+]
where:
- INPUT_GWAS_FILE is the input GWAS SS file with the corresponding .json config file created in step 4
- OUTPUT_FILE is the base name for the fixed file(s)
- DBSNP1_FILE is a path to the preprocessed dbSNP #1
- DBSNP2_FILE is a path to the preprocessed dbSNP #2
- CHAIN_FILE is a path to the chain file
- FREQ_DATABASE_SLUG is a slug of a frequency database contained in the dbSNP
example:
SumStatsRehab fix --INPUT "2955969
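For scripted use, the fix syntax above could be assembled and sanity-checked before the command is launched. The following is a minimal sketch under stated assumptions: fix_argv is a hypothetical helper, all paths are placeholders, and --restore and --do-not-restore are treated as mutually exclusive, which is one reading of the braces in the syntax rather than something the source states outright.

```python
from typing import List, Optional, Sequence

# Field names accepted after --restore / --do-not-restore, per the syntax above.
RESTORABLE = ("ChrBP", "rsID", "OA", "EA", "EAF", "beta", "SE", "pval")


def fix_argv(input_file: str, output_file: str,
             dbsnp1: Optional[str] = None, dbsnp2: Optional[str] = None,
             chain_file: Optional[str] = None, freq_db: Optional[str] = None,
             restore: Sequence[str] = (),
             do_not_restore: Sequence[str] = ()) -> List[str]:
    """Build the argument list for the fix command."""
    if restore and do_not_restore:
        # Assumption: the two options cannot be combined.
        raise ValueError("use either restore or do_not_restore, not both")
    unknown = [f for f in (*restore, *do_not_restore) if f not in RESTORABLE]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    argv = ["SumStatsRehab", "fix", "--INPUT", input_file, "--OUTPUT", output_file]
    optional = (("--dbsnp-1", dbsnp1), ("--dbsnp-2", dbsnp2),
                ("--chain-file", chain_file), ("--freq-db", freq_db))
    for flag, value in optional:
        if value is not None:
            argv += [flag, value]
    if restore:
        argv += ["--restore", *restore]
    if do_not_restore:
        argv += ["--do-not-restore", *do_not_restore]
    return argv


# Placeholder paths, for illustration only.
print(fix_argv("path/to/gwas.tsv.gz", "path/to/fixed",
               dbsnp1="path/to/dbsnp1", dbsnp2="path/to/dbsnp2",
               restore=["SE", "pval"]))
```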
