Cleansumstats

Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles.


Introduction

The cleansumstats pipeline takes a typical genomic sumstat file as input (normally the output from a GWAS), together with specifiers for chr, pos and the available stats.

Quick Start

To run a quick test using the provided example and test data, use either singularity or docker, depending on what is available on your system. Note that Singularity has been renamed Apptainer.

# Make sure git and either singularity or docker are installed
git --version
singularity --version
docker --version

# clone and enter the cleansumstats github project
git clone https://github.com/BioPsyk/cleansumstats.git
cd cleansumstats
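
The version checks above can be folded into a small helper that reports which container runtime is available. This is only a sketch: `detect_runtime` is a hypothetical name, not part of cleansumstats, and it also looks for an `apptainer` binary on the assumption that renamed installations expose that command.

```shell
# Print the first container runtime found on PATH, or "none".
# Hypothetical helper for illustration; not part of cleansumstats.
detect_runtime() {
  for candidate in singularity apptainer docker; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  echo "none"
}

echo "container runtime: $(detect_runtime)"
```

If the result is `none`, install one of the runtimes before continuing with the sections below.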

Singularity

Using singularity: pass the path to the image file with the -j flag.

## pull singularity image for AMD64/x86_64 systems (most common)
mkdir -p sif
singularity pull sif/ibp-cleansumstats-base_version-1.3.1.sif docker://biopsyk/ibp-cleansumstats:1.3.1-amd64

# clean a sumstat using reduced example data for dbsnp and 1kgp (-e flag)
./cleansumstats.sh \
  -j sif/ibp-cleansumstats-base_version-1.3.1.sif \
  -i tests/example_data/sumstat_1/sumstat_1_raw_meta.txt \
  -o out_example \
  -e 1

Docker

Using docker: pass the tag dockerhub_biopsyk with the -j flag.

## pull docker image for AMD64/x86_64 systems (most common)
docker pull biopsyk/ibp-cleansumstats:1.3.1-amd64

## clean a sumstat with the docker image (selected with the -j flag)
./cleansumstats.sh \
  -j dockerhub_biopsyk \
  -i tests/example_data/sumstat_1/sumstat_1_raw_meta.txt \
  -o out_example \
  -e 1

Note: For ARM64 systems (e.g., Apple Silicon Macs), append -arm64 to the version tag instead of -amd64. For example: 1.3.1-arm64.
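
The tag rule in the note above can be sketched as a small shell function that picks the suffix from the output of `uname -m`. The helper names `arch_suffix` and `image_tag` are hypothetical, and whether a given version is actually published for both architectures should be checked on Docker Hub.

```shell
# Map a machine architecture (as printed by `uname -m`) to the tag suffix.
arch_suffix() {
  case "$1" in
    x86_64|amd64) echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Build the full image tag from a version and an architecture.
image_tag() {
  suffix="$(arch_suffix "$2")" || return 1
  echo "$1-${suffix}"
}

# Print (not run) the pull command for the current machine.
version="1.3.1"
if tag="$(image_tag "$version" "$(uname -m)")"; then
  echo "docker pull biopsyk/ibp-cleansumstats:${tag}"
fi
```

The script only prints the pull command, so the tag can be inspected before anything is downloaded.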

Add full size reference data

During cleaning, all positions are compared to a reference to confirm the existing annotation or to add missing annotation.

dbsnp reference

The dbsnp reference only has to be prepared once, and can then be reused for all sumstats that need cleaning.

# i. Download the dbsnp reference and supplemental files (size: 25 GB)
mkdir -p dbsnp
wget -P dbsnp https://ftp.ncbi.nlm.nih.gov/snp/archive/b156/VCF/CHECKSUMS
wget -P dbsnp https://ftp.ncbi.nlm.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.md5
wget -P dbsnp https://ftp.ncbi.nlm.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.tbi
wget -P dbsnp https://ftp.ncbi.nlm.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz

# ii. If you are on an HPC, start an interactive session first (with the SLURM settings below, this step took about 5 h)
srun --mem=400g --ntasks 1 --cpus-per-task 60 --time=10:00:00 --account ibp_pipeline_cleansumstats --pty /bin/bash
./cleansumstats.sh \
  prepare-dbsnp \
  -i dbsnp/GCF_000001405.40.gz \
  -o out_dbsnp
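
Because the download in step i is large, it can be worth checking it against the downloaded .md5 file before starting the long prepare-dbsnp run. The sketch below is not part of cleansumstats: `verify_md5` is a hypothetical helper, it assumes GNU coreutils `md5sum` is installed, and it assumes the first whitespace-separated field of the .md5 file is the expected digest.

```shell
# Compare a file's MD5 digest with the digest stored in a checksum file.
# Assumes GNU `md5sum` and that the first field of the checksum file
# is the expected digest.
verify_md5() {
  file="$1"
  md5file="$2"
  expected="$(awk 'NR == 1 { print $1 }' "$md5file")"
  actual="$(md5sum "$file" | awk '{ print $1 }')"
  if [ "$expected" = "$actual" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (expected $expected, got $actual)" >&2
    return 1
  fi
}

# Only check when the reference has actually been downloaded.
if [ -f dbsnp/GCF_000001405.40.gz ] && [ -f dbsnp/GCF_000001405.40.gz.md5 ]; then
  verify_md5 dbsnp/GCF_000001405.40.gz dbsnp/GCF_000001405.40.gz.md5
fi
```

A mismatch usually means an interrupted download, in which case the file should be fetched again.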

1000 genomes project reference

# i. Download
mkdir -p 1kgp
wget -P 1kgp https://ftp.ensembl.org/pub/release-112/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
wget -P 1kgp https://ftp.ensembl.org/pub/release-112/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz.csi

# ii. If you are on an HPC, start an interactive session first (with the SLURM settings below, this step took about 5 min)
srun --mem=80g --ntasks 1 --cpus-per-task 5 --time=1:00:00 --account ibp_pipeline_cleansumstats --pty /bin/bash
./cleansumstats.sh \
  prepare-1kgp \
  -i 1kgp/1000GENOMES-phase_3.vcf.gz \
  -d out_dbsnp \
  -o out_1kgp

Prepare meta data files

After the reference data (dbsnp and 1000 genomes) has been created, it is time to prepare the input for the actual cleaning. This input is called the meta file. It contains paths to other important files, such as the actual sumstats, README, article pdf, etc., all of which need to be in the same folder as their corresponding metafile. The metafile has to be filled in manually; see tests/example_data/sumstat_1/sumstat_1_raw_meta.txt for an example of what it looks like.

You can also use this web interface to generate a metadata file. Again, remember that all files referred to by the metadata file have to be in the same directory as the metafile when you run cleansumstats. Check tests/example_data and sumstats 1-5 for an example of how you can structure your input folders.

Relative links are not supported in the metadata file, which means all files have to be in the same folder. However, you can provide additional paths to associated files with the -p flag: -p path/to/folder1,path/to/folder2
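
The same-folder rule can be checked before starting a run. The sketch below uses a hypothetical helper, `check_same_dir`, that is not part of cleansumstats. It takes the file names as explicit arguments instead of parsing them from the metafile, so it makes no assumption about the metafile format.

```shell
# Check that every named file sits in the same directory as the metafile.
# Hypothetical helper: file names are passed explicitly rather than
# parsed from the metafile.
check_same_dir() {
  metafile="$1"
  shift
  metadir="$(dirname "$metafile")"
  missing=0
  for name in "$@"; do
    if [ ! -f "$metadir/$name" ]; then
      echo "missing next to metafile: $name" >&2
      missing=$((missing + 1))
    fi
  done
  # Succeed only when no file was missing.
  [ "$missing" -eq 0 ]
}

# Usage (placeholder file names):
#   check_same_dir path/to/my_meta.txt my_sumstats.txt.gz my_readme.txt
```

The function returns a non-zero status and lists each missing file on stderr, so it can gate a later call to cleansumstats.sh.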

Run a fully operational cleaning pipeline

This run takes longer than the quick-start run, because variants are now mapped against the full dbsnp reference of more than 600 million rows.

Once you have prepared your meta data files, replace the example data passed to -i with your own data.

# i. If you are on an HPC, start an interactive session first (with the SLURM settings below, this step took about 10 min)
srun --mem=40g --ntasks 1 --cpus-per-task 6 --time=1:00:00 --account ibp_pipeline_cleansumstats --pty /bin/bash
./cleansumstats.sh \
  -i tests/example_data/sumstat_1/sumstat_1_raw_meta.txt \
  -d out_dbsnp \
  -k out_1kgp \
  -o out_clean

# For additional flags, see:
./cleansumstats.sh -h
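
When several sumstats need cleaning, the same command can be generated once per metafile. The sketch below is a dry run that only prints the commands. It reuses the -i, -d, -k and -o flags shown above. The helper name `clean_cmd` and the output folder convention out_clean_<folder> are hypothetical, and it assumes that every example folder contains a *_raw_meta.txt file in the same way sumstat_1 does.

```shell
# Print the cleaning command for one metafile (dry run; nothing is executed).
# The output directory name out_clean_<folder> is a hypothetical convention.
clean_cmd() {
  meta="$1"
  name="$(basename "$(dirname "$meta")")"
  echo "./cleansumstats.sh -i $meta -d out_dbsnp -k out_1kgp -o out_clean_$name"
}

# Assumes each example folder holds one *_raw_meta.txt file, as sumstat_1 does.
for meta in tests/example_data/sumstat_*/*_raw_meta.txt; do
  [ -f "$meta" ] || continue   # skip the unexpanded glob when nothing matches
  clean_cmd "$meta"
done
```

Printing the commands first makes it easy to review them, or to paste them into a job script, before any long-running job is started.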

Credits

cleansumstats was originally written by Jesper R. Gådin.


