SparsePainter
SparsePainter: fast, accurate and fine-scale chromosome painting software based on PBWT and HashMap
SparsePainter is an efficient tool for local ancestry inference (LAI) written in C++. It extends the PBWT algorithm to find the K longest matches at each position, and uses a hash map structure to implement the forward and backward algorithms of the Hidden Markov Model (HMM), leveraging the sparsity of haplotype matches. SparsePainter can infer fine-scale local ancestry (per individual per SNP) and genome-wide total ancestry. It also enables efficient calculation of Linkage Disequilibrium of Ancestry (LDA), LDA score (LDAS) and Ancestry Anomaly Score (AAS) for understanding population structure, evolution, selection, etc.
SparsePainter also produces the output required to run GLOBETROTTER, fastGLOBETROTTER and SOURCEFIND (example code to run SOURCEFIND based on SparsePainter output is available).
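To make the idea of "K longest matches at each position" concrete, the sketch below finds them by brute force for a toy target haplotype. It is not the PBWT-based search that SparsePainter uses and would be far too slow for real data. The function name `k_longest_matches`, the toy haplotypes, and the convention that a match is a run of identical alleles ending at the given SNP are all assumptions made for illustration.

```python
def k_longest_matches(target, refs, k):
    """Brute-force illustration (NOT PBWT): for each SNP index, return up to k
    (match_length, reference_index) pairs, longest first.

    A "match" here is assumed to be a run of identical alleles between the
    target and one reference haplotype that ends at the current SNP.
    """
    run = [0] * len(refs)  # current run length against each reference
    matches_per_snp = []
    for pos, allele in enumerate(target):
        for r, ref in enumerate(refs):
            # Extend the run on a matching allele, otherwise reset it.
            run[r] = run[r] + 1 if ref[pos] == allele else 0
        candidates = [(length, r) for r, length in enumerate(run) if length > 0]
        # Longest first; ties broken by reference index for determinism.
        candidates.sort(key=lambda item: (-item[0], item[1]))
        matches_per_snp.append(candidates[:k])
    return matches_per_snp


# Hypothetical toy data: one target haplotype and three reference haplotypes.
toy_target = [0, 1, 1, 0]
toy_refs = [
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
for snp, matches in enumerate(k_longest_matches(toy_target, toy_refs, k=2)):
    print(snp, matches)
```

Only a handful of reference haplotypes survive at each SNP, which is the sparsity that the hash-map-based HMM is described as exploiting.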
-
Authors:
Yaoling Yang (yaolingyang1998@gmail.com)
Daniel Lawson (dan.lawson@bristol.ac.uk)
-
Maintainer:
Yaoling Yang (yaolingyang1998@gmail.com)
-
SparsePainter website: https://sparsepainter.github.io/
-
Version: 1.3.2 (Changelog)
-
SparsePainter and PBWTpaint Reference: Yang, Y., Durbin, R., Iversen, A.K.N. & Lawson, D.J. Sparse haplotype-based fine-scale local ancestry inference at scale reveals recent selection on immune responses. Nature Communications 16, 2742 (2025).
-
A pipeline for biobank-scale painting and computing haplotype components (HCs) is available.
-
LDA, LDA score and AAS Reference: Barrie, W., Yang, Y., Irving-Pease, E.K. et al. Elevated genetic risk for multiple sclerosis emerged in steppe pastoralist populations. Nature 625, 321–328 (2024)
-
PBWTpaint GitHub Repository: https://github.com/richarddurbin/pbwt
-
A use case of HCs: Yang, Y., Lawson, D.J. From individuals to ancestries: towards attributing trait variation to haplotypes. medRxiv (2025), doi: https://doi.org/10.1101/2025.03.13.25323895.
-
Overview of SparsePainter and PBWTpaint

Installation
To install SparsePainter, follow the steps below.
git clone git@github.com:YaolingYang/SparsePainter.git
cd SparsePainter
make
To update to a newer version of SparsePainter, you can remove lines 10-12 of the Makefile, since Armadillo was already installed during your initial installation.
Dependencies
SparsePainter requires g++ >=6 and depends on
Armadillo-v12.6.5 to compute AAS;
gzstream-v1.5 to read and write gzipped files.
Usage
SparsePainter supports input in either variant call format (VCF) or phase format. Both files must be phased and contain no missing data. Reading phase format is slightly faster than reading VCF. To prepare phase format input for SparsePainter, install PBWT, which converts VCF to phase format with the following command:
pbwt -readVcfGT XXX.vcf -writePhase XXX.phase
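Since the input must be phased and free of missing data, it can help to check a VCF before converting or painting it. The sketch below is one minimal way to do that, assuming an uncompressed, tab-delimited VCF whose FORMAT column starts with `GT`; the function name `find_unphased_or_missing` and the toy records are hypothetical and are not part of SparsePainter or PBWT.

```python
def find_unphased_or_missing(vcf_lines):
    """Return a list of (chrom, pos, sample_index, problem) tuples.

    Assumes standard VCF text: '#' header lines, tab-separated fields,
    FORMAT in column 9 and one column per sample after that.
    """
    problems = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, fmt = fields[0], fields[1], fields[8]
        if fmt.split(":")[0] != "GT":
            problems.append((chrom, pos, None, "no GT field"))
            continue
        for sample_index, sample in enumerate(fields[9:]):
            genotype = sample.split(":")[0]
            if "." in genotype:
                problems.append((chrom, pos, sample_index, "missing"))
            elif "/" in genotype:
                # '/' separates unphased alleles; '|' separates phased ones.
                problems.append((chrom, pos, sample_index, "unphased"))
    return problems


# Hypothetical toy records: the second site has one unphased and one missing call.
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n",
    "1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0/1\t.|1\n",
]
for problem in find_unphased_or_missing(toy_vcf):
    print(problem)
```

For a real file, the same function can be fed an open file handle (for example from `gzip.open(path, "rt")` for a gzipped VCF); an empty result means no unphased or missing genotypes were found.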
To run SparsePainter, enter the following command:
./SparsePainter [-command1 -command2 ...... -command3 parameter3 -command4 parameter4 ......]
Commands
Required Commands
SparsePainter has the following 6 required commands, together with additional commands that specify the desired output.
-
-reffile [file] Reference vcf (including gzipped vcf), or phase (including gzipped phase) file that contains the (phased non-missing) genotype data for the reference samples.
-
-targetfile [file] VCF (including gzipped VCF) or phase (including gzipped phase) file that contains the (phased non-missing) genotype data for the target samples. To paint reference samples against themselves, set targetfile to be the same as reffile. The file type of targetfile and reffile should be the same.
-
-mapfile [file] Genetic map file that contains two columns with headers. The first column is the SNP position (in base pairs) and the second column is the genetic distance of each SNP (in centiMorgans). The SNPs must be the same, and in the same order, as those in reffile and targetfile.
-
-popfile [file] Population file of reference individuals that contains two columns without headers. The first column is the names of all the reference samples (must be in the same order as in reffile). The second column is the population labels of the reference samples, which can be either strings or numbers.
-
-namefile [file] Name file that contains the names of the samples to be painted, in the same order as they appear in targetfile.
-
-out [string] Prefix of the output file names (default=SparsePainter).
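As a rough illustration of the three text inputs described above, the sketch below writes toy map, population and name files and assembles a command line from the required flags. The space delimiter, the header names `position` and `cM`, all sample and file names, and the helper functions `write_toy_inputs` and `build_command` are assumptions for illustration only; the command is built but not executed.

```python
import os
import tempfile


def write_toy_inputs(directory):
    """Write toy mapfile, popfile and namefile; return their paths in a dict."""
    paths = {
        kind: os.path.join(directory, f"toy_{kind}.txt")
        for kind in ("map", "pop", "name")
    }
    # mapfile: two columns WITH a header line (header names are placeholders):
    # SNP position in base pairs, genetic distance in centiMorgans.
    with open(paths["map"], "w") as handle:
        handle.write("position cM\n")
        for position, centimorgan in [(1000, 0.0), (2500, 0.002), (4000, 0.005)]:
            handle.write(f"{position} {centimorgan}\n")
    # popfile: two columns WITHOUT a header: reference sample name, population label.
    with open(paths["pop"], "w") as handle:
        for sample, population in [("ref1", "popA"), ("ref2", "popA"), ("ref3", "popB")]:
            handle.write(f"{sample} {population}\n")
    # namefile: one target sample name per line, in targetfile order.
    with open(paths["name"], "w") as handle:
        for sample in ["target1", "target2"]:
            handle.write(sample + "\n")
    return paths


def build_command(paths, reffile, targetfile, out_prefix):
    """Assemble the argument list; -prob is one of the documented output commands."""
    return [
        "./SparsePainter",
        "-reffile", reffile,
        "-targetfile", targetfile,
        "-mapfile", paths["map"],
        "-popfile", paths["pop"],
        "-namefile", paths["name"],
        "-out", out_prefix,
        "-prob",
    ]


with tempfile.TemporaryDirectory() as tmp:
    toy_paths = write_toy_inputs(tmp)
    # ref.vcf.gz and target.vcf.gz are hypothetical file names.
    print(" ".join(build_command(toy_paths, "ref.vcf.gz", "target.vcf.gz", "toy")))
```

With real input files in place, the resulting list could be passed to `subprocess.run(command, check=True)` to launch the painting.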
At least one of the following commands must also be given in order to run SparsePainter:
-
-prob Output the local ancestry probabilities for each target sample at each SNP. The output is a gzipped text file (.txt.gz) with format specified in -probstore.
-
-chunklength Output the expected length (in centiMorgan) of copied chunks of each local ancestry for each target sample. The output is a gzipped text file (.txt.gz).
-
-chunkcount Output the expected number of copied chunks of each local ancestry for each target sample. The output is a gzipped text file (.txt.gz).
-
-sample Output the sampled reference haplotypes' indices for each target sample at each SNP. The output is a gzipped text file (.txt.gz) in the same format as the .samples.out file of ChromoPainter, and is the required input file to run GLOBETROTTER and fastGLOBETROTTER.
-
-aveSNP Output the average local ancestry probabilities for each SNP. The output is a text file (.txt).
-
-aveind Output the average local ancestry probabilities for each target individual. The output is a text file (.txt).
-
-LDA Output the Linkage Disequilibrium of Ancestry (LDA) of each pair of SNPs. The output is a gzipped text file (.txt.gz). It might be slow: the computational time is proportional to the number of local ancestries and the density of SNPs in the chromosome.
-
-LDAS Output the Linkage Disequilibrium of Ancestry Score (LDAS) of each SNP. The output is a text file (.txt), including the LDAS and its lower and upper bound, which can be used for quality control. It might be slow: the computational time is proportional to the number of local ancestries and the density of SNPs in the genome.
-
-AAS Output the test statistic of the Ancestry Anomaly Score (AAS) of each SNP. The output is a text file (.txt). The AAS test statistic follows a chi-squared distribution with K degrees of freedom under the null, where K is the number of reference populations.
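Because the AAS test statistic is described as chi-squared with K degrees of freedom under the null, a p-value can be taken from the upper tail of that distribution. The sketch below does this with the Python standard library only, using the standard recurrence for the regularized upper incomplete gamma function at integer and half-integer arguments. The function name `chi2_sf` and the example numbers are hypothetical, and the code does not read SparsePainter's output file, whose column layout is not assumed here.

```python
import math


def chi2_sf(statistic, dof):
    """Upper-tail probability P(X >= statistic) for X ~ chi-squared(dof).

    Uses Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), starting from
    Q(1, y) = exp(-y) for even dof or Q(1/2, y) = erfc(sqrt(y)) for odd dof.
    """
    if dof < 1 or int(dof) != dof:
        raise ValueError("dof must be a positive integer")
    if statistic < 0:
        raise ValueError("statistic must be non-negative")
    y = statistic / 2.0
    if dof % 2 == 0:
        a, tail = 1.0, math.exp(-y)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(y))
    while a < dof / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# Hypothetical example: an AAS statistic of 25.0 at one SNP with K = 5
# reference populations.
p_value = chi2_sf(25.0, 5)
print(f"p-value = {p_value:.3g}")
```

If SciPy is available, `scipy.stats.chi2.sf(statistic, dof)` computes the same tail probability. When many SNPs are tested, the resulting p-values still need a multiple-testing correction of your choice.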
Optional Commands
Commands without values
-
-haploid The individuals are haploid.
-
-diff_lambda Use different recombination scaling constants for each target sample. If this command is not given, the fixed lambda will be output in a text file (.txt) for future reference.
-
-loo Paint with leave-one-out strategy: one individual is left out of each population (self from own population). If -loo is not specified under reference-vs-reference painting (reffile=targetfile), each individual will be automatically left out of painting. For accuracy, please do not use this command if any of the reference populations has very few (e.g. <=5) samples.
-
-rmrelative Leave out the reference sample that is the most related to the target sample under leave-one-out mode (-loo), if they share at least relafrac proportion of SNPs of a continuous segment. Please do not use this command for reference-vs-reference painting.
-
-outmatch Output the number of matches at each SNP for each target haplotype. The output file format is a gzipped text file (.txt.gz).
Commands with values
-
-ncores [integer≥0] The number of CPU cores used for the analysis (default=0). The default ncores uses all the available CPU cores of your device.
-
-fixlambda [number≥0] The value of the fixed recombination scaling constant (default=0). SparsePainter will estimate lambda as the average recombination scaling constant of indfrac target samples under the default fixlambda and diff_lambda.
-
-nmatch [integer≥1] The number of haplotype matches of at least Lmin SNPs that SparsePainter searches for (default=10). Positions with more than nmatch matches of at least Lmin SNPs will retain at least the longest nmatch matches. A larger nmatch slightly improves accuracy but significantly increases the computational time.
-
-L0 [integer>0] The initial length of matches (the number of SNPs) that SparsePainter searches for (default=320). L0 must be bigger than Lmin and
