SparsePainter

SparsePainter: fast, accurate and fine-scale chromosome painting software based on PBWT and HashMap

SparsePainter is an efficient tool for local ancestry inference (LAI) coded in C++. It extends the PBWT algorithm to find the K longest matches at each position, and uses a hash map to implement the forward and backward algorithms of the Hidden Markov Model (HMM), exploiting the sparsity of haplotype matches. SparsePainter can infer fine-scale local ancestry (per individual per SNP) and genome-wide total ancestry. It also efficiently calculates Linkage Disequilibrium of Ancestry (LDA), LDA score (LDAS) and Ancestry Anomaly Score (AAS), which help in understanding population structure, evolution and selection.

SparsePainter also produces the output required to run GLOBETROTTER, fastGLOBETROTTER and SOURCEFIND (example code to run SOURCEFIND based on SparsePainter output is available).

Installation

To install SparsePainter, run the following commands:
git clone git@github.com:YaolingYang/SparsePainter.git
cd SparsePainter
make

To update to a newer version of SparsePainter, you can remove lines 10-12 of Makefile, since Armadillo was already installed during your initial installation.

Dependencies

SparsePainter requires g++ >=6 and depends on
Armadillo-v12.6.5 to compute AAS;
gzstream-v1.5 to read and write gzipped files.

Usage

SparsePainter supports input in either Variant Call Format (VCF) or phase format. Input files must be phased and contain no missing data. Reading phase format is slightly faster than reading VCF. To prepare phase-format input, install PBWT, which converts VCF to phase format with the following command:

pbwt -readVcfGT XXX.vcf -writePhase XXX.phase
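
Because the input must be phased and free of missing data, it can help to screen a VCF before converting or painting it. The following is a minimal sketch and not part of SparsePainter or PBWT: it assumes an uncompressed, diploid VCF in which GT is the first FORMAT key, and the file toy.vcf and its records are made up for illustration.

```shell
# Toy VCF (hypothetical data); the second record is deliberately bad.
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n' >> toy.vcf
printf '1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n' >> toy.vcf
printf '1\t2000\trs2\tC\tT\t.\tPASS\t.\tGT\t0/1\t.|.\n' >> toy.vcf

# Count diploid genotypes that are not of the form allele|allele,
# i.e. unphased ("/") or missing (".") calls. Assumes GT is the first subfield.
bad=$(awk -F'\t' '!/^#/ {
        for (i = 10; i <= NF; i++) {
            split($i, f, ":")
            if (f[1] !~ /^[0-9]+\|[0-9]+$/) n++
        }
      }
      END { print n + 0 }' toy.vcf)

echo "unphased or missing genotypes: $bad"
```

A count of zero suggests the file meets the phasing and missing-data requirement under the stated assumptions. Haploid data would need a different pattern.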

To run SparsePainter, enter the following command:

./SparsePainter [-command1 -command2 ...... -command3 parameter3 -command4 parameter4 ......]

Commands

Required Commands

SparsePainter has the following 6 required commands, plus additional commands that specify the desired output.

  • -reffile [file] Reference vcf (including gzipped vcf), or phase (including gzipped phase) file that contains the (phased non-missing) genotype data for the reference samples.

  • -targetfile [file] Target vcf (including gzipped vcf), or phase (including gzipped phase) file that contains the (phased non-missing) genotype data for the target samples. To paint reference samples against themselves, set targetfile to be the same as reffile. The file type of targetfile and reffile should be the same.

  • -mapfile [file] Genetic map file that contains two columns with headers. The first column is the SNP position (in base pairs) and the second column is the genetic distance of each SNP (in centiMorgan). The SNPs must be the same as, and in the same order as, those in reffile and targetfile.

  • -popfile [file] Population file of reference individuals that contains two columns without headers. The first column is the names of all the reference samples (must be in the same order as reffile). The second column is the population labels of the reference samples, which can be either strings or numbers.

  • -namefile [file] Name file that contains the names of samples to be painted, following the same order as they appear in targetfile.

  • -out [string] Prefix of the output file names (default=SparsePainter).
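
The file layouts described above can be sketched with toy data. This is an illustration only: the sample names, population labels, header names and values are made up, and the whitespace delimiter, the one-name-per-line namefile layout and the check that positions strictly increase are assumptions rather than documented SparsePainter behaviour.

```shell
# mapfile: header line, then SNP position (base pairs) and genetic distance (cM).
printf 'position genetic_distance\n1000 0.0010\n2000 0.0025\n3500 0.0041\n' > toy.map

# popfile: no header; reference sample name, then population label.
printf 'ref1 popA\nref2 popA\nref3 popB\nref4 popB\n' > toy.pop

# namefile: target sample names in targetfile order (one per line is assumed).
printf 'target1\ntarget2\n' > toy.names

# Count map rows that lack two columns or whose position does not increase.
map_errors=$(awk 'NR > 1 { if (NF != 2 || $1 + 0 <= prev) e++; prev = $1 + 0 }
                  END { print e + 0 }' toy.map)

# Count popfile rows that do not have exactly two columns.
pop_errors=$(awk 'NF != 2 { e++ } END { print e + 0 }' toy.pop)

echo "mapfile problems: $map_errors, popfile problems: $pop_errors"
```

Simple checks like these catch column-count and ordering mistakes early. They do not verify that the SNPs match those in reffile and targetfile.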

At least one of the following commands must also be given in order to run SparsePainter:

  • -prob Output the local ancestry probabilities for each target sample at each SNP. The output is a gzipped text file (.txt.gz) with format specified in -probstore.

  • -chunklength Output the expected length (in centiMorgan) of copied chunks of each local ancestry for each target sample. The output is a gzipped text file (.txt.gz).

  • -chunkcount Output the expected number of copied chunks of each local ancestry for each target sample. The output is a gzipped text file (.txt.gz).

  • -sample Output the sampled reference haplotypes' indices for each target sample at each SNP. The output is a gzipped text file (.txt.gz), which is the same format as the .samples.out file of ChromoPainter, and is the required input file to run GLOBETROTTER and fastGLOBETROTTER.

  • -aveSNP Output the average local ancestry probabilities for each SNP. The output is a text file (.txt).

  • -aveind Output the average local ancestry probabilities for each target individual. The output is a text file (.txt).

  • -LDA Output the Linkage Disequilibrium of Ancestry (LDA) of each pair of SNPs. The output is a gzipped text file (.txt.gz). It might be slow: the computational time is proportional to the number of local ancestries and the density of SNPs in the chromosome.

  • -LDAS Output the Linkage Disequilibrium of Ancestry Score (LDAS) of each SNP. The output is a text file (.txt), including the LDAS and its lower and upper bound, which can be used for quality control. It might be slow: the computational time is proportional to the number of local ancestries and the density of SNPs in the genome.

  • -AAS Output the test statistic of Ancestry Anomaly Score (AAS) of each SNP. The output is a text file (.txt). The AAS test statistic follows a chi-squared distribution with K degrees of freedom under the null, where K is the number of reference populations.
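
Putting the six required commands together with two of the output commands, a run might look like the sketch below. All file names and the output prefix are hypothetical placeholders. The flags are only those listed above.

```shell
# Hypothetical file names; substitute your own inputs.
# Six required commands followed by two output commands (-prob, -chunklength).
set -- -reffile ref.vcf.gz -targetfile target.vcf.gz \
       -mapfile chr1.map -popfile ref.pop -namefile target.names \
       -out chr1_paint -prob -chunklength

# Run only when the binary has been built in the current directory;
# otherwise just print the command that would be executed.
if [ -x ./SparsePainter ]; then
    ./SparsePainter "$@"
else
    echo "would run: ./SparsePainter $*"
fi
```

Building the argument list first keeps the required and output commands visible in one place, and further output commands can be appended to the same list.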

Optional Commands

Commands without values

  • -haploid The individuals are haploid.

  • -diff_lambda Use different recombination scaling constants for each target sample. If this command is not given, the fixed lambda will be output in a text file (.txt) for future reference.

  • -loo Paint with leave-one-out strategy: one individual is left out of each population (self from own population). If -loo is not specified under reference-vs-reference painting (reffile=targetfile), each individual will be automatically left out of painting. For accuracy, please do not use this command if any of the reference populations has very few (e.g. <=5) samples.

  • -rmrelative Leave out the reference sample that is the most related to the target sample under leave-one-out mode (-loo), if they share at least relafrac proportion of SNPs of a continuous segment. Please do not use this command for reference-vs-reference painting.

  • -outmatch Output the number of matches at each SNP for each target haplotype. The output file format is a gzipped text file (.txt.gz).

Commands with values

  • -ncores [integer≥0] The number of CPU cores used for the analysis (default=0). With the default value of 0, SparsePainter uses all available CPU cores on your device.

  • -fixlambda [number≥0] The value of the fixed recombination scaling constant (default=0). SparsePainter will estimate lambda as the average recombination scaling constant of indfrac target samples under the default fixlambda and diff_lambda.

  • -nmatch [integer≥1] The number of haplotype matches of at least Lmin SNPs that SparsePainter searches for (default=10). Positions with more than nmatch matches of at least Lmin SNPs will retain at least the longest nmatch matches. A larger nmatch slightly improves accuracy but significantly increases the computational time.

  • -L0 [integer>0] The initial length of matches (the number of SNPs) that SparsePainter searches for (default=320). L0 must be bigger than Lmin and
