popscle

popscle is a suite of population-scale analysis tools for single-cell genomics data. The key software tools in this repository include demuxlet (version 2) and freemuxlet, a genotyping-free method to deconvolute barcoded cells by their identities while detecting doublets.

Quick Overview

With popscle, we recommend analyzing single-cell RNA-seq (and other single-cell genomic) datasets in two steps.

Use dsc-pileup to generate pileups around known variants from aligned sequence reads.
Use demuxlet (with genotypes) or freemuxlet (without genotypes) to deconvolute the identities of barcoded cells.
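The two steps above can be sketched as a small shell function that prints the commands it would run instead of executing them. The function name popscle_plan, its argument order, and the output suffixes .demux and .free are illustrative assumptions; the flags are the ones listed in the usage sections of this README.

```shell
# Hypothetical helper: print the two popscle commands for one pooled run.
# Nothing is executed here; review the output before running it.
popscle_plan() {
  local bam="$1" site_vcf="$2" prefix="$3" mode="$4" extra="$5"
  # Step 1: pile up reads around known variant sites
  echo "popscle dsc-pileup --sam $bam --vcf $site_vcf --out $prefix"
  # Step 2: demuxlet takes a genotype VCF, freemuxlet takes the number of samples
  case "$mode" in
    demuxlet)   echo "popscle demuxlet --plp $prefix --vcf $extra --field GT --out $prefix.demux" ;;
    freemuxlet) echo "popscle freemuxlet --plp $prefix --out $prefix.free --nsample $extra" ;;
    *)          echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

For example, `popscle_plan pooled.bam sites.vcf run1 freemuxlet 8` prints a dsc-pileup command followed by a freemuxlet command that reuses the same pileup prefix.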

Read the tutorial at https://github.com/statgen/popscle/wiki if you would like to learn how to run the software tools in popscle by example.

Read the documentation below for a comprehensive description of these tools.

Introduction

Overview

demuxlet and freemuxlet are two software tools that deconvolute sample identity and identify multiplets when multiple samples are pooled in barcoded single-cell sequencing. If external genotyping data for each sample is available (e.g. from SNP arrays), demuxlet is recommended. If external genotyping data is not available, freemuxlet, the genotyping-free version of demuxlet, is recommended. Even with freemuxlet, you still need a list of variant sites (in VCF) in order to generate pileups.

Run dsc-pileup before running demuxlet or freemuxlet. dsc-pileup is a software tool that piles up reads and their base qualities for each overlapping SNP and each barcode. Working from pileup files lets you run demuxlet/freemuxlet multiple times quickly without going over the BAM file again.

dsc-pileup requires the following input files:

A SAM/BAM/CRAM file produced by the standard 10x sequencing platform, or by any other barcoded single-cell RNA-seq protocol (with proper --tag-UMI and --tag-group options).
A VCF/BCF file containing the allele count (AC) and allele number (AN) fields from a reference population (e.g. 1000g).
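Before running dsc-pileup, it can help to confirm that the site VCF declares both fields. The following is a minimal sketch, assuming an uncompressed text VCF; the function name vcf_has_ac_an is hypothetical, and the check only looks at the header declarations, not at whether every record carries values.

```shell
# Hypothetical pre-flight check: succeed only if the VCF header declares
# both AC and AN as INFO fields. Works on text VCF only; decompress a
# .vcf.gz first (for example with `gzip -dc`). BCF is binary and not handled.
vcf_has_ac_an() {
  local vcf="$1"
  grep -q '^##INFO=<ID=AC,' "$vcf" && grep -q '^##INFO=<ID=AN,' "$vcf"
}
```

A typical use would be `vcf_has_ac_an sites.vcf || echo "AC/AN missing"`, where sites.vcf is a placeholder file name.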

demuxlet requires the following input files:

Pileup files (CEL, VAR and PLP) produced by dsc-pileup.
A VCF/BCF file containing the genotype (GT), posterior probability (GP), or genotype likelihood (PL), used to assign each barcode to a specific sample (or a pair of samples) in the VCF file.

Alternatively, demuxlet can take a SAM/BAM/CRAM file directly, without running dsc-pileup. In this case, demuxlet requires the following files:

A SAM/BAM/CRAM file produced by the standard 10x sequencing platform, or by any other barcoded single-cell RNA-seq protocol (with proper --tag-UMI and --tag-group options).
A VCF/BCF file containing the genotype (GT), posterior probability (GP), or genotype likelihood (PL), used to assign each barcode to a specific sample (or a pair of samples) in the VCF file.
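Both input routes depend on the reads carrying the barcode and UMI tags named by --tag-group and --tag-UMI. As a rough sanity check, the sketch below counts how many alignment records in SAM text carry both tags. The function name count_tagged_reads is hypothetical, and the check relies only on the standard SAM layout (optional TAG:TYPE:VALUE fields from column 12 onward).

```shell
# Hypothetical check: read SAM text on stdin and report, as tagged/total,
# how many alignment records carry both the barcode tag and the UMI tag.
# Defaults match the 10x tags named in this README (CB and UB).
count_tagged_reads() {
  local group_tag="${1:-CB}" umi_tag="${2:-UB}"
  awk -F '\t' -v g="$group_tag" -v u="$umi_tag" '
    /^@/ { next }                      # skip SAM header lines
    {
      total++
      has_g = 0; has_u = 0
      # optional tags start at column 12 and look like TAG:TYPE:VALUE
      for (i = 12; i <= NF; i++) {
        if (index($i, g ":") == 1) has_g = 1
        if (index($i, u ":") == 1) has_u = 1
      }
      if (has_g && has_u) tagged++
    }
    END { printf "%d/%d\n", tagged, total }
  '
}
```

With samtools installed, something like `samtools view pooled.bam | head -n 100000 | count_tagged_reads CB UB` would give a quick estimate; pooled.bam is a placeholder. A low fraction suggests the tag names need adjusting.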

freemuxlet requires the following input:

Pileup files (CEL, PLP and VAR) from dsc-pileup
Number of samples
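Since freemuxlet and demuxlet both read the three pileup files from a shared prefix, a quick existence check can catch an incomplete dsc-pileup run. This sketch assumes the files are named [prefix].cel.gz, [prefix].var.gz and [prefix].plp.gz, by analogy with the [prefix].umi.gz file mentioned in the dsc-pileup options; verify the names against your own output. The function name pileup_complete is hypothetical.

```shell
# Hypothetical check: succeed only if all three pileup files exist and are
# non-empty. The .cel.gz/.var.gz/.plp.gz naming is an assumption.
pileup_complete() {
  local prefix="$1" ext
  for ext in cel var plp; do
    [ -s "$prefix.$ext.gz" ] || { echo "missing: $prefix.$ext.gz" >&2; return 1; }
  done
}
```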

Tips for running

If an external genotype VCF file for the pooled samples is available, demuxlet is recommended.
Set alpha to 0.5, which assumes an expected 50% genetic mixture from two individuals, to get better estimates of doublets.
Set --group-list in dsc-pileup to a list of barcodes (e.g. barcodes.tsv from 10x) to speed things up and to demultiplex only the cells called by other methods.
To reproduce the results presented in Figure 2 of the demuxlet paper, please use the original version of demuxlet, with the data downloadable at https://github.com/yelabucsf/demuxlet_paper_code/tree/master/fig2 . If you want to learn how to perform similar analysis with popscle, please go to https://github.com/statgen/popscle/wiki .
Check the tutorial README.md for a more detailed walkthrough with example data.
If you run popscle in Docker, use the command line docker run <imagename> "<popscle-arguments>" (e.g. docker run popscle "freemuxlet") to run tasks.
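For the --group-list tip above, a barcode list can be prepared from a 10x-style barcodes.tsv with a few lines of shell. This is a minimal sketch: the function name make_group_list and the file names are illustrative, and it assumes an uncompressed, tab-separated file with the barcode in the first column.

```shell
# Hypothetical helper: write a clean barcode list (one barcode per line)
# from a 10x-style barcodes.tsv, dropping blank lines and duplicates.
make_group_list() {
  local barcodes_tsv="$1" out="$2"
  # keep the first tab-separated column, skip empty lines,
  # and print each barcode only the first time it is seen
  awk -F '\t' '$1 != "" && !seen[$1]++ { print $1 }' "$barcodes_tsv" > "$out"
}
```

The resulting file can then be passed to dsc-pileup via --group-list.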

Installing demuxlet/freemuxlet

<pre>
$ mkdir build
$ cd build
$ cmake ..
</pre>

In case any required library is missing, you may specify a custom installation path by replacing "cmake .." with:

<pre>
For libhts:
  - $ cmake -DHTS_INCLUDE_DIRS=/hts_absolute_path/include/ -DHTS_LIBRARIES=/hts_absolute_path/lib/libhts.a ..

For bzip2:
  - $ cmake -DBZIP2_INCLUDE_DIRS=/bzip2_absolute_path/include/ -DBZIP2_LIBRARIES=/bzip2_absolute_path/lib/libbz2.a ..

For lzma:
  - $ cmake -DLZMA_INCLUDE_DIRS=/lzma_absolute_path/include/ -DLZMA_LIBRARIES=/lzma_absolute_path/lib/liblzma.a ..
</pre>

Finally, to build the binary, run make from the build directory.

Using demuxlet and freemuxlet

All tools use a self-documentation utility. You can run each utility with the -man or -help option to see its command-line usage. We also offer some general practice with an example in the tutorial (data is available here: https://drive.google.com/drive/folders/1wfnn132vMbZhicpWOZVbR_36YpIiojug?usp=sharing).

demuxlet

<pre>
$(POPSCLE_HOME)/bin/popscle dsc-pileup --sam /data/$bam --vcf /data/$ref_vcf --out /data/$pileup
$(POPSCLE_HOME)/bin/popscle demuxlet --plp /data/$pileup --vcf /data/$external_vcf --field $(GT or GP or PL) --out /data/$filename
</pre>

Or, demuxlet can take a SAM file directly as input:

<pre>
$(POPSCLE_HOME)/bin/popscle demuxlet --sam /data/$sam --vcf /data/$external_vcf --field $(GT or GP or PL) --out /data/$filename
</pre>

freemuxlet

<pre>
$(POPSCLE_HOME)/bin/popscle dsc-pileup --sam /data/$bam --vcf /data/$ref_vcf --out /data/$pileup
$(POPSCLE_HOME)/bin/popscle freemuxlet --plp /data/$pileup --out /data/$filename --nsample $n
</pre>

The detailed usage is also pasted below.

dsc-pileup

<pre>
Options for input SAM/BAM/CRAM
   --sam         [STR: ]        : Input SAM/BAM/CRAM file. Must be sorted by coordinates and indexed
   --tag-group   [STR: CB]      : Tag representing readgroup or cell barcodes, in the case to partition the BAM file into multiple groups. For 10x genomics, use CB
   --tag-UMI     [STR: UB]      : Tag representing UMIs. For 10x genomics, use UB

Options for input VCF/BCF
   --vcf         [STR: ]        : Input VCF/BCF file, containing the AC and AN field
   --sm          [V_STR: ]      : List of sample IDs to compare to (default: use all)
   --sm-list     [STR: ]        : File containing the list of sample IDs to compare

Output Options
   --out         [STR: ]        : Output file prefix
   --sam-verbose [INT: 1000000] : Verbose message frequency for SAM/BAM/CRAM
   --vcf-verbose [INT: 10000]   : Verbose message frequency for VCF/BCF
   --skip-umi    [FLG: OFF]     : Do not generate [prefix].umi.gz file, which stores the regions covered by each barcode/UMI pair

SNP-overlapping Read filtering Options
   --cap-BQ      [INT: 40]      : Maximum base quality (higher BQ will be capped)
   --min-BQ      [INT: 13]      : Minimum base quality to consider (lower BQ will be skipped)
   --min-MQ      [INT: 20]      : Minimum mapping quality to consider (lower MQ will be ignored)
   --min-TD      [INT: 0]       : Minimum distance to the tail (lower will be ignored)
   --excl-flag   [INT: 3844]    : SAM/BAM FLAGs to be excluded

Cell/droplet filtering options
   --group-list  [STR: ]        : List of tag readgroup/cell barcode to consider in this run. All other barcodes will be ignored. This is useful for parallelized run
   --min-total   [INT: 0]       : Minimum number of total reads for a droplet/cell to be considered
   --min-uniq    [INT: 0]       : Minimum number of unique reads (determined by UMI/SNP pair) for a droplet/cell to be considered
   --min-snp     [INT: 0]       : Minimum number of SNPs with coverage for a droplet/cell to be considered
</pre>

demuxlet

<pre>
Options for input SAM/BAM/CRAM
   --sam               [STR: ]        : Input SAM/BAM/CRAM file. Must be sorted by coordinates and indexed
   --tag-group         [STR: CB]      : Tag representing readgroup or cell barcodes, in the case to partition the BAM file into multiple groups. For 10x genomics, use CB
   --tag-UMI           [STR: UB]      : Tag representing UMIs. For 10x genomics, use UB

Options for input Pileup format
   --plp               [STR: ]        : Input pileup format

Options for input VCF/BCF
   --vcf               [STR: ]        : Input VCF/BCF file, containing the individual genotypes (GT), posterior probability (GP), or genotype likelihood (PL)
   --field             [STR: GP]      : FORMAT field to extract the genotype, likelihood, or posterior from
   --geno-error-offset [FLT: 0.10]    : Offset of genotype error rate. [error] = [offset] + [1-offset]*[coeff]*[1-r2]
   --geno-error-coeff  [FLT: 0.00]    : Slope of genotype error rate. [error] = [offset] + [1-offset]*[coeff]*[1-r2]
   --r2-info           [STR: R2]      : INFO field name representing R2 value. Used for representing imputation quality
   --min-mac           [INT: 1]       : Minimum minor allele frequency
   --min-callrate      [FLT: 0.50]    : Minimum call rate
   --sm                [V_STR: ]      : List of sample IDs to compare to (default: use all)
   --sm-list           [STR: ]        : File containing the list of sample IDs to compare

Output Options
   --out               [STR: ]        : Output file prefix
   --alpha             [V_FLT: ]      : Grid of alpha to search for (default is 0.1, 0.2, 0.3, 0.4, 0.5)
   --doublet-prior     [FLT: 0.50]    : Prior of doublet
   --sam-verbose       [INT: 1000000] : Verbose message frequency for SAM/BAM/CRAM
   --vcf-verbose       [INT: 10000]   : Verbose message frequency for VCF/BCF

Read filtering Options
   --cap-BQ            [INT: 40]      : Maximum base quality (higher BQ will be capped)
   --min-BQ            [INT: 13]      : Minimum base quality to consider (lower BQ will be skipped)
   --min-MQ            [INT: 20]      : Minimum mapping quality to consider (lower MQ will be ignored)
   --min-TD            [INT: 0]       : Minimum distance to the tail (lower will be ignored)
   --excl-flag         [INT: 3844]    : SAM/BAM FLAGs to be excluded

Cell/droplet filtering options
   --group-list        [STR: ]        : List of tag readgroup/cell barcode to consider in this run. All other barcodes will be ignored. This is useful for parallelized run
   --min-total         [INT: 0]       : Minimum number of total reads for a droplet/cell to be considered
   --min-uniq          [INT: 0]       : Minimum number of unique reads (determined by UMI/SNP pair) for a droplet/cell to be considered
   --min
</pre>
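The genotype error formula quoted in the --geno-error-offset and --geno-error-coeff descriptions, [error] = [offset] + [1-offset]*[coeff]*[1-r2], can be evaluated directly to see how the two options interact. The sketch below is plain arithmetic on that formula; the function name geno_error is hypothetical and is not part of popscle.

```shell
# Hypothetical calculator for the documented formula:
#   error = offset + (1 - offset) * coeff * (1 - r2)
# Arguments: offset, coeff, r2 (imputation quality from the R2 INFO field).
geno_error() {
  awk -v o="$1" -v c="$2" -v r="$3" \
    'BEGIN { printf "%.4f\n", o + (1 - o) * c * (1 - r) }'
}

geno_error 0.10 0.00 0.8   # defaults: prints 0.1000
geno_error 0.10 0.50 0.8   # prints 0.1900
```

With the default coefficient of 0.00 the error rate equals the offset regardless of r2; a non-zero coefficient raises the assumed error for poorly imputed variants.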
