Tools for checking cell identities and keeping the riffraff out of pooled single cell sequencing data sets.

| |I want to...|I have...|Tool to use|
|-|------------|---------|-----------|
| <img src="img/demux_species_mini.png" alt="demux_species" /> | Demultiplex cells by species |Raw reads, plus a transcriptome (FASTA) or annotation (GTF) and genome (FASTA) per species OR A BAM file of reads mapped to a composite reference genome|demux_species|
| <img src="img/demux_mt_mini.png" alt="demux_mt" /> | Demultiplex cells by individual of origin, and I hope individuals are unrelated enough to have different mitochondrial haplotypes|A BAM file of aligned scATAC-seq or whole cell scRNA-seq data|demux_mt|
| <img src="img/demux_vcf_mini.png" alt="demux_vcf" /> | Demultiplex cells by individual of origin|VCF of known variants, plus a BAM file of aligned single cell sequencing data|demux_vcf|
| <img src="img/demux_tags_mini.png" alt="demux_tags" /> | Demultiplex individuals by custom label or treatment|FASTQs containing MULTIseq/HTO/CITE-seq data, or a table of pre-computed counts, optionally in MEX format|demux_tags|
| <img src="img/demux_tags_mini.png" alt="demux_tags" /> | Assign sgRNAs to cells|FASTQs containing sgRNA capture data, or a table of pre-computed counts, optionally in MEX format|demux_tags|
| <img src="img/quant_contam_mini.png" alt="quant_contam" /> | Quantify ambient RNA per cell, infer its origins, and optionally adjust gene counts|Output from demux_vcf (plus optional single-cell expression data to adjust, in MEX format)|quant_contam|
| <img src="img/doublet_dragon_mini.png" alt="doublet_dragon" /> | Infer global doublet rate and proportions of individuals|Output from one or more CellBouncer programs run on the same cells|doublet_dragon|
| <img src="img/bulkprops_mini.png" alt="bulkprops" /> | Determine proportion of individuals in a pool|A VCF of known variants, plus a BAM of aligned sequence data (can be bulk)|bulkprops|

Visualizing and comparing results

|I want to...|I have...|Tool to use|
|------------|---------|-----------|
| Visualize a set of labels and the pool compositions they produce at different confidence cutoffs | An .assignments file from a CellBouncer program | plot/assignment_llr.R |
| Compare two sets of labels on the same cells | Two .assignments files from CellBouncer programs run on the same data | plot/compare_assignments.R |
| Merge two sets of labels on the same cells into one set of labels | Two .assignments files from CellBouncer programs run on the same data | utils/merge_assignments.R |
| Compare two sets of pool proportions and assess significance if possible | Two files describing pool composition (i.e. from bulkprops or contamination profile from quant_contam), or one file describing pool composition and an .assignments file describing cell labels | utils/compare_props.R |
| Refine genotype calls to better match cell-individual labels | A preexisting set of genotypes in VCF format, a BAM file of aligned single-cell data, and an .assignments file mapping cells to individuals of origin | utils/refine_vcf |
| Plot species proportions | Output from demux_species | plot/species.R |
| Plot mitochondrial haplotypes | Output from demux_mt | plot/demux_mt_clust.R, plot/demux_mt_unclust.R |
| Plot ambient RNA profile | Output from quant_contam | plot/contam.R |
| Plot counts of cell identifications, according to different data types and to the consensus among them | Output from doublet_dragon | plot/doublet_dragon.R |

Manipulating input files

|I want to...|I have...|Tool to use|
|------------|---------|-----------|
| Split a BAM file into one file per cell identity | A BAM file of aligned single-cell sequencing data and a CellBouncer-format .assignments file | utils/bam_split_bcs |
| Tag reads in a BAM file to mark individual of origin | A BAM file of aligned single-cell sequencing data and a CellBouncer-format .assignments file | utils/bam_indiv_rg |
| Convert 10X or Scanpy (AnnData) data from .h5 to MEX format | A CellRanger-format .h5 or Scanpy-format .h5ad file | utils/h5tomex.py |
| Split MEX-format data into one data set per library/run | Single-cell expression data in MEX format | utils/split_mex_libs.py |
| Subset MEX-format data to specific cell barcodes | Single-cell expression data in MEX format | utils/subs_mex_bc.py |
| Subset MEX-format data to a specific feature type | Single-cell expression data in MEX format | utils/subs_mex_featuretype.py |

Installation

Using Docker

The included Dockerfile can be used to set up and compile everything required by CellBouncer.

Not using Docker

To install (see below):

- Clone the repository (and its submodules)
- Choose a conda environment file to install
  - All necessary dependencies: cellbouncer_minimum.yml
  - All necessary dependencies plus extra helper programs mentioned in documentation:
    - Mac OS X: cellbouncer_extra_osx.yml
    - Linux: cellbouncer_extra.yml
- Run make

For more information about installing or updating CellBouncer, see here.

Get the repository

git clone --recurse-submodules git@github.com:nkschaefer/cellbouncer.git
cd cellbouncer

Create conda environment

Linux

conda env create --file=cellbouncer_minimum.yml
conda activate cellbouncer
conda env config vars set LD_LIBRARY_PATH="${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}" -n cellbouncer

Mac OS X (M1)

CONDA_SUBDIR=osx-arm64 conda env create --file=cellbouncer_minimum.yml
conda activate cellbouncer
conda env config vars set DYLD_LIBRARY_PATH="${CONDA_PREFIX}/lib:${DYLD_LIBRARY_PATH}" -n cellbouncer

Mac OS X (Intel)

conda env create --file=cellbouncer_minimum.yml
conda activate cellbouncer
conda env config vars set DYLD_LIBRARY_PATH="${CONDA_PREFIX}/lib:${DYLD_LIBRARY_PATH}" -n cellbouncer
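
The three platform variants above differ only in the library path variable and, on Apple silicon, the CONDA_SUBDIR prefix. As a rough illustration (not part of CellBouncer), the sketch below assembles the matching commands for the current machine. The function name `conda_setup_commands` is hypothetical, and it assumes that `platform.machine()` reports `arm64` on M1 Macs, which may not hold if Python itself runs under Rosetta.

```python
import platform


def conda_setup_commands(system=None, machine=None,
                         env_file="cellbouncer_minimum.yml"):
    """Return the conda commands listed above for the given platform.

    Hypothetical helper for illustration; it only builds strings and
    does not run conda.
    """
    system = system or platform.system()
    machine = machine or platform.machine()

    create = f"conda env create --file={env_file}"
    if system == "Darwin":
        # macOS uses DYLD_LIBRARY_PATH for the dynamic loader.
        lib_var = "DYLD_LIBRARY_PATH"
        if machine == "arm64":
            # Assumption: arm64 identifies an M1-class Mac.
            create = "CONDA_SUBDIR=osx-arm64 " + create
    else:
        # Linux uses LD_LIBRARY_PATH.
        lib_var = "LD_LIBRARY_PATH"

    set_var = (
        f"conda env config vars set {lib_var}="
        f'"${{CONDA_PREFIX}}/lib:${{{lib_var}}}" -n cellbouncer'
    )
    return [create, "conda activate cellbouncer", set_var]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for command in conda_setup_commands():
        print(command)
```

Printing the commands rather than executing them keeps the sketch safe to run: `conda activate` in particular only takes effect in an interactive shell, so these are meant to be copied and run manually.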

Compile

make

You've now got all the programs compiled, and you can run them as long as you remember to conda activate cellbouncer first.

Test data set

You can get a test data set from this link. It contains an example BAM file, VCF file, and cell hashing .counts file. The README in the linked directory explains everything; together, these files let you test-run demux_species, demux_mt, demux_tags, demux_vcf, quant_contam, and doublet_dragon.

Overview

The programs in CellBouncer are standalone command line tools. If you run one of them with no arguments or with -h, it will give you detailed information about how to run it. Each program takes an --output_prefix/-o argument, a base name used for all of its output files.

Output files

Demultiplexing tools all write a file called [output_prefix].assignments, which describes each cell's identity. These files have four tab-separated columns:

cell barcode (optionally with unique ID appended; see below)
most likely identity (doublets are two names in alphabetical order separated by +)
droplet type: S (for singlet), D (for doublet), or in some cases M (for multiplet, 3+ individuals, so far only considered by demux_tags)
log likelihood ratio of the best to the second-best assignment (a measure of confidence in the assignment)
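
To make the four-column layout concrete, here is a minimal sketch of reading an .assignments file in Python and tallying droplet types above a confidence cutoff. It is not part of CellBouncer: the names `Assignment`, `read_assignments`, and `summarize` are hypothetical, and the barcodes, individual names, and values in `example` are made up for illustration.

```python
import csv
from collections import Counter
from typing import NamedTuple


class Assignment(NamedTuple):
    barcode: str       # cell barcode (possibly with a unique ID appended)
    identity: str      # most likely identity; doublets look like "A+B"
    droplet_type: str  # "S", "D", or "M"
    llr: float         # confidence in the assignment

    @property
    def individuals(self):
        # Doublet and multiplet names are joined with "+".
        return self.identity.split("+")


def read_assignments(lines):
    """Parse tab-separated .assignments lines into Assignment records."""
    rows = []
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        barcode, identity, droplet_type, llr = fields[:4]
        rows.append(Assignment(barcode, identity, droplet_type, float(llr)))
    return rows


def summarize(assignments, min_llr=0.0):
    """Count droplet types and identities at or above a confidence cutoff."""
    kept = [a for a in assignments if a.llr >= min_llr]
    by_type = Counter(a.droplet_type for a in kept)
    by_identity = Counter(a.identity for a in kept)
    return by_type, by_identity


# Made-up example rows in the documented four-column format.
example = [
    "AAACCCAAGAAACACT\tdonorA\tS\t45.2",
    "AAACCCAAGAAACCAT\tdonorA+donorB\tD\t12.7",
    "AAACCCAAGAAACCCA\tdonorB\tS\t3.1",
]

if __name__ == "__main__":
    rows = read_assignments(example)
    by_type, by_identity = summarize(rows, min_llr=10.0)
    print(dict(by_type), dict(by_identity))
```

For a real run, an open file handle works in place of `example`, for instance `read_assignments(open("my_prefix.assignments"))`, where `my_prefix` stands for whatever was passed to --output_prefix/-o.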

Cell barcode format and merging with other data

To load data from CellBouncer into a single cell analysis tool like Seurat or [scanpy](https://scanpy.readth
