CellBouncer
Tools for pooled single cell sequencing experiments: demultiplex cells, infer doublet rate, assign treatments/sgRNAs, infer ambient RNA from allele matching
Tools for checking cell identities and keeping the riffraff out of pooled single cell sequencing data sets.
| |I want to...|I have...|Tool to use|
|-|------------|---------|-----------|
| <img src="img/demux_species_mini.png" alt="demux_species" /> | Demultiplex cells by species |Raw reads, plus a transcriptome (FASTA) or annotation (GTF) and genome (FASTA) per species <p align="center">OR</p> A BAM file of reads mapped to a composite reference genome|demux_species|
| <img src="img/demux_mt_mini.png" alt="demux_mt" /> | Demultiplex cells by individual of origin, and I hope individuals are unrelated enough to have different mitochondrial haplotypes|A BAM file of aligned scATAC-seq or whole cell scRNA-seq data|demux_mt|
| <img src="img/demux_vcf_mini.png" alt="demux_vcf" /> | Demultiplex cells by individual of origin|VCF of known variants, plus a BAM file of aligned single cell sequencing data|demux_vcf|
| <img src="img/demux_tags_mini.png" alt="demux_tags" /> | Demultiplex individuals by custom label or treatment|FASTQs containing MULTIseq/HTO/CITE-seq data, or a table of pre-computed counts, optionally in MEX format|demux_tags|
| <img src="img/demux_tags_mini.png" alt="demux_tags" /> | Assign sgRNAs to cells|FASTQs containing sgRNA capture data, or a table of pre-computed counts, optionally in MEX format|demux_tags|
| <img src="img/quant_contam_mini.png" alt="quant_contam" /> | Quantify ambient RNA per cell, infer its origins, and optionally adjust gene counts|Output from demux_vcf (plus optional single-cell expression data to adjust, in MEX format)|quant_contam|
| <img src="img/doublet_dragon_mini.png" alt="doublet_dragon" /> | Infer global doublet rate and proportions of individuals|Output from one or more CellBouncer programs run on the same cells|doublet_dragon|
| <img src="img/bulkprops_mini.png" alt="bulkprops" /> | Determine proportion of individuals in a pool|A VCF of known variants, plus a BAM of aligned sequence data (can be bulk)|bulkprops|
Visualizing and comparing results
|I want to...|I have...|Tool to use|
|------------|---------|-----------|
| Visualize a set of labels and the pool compositions they produce at different confidence cutoffs | An .assignments file from a CellBouncer program |plot/assignment_llr.R|
| Compare two sets of labels on the same cells | Two .assignments files from CellBouncer programs run on the same data | plot/compare_assignments.R|
| Merge two sets of labels on the same cells into one set of labels | Two .assignments files from CellBouncer programs run on the same data | utils/merge_assignments.R |
| Compare two sets of pool proportions and assess significance if possible | Two files describing pool composition (i.e. from bulkprops or contamination profile from quant_contam), or one file describing pool composition and an .assignments file describing cell labels | utils/compare_props.R |
| Refine genotype calls to better match cell-individual labels | A preexisting set of genotypes in VCF format, a BAM file of aligned single-cell data, and an .assignments file mapping cells to individuals of origin | utils/refine_vcf |
| Plot species proportions | Output from demux_species | plot/species.R |
| Plot mitochondrial haplotypes | Output from demux_mt | plot/demux_mt_clust.R <br> plot/demux_mt_unclust.R |
| Plot ambient RNA profile | Output from quant_contam | plot/contam.R |
| Plot counts of cell identifications, according to different data types and to the consensus among them | Output from doublet_dragon | plot/doublet_dragon.R |
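To make the idea of "pool compositions at different confidence cutoffs" from the first row of this table concrete, here is a minimal Python sketch. It is not how plot/assignment_llr.R is implemented; it only assumes that each cell has an identity label and a confidence value (the log likelihood ratio stored in an .assignments file), and the function name and example data are hypothetical.

```python
from collections import Counter


def composition_at_cutoff(assignments, min_llr):
    """Return {identity: fraction of cells} among cells with llr >= min_llr.

    `assignments` is an iterable of (identity, llr) pairs. This helper is a
    hypothetical illustration, not part of CellBouncer.
    """
    kept = [identity for identity, llr in assignments if llr >= min_llr]
    total = len(kept)
    if total == 0:
        return {}
    return {name: n / total for name, n in Counter(kept).items()}


# Hypothetical (identity, llr) pairs; real values would come from an
# .assignments file.
labels = [
    ("donor1", 50.0),
    ("donor1", 5.0),
    ("donor2", 30.0),
    ("donor1+donor2", 2.0),
]

for cutoff in (0, 10):
    print(cutoff, composition_at_cutoff(labels, cutoff))
```

Raising the cutoff discards low-confidence cells, so the apparent pool composition can shift; comparing the result at several cutoffs shows how sensitive the proportions are to that choice.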
Manipulating input files
|I want to...|I have...|Tool to use|
|------------|---------|-----------|
|Split a BAM file into one file per cell identity | A BAM file of aligned single-cell sequencing data and a CellBouncer-format .assignments file | utils/bam_split_bcs |
|Tag reads in a BAM file to mark individual of origin | A BAM file of aligned single-cell sequencing data and a CellBouncer-format .assignments file | utils/bam_indiv_rg |
|Convert 10X or Scanpy (AnnData) data from .h5 to MEX format | A CellRanger-format .h5 or Scanpy-format .h5ad file | utils/h5tomex.py |
|Split MEX-format data into one data set per library/run | Single-cell expression data in MEX format | utils/split_mex_libs.py |
|Subset MEX-format data to specific cell barcodes | Single-cell expression data in MEX format | utils/subs_mex_bc.py |
|Subset MEX-format data to a specific feature type | Single-cell expression data in MEX format | utils/subs_mex_featuretype.py |
Installation
Using Docker
The included Dockerfile can be used to set up and compile everything required by CellBouncer.
Not using Docker
To install (see below):
- Clone the repository (and its submodules)
- Choose a conda environment file to install
  - All necessary dependencies: cellbouncer_minimum.yml
  - All necessary dependencies plus extra helper programs mentioned in documentation:
    - Mac OS X: cellbouncer_extra_osx.yml
    - Linux: cellbouncer_extra.yml
- Run make
For more information about installing or updating CellBouncer, see here.
Get the repository
git clone --recurse-submodules git@github.com:nkschaefer/cellbouncer.git
cd cellbouncer
Create conda environment
Linux
conda env create --file=cellbouncer_minimum.yml
conda activate cellbouncer
conda env config vars set LD_LIBRARY_PATH="${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}" -n cellbouncer
Mac OS X (M1)
CONDA_SUBDIR=osx-arm64 conda env create --file=cellbouncer_minimum.yml
conda activate cellbouncer
conda env config vars set DYLD_LIBRARY_PATH="${CONDA_PREFIX}/lib:${DYLD_LIBRARY_PATH}" -n cellbouncer
Mac OS X (Intel)
conda env create --file=cellbouncer_minimum.yml
conda activate cellbouncer
conda env config vars set DYLD_LIBRARY_PATH="${CONDA_PREFIX}/lib:${DYLD_LIBRARY_PATH}" -n cellbouncer
Compile
make
You've now got all the programs compiled, and you can run them as long as you remember to conda activate cellbouncer first.
Test data set
You can get a test data set from this link. It contains an example .bam file, vcf file, and cell hashing .counts file. The README in the linked directory will explain everything, but these give you the opportunity to test-run demux_species, demux_mt, demux_tags, demux_vcf, quant_contam, and doublet_dragon.
Overview
The programs in CellBouncer are standalone command line tools. Running one with no arguments or with -h prints detailed information about how to run it. Each program takes an --output_prefix/-o argument, a base name that is used for all of its output files.
Output files
Demultiplexing tools all write a file called [output_prefix].assignments, which describes each cell's identity. These files have 4 tab-separated columns:
- cell barcode (optionally with unique ID appended; see below)
- most likely identity (doublets are two names in alphabetical order separated by +)
- droplet type: S (for singlet), D (for doublet), or in some cases M (for multiplet, 3+ individuals, so far only considered by demux_tags)
- log likelihood ratio of the best to the second best assignment (a measure of confidence in the assignment)
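As an illustration of the four-column layout above, the following Python sketch reads .assignments rows into named records and counts droplet types. It assumes the file has no header line and that the fourth column parses as a number; the barcodes, donor names, and helper names are hypothetical and not part of CellBouncer.

```python
import csv
import io
from collections import Counter
from typing import NamedTuple


class Assignment(NamedTuple):
    barcode: str
    identity: str      # e.g. "donor1", or "donor1+donor2" for a doublet
    droplet_type: str  # "S", "D", or "M"
    llr: float         # confidence in the assignment


def read_assignments(handle):
    """Parse tab-separated .assignments rows from an open text handle."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        barcode, identity, droplet_type, llr = fields
        rows.append(Assignment(barcode, identity, droplet_type, float(llr)))
    return rows


# Hypothetical example content; with a real file, pass open(path) instead.
example = (
    "AAACCCAAGAAACACT\tdonor1\tS\t25.3\n"
    "AAACCCAAGAAACCAT\tdonor1+donor2\tD\t4.1\n"
    "AAACCCAAGAAACCCA\tdonor2\tS\t102.7\n"
)
cells = read_assignments(io.StringIO(example))
type_counts = Counter(c.droplet_type for c in cells)
print(type_counts["S"], "singlets,", type_counts["D"], "doublets")

# The two identities in a doublet can be recovered by splitting on "+".
print(cells[1].identity.split("+"))
```

Keeping the confidence value as a float makes it straightforward to filter cells by a cutoff before downstream analysis.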
Cell barcode format and merging with other data
To load data from CellBouncer into a single cell analysis tool like Seurat or [scanpy](https://scanpy.readth
