Conga
Clonotype Neighbor Graph Analysis
Clonotype Neighbor Graph Analysis (CoNGA) -- version 0.1.2
This repository contains the conga python package and associated scripts
and workflows. conga was developed to detect correlation between
T cell gene expression profile and TCR sequence in single-cell datasets.
We have since added support for gamma delta TCRs and for B cells, too.
conga currently supports:
- human TCRab, TCRgd, and Ig
- mouse TCRab, TCRgd, and Ig
- rhesus TCRab and TCRgd (NEW)
conga is in active development right now so the interface may change in
the next few months. Questions and requests can be directed to pbradley at fredhutch dot org or
stefan.schattgen at stjude dot org.
Further details on conga can be found in the Nature Biotechnology manuscript
"Integrating T cell receptor sequences and transcriptional profiles by clonotype neighbor graph analysis (CoNGA)"
by Stefan A. Schattgen, Kate Guion, Jeremy Chase Crawford, Aisha Souquette, Alvaro Martinez Barrio, Michael J.T. Stubbington,
Paul G. Thomas, and Philip Bradley, accessible here:
https://www.nature.com/articles/s41587-021-00989-2
(the original preprint is available on bioRxiv).
Table of Contents
- Running
- Installation
- Migrating Seurat data to CoNGA
- Merging multiple datasets for CoNGA analysis
- Updates
- SVG to PNG
- Testing CoNGA without going through the pain of installing it
- Examples
- The CoNGA data model: where stuff is stored
- Frequently Asked Questions
Running
Running conga on a single-cell dataset is a two- (or more) step process, as outlined below.
Python scripts are provided in the scripts/ directory, but the analysis steps can also be accessed interactively
in jupyter notebooks (for example, the simple pipeline and Seurat-to-conga notebooks in the top directory of this repo)
or in your own python scripts through the interface in the conga python package.
There's also a Google Colab notebook which you can open and run. If you want to
experiment before installing CoNGA locally, you can save a copy of that notebook
to your Google Drive and then edit and run the pipeline, either on the provided examples or on
data that you upload to the Colab instance.
The examples in the examples/ folder described below and in the jupyter notebooks feature publicly available data from 10x Genomics,
which can be downloaded in a single zip file or from the 10x Genomics datasets webpage.
- SETUP: The TCR data is converted to a form that can be read by conga, and then a matrix of TCRdist distances is computed. Kernel PCA is applied to this distance matrix to generate a PC matrix that can be used in clustering and dimensionality reduction. For 10x datasets this is accomplished with the python script scripts/setup_10x_for_conga.py. For example:
python conga/scripts/setup_10x_for_conga.py --filtered_contig_annotations_csvfile vdj_v1_hs_pbmc3_t_filtered_contig_annotations.csv --organism human
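The kernel PCA step can be sketched in plain numpy. This is the generic double-centering construction for extracting principal components from a precomputed distance matrix, not necessarily CoNGA's exact implementation, and the toy matrix below stands in for real TCRdist distances:

```python
import numpy as np

# toy symmetric distance matrix standing in for the TCRdist matrix
# (the real one comes from scripts/setup_10x_for_conga.py)
rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# kernel PCA on a precomputed distance matrix: double-center the
# squared distances to get a Gram (kernel) matrix, then eigendecompose
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
K = -0.5 * J @ (D ** 2) @ J

evals, evecs = np.linalg.eigh(K)       # eigenvalues in ascending order
order = np.argsort(evals)[::-1][:2]    # keep the top two components
pcs = evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))
print(pcs.shape)  # (6, 2)
```

The columns of pcs then play the same role on the TCR side that gene-expression PCs play on the GEX side: input for nearest-neighbor graphs, clustering, and 2D projections.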
- ANALYZE: The scripts/run_conga.py script implements the main pipeline and can be run as follows:
python conga/scripts/run_conga.py --graph_vs_graph --gex_data data/vdj_v1_hs_pbmc3_5gex_filtered_gene_bc_matrices_h5.h5 --gex_data_type 10x_h5 --clones_file vdj_v1_hs_pbmc3_t_filtered_contig_annotations_tcrdist_clones.tsv --organism human --outfile_prefix tmp_hs_pbmc3
- RE-ANALYZE: Step 2 will generate a processed .h5ad file that contains all the gene expression and TCR sequence information along with the results of clustering and dimensionality reduction. Subsequent re-analysis or downstream analysis can then be much faster if you "restart" from that file. Here we use the --all command line flag, which requests all the major analysis modes:
python conga/scripts/run_conga.py --restart tmp_hs_pbmc3_final.h5ad --all --outfile_prefix tmp_hs_pbmc3_restart
See the examples section below for more details.
Installation
We highly recommend installing CoNGA in a virtual environment, for example using the
Anaconda package manager. Linux users can check out the
Dockerfile for a minimal set of installation commands. The top of the
Google Colab jupyter notebook shows the necessary installation commands
for a notebook environment.
Details
- Create a virtual environment and install required packages.
Here are some commands that would create an anaconda python environment for
running CoNGA:
conda create -n conga_new_env ipython python=3.9
conda activate conga_new_env # or: "source activate conga_new_env" depending on your conda setup
conda install seaborn scikit-learn statsmodels numba pytables
conda install -c conda-forge python-igraph leidenalg louvain notebook
conda install -c intel tbb # optional
pip install scanpy
pip install fastcluster # optional
conda install pyyaml # optional, for yaml-formatted configuration files for scripts
- Clone the conga github repository.
Type this command wherever you want the conga/ directory to appear:
git clone https://github.com/phbradley/conga.git
If you don't have git installed, you can click the big green Code
button on the CoNGA github page and
download and unpack the software that way.
- Compile C++ programs (optional, but highly recommended)
NEW: We recently added a C++ implementation of TCRdist to speed up neighbor calculations on large datasets and to compute the background TCRdist distributions for the new 'TCR clumping' analysis. This is not required for the core functionality described in the original manuscript, but we highly recommend that you compile the C++ TCRdist code with your C++ compiler.
We've successfully used g++ from the GNU Compiler Collection (https://gcc.gnu.org/) to compile on
Linux and MacOS, and MinGW (http://www.mingw.org/) on Windows.
Using make on Linux or MacOS (you can edit conga/tcrdist_cpp/Makefile to
point to a C++ compiler other than g++):
cd conga/tcrdist_cpp
make
Or without make (for Windows):
cd conga/tcrdist_cpp
g++ -O3 -std=c++11 -Wall -I ./include/ -o ./bin/find_neighbors ./src/find_neighbors.cc
g++ -O3 -std=c++11 -Wall -I ./include/ -o ./bin/calc_distributions ./src/calc_distributions.cc
g++ -O3 -std=c++11 -Wall -I ./include/ -o ./bin/find_paired_matches ./src/find_paired_matches.cc
- Install conga into your virtual environment.
cd to the top-most conga directory and make sure your virtual environment is activated.
Then install conga into the environment with pip:
pip install -e .
- Ensure you have a tool for SVG to PNG conversion available.
See the section below on SVG to PNG conversion for more details.
Even more details
The calculations in the
conga manuscript were conducted with the following package versions:
scanpy==1.4.3 anndata==0.6.18 umap-learn==0.3.9 numpy==1.16.2 scipy==1.2.1 pandas==0.24.1 scikit-learn==0.20.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1
which can likely be installed with the following conda command:
conda create -n conga_classic_env ipython python=3.6 scanpy=1.4.3 umap-learn=0.3.9 louvain=0.6.1
Migrating Seurat data to CoNGA
We recommend using the write10XCounts function from the DropletUtils package for converting Seurat objects into 10x format for importing into CoNGA/scanpy.
require(Seurat)
require(DropletUtils)
hs1 <- readRDS('~/vdj_v1_hs_V1_sc_5gex.rds')
If the object contains only gene expression:
write10xCounts(x = hs1@assays$RNA@counts, path = './hs1_mtx/')
# import the hs1_mtx directory into CoNGA using the '10x_mtx' option
If the object contains both gene expression and antibody labeling:
# Concatenate the GEX and antibody labeling count matrices
# Here, ADT is the antibody labeling assay slot.
count_matrix <- rbind(hs1@assays$RNA@counts, hs1@assays$ADT@counts)
# create vector of feature type labels
features <- c(
rep("Gene Expression", nrow(hs1@assays$RNA@counts)),
rep("Antibody Capture", nrow(hs1@assays$ADT@counts))
)
# write out
write10xCounts( count_matrix,
path = './hs1_mtx/',
gene.id = rownames(count_matrix),
gene.symbol = rownames(count_matrix),
barcodes = colnames(count_matrix),
gene.type = features,
version = "3")
# import the hs1_mtx directory into CoNGA using the '10x_mtx' option
Merging multiple datasets into a single object for CoNGA analysis
This can be done in two easy steps using the setup_10x_clones.py and merge_samples.py scripts.
