Conga
Clonotype Neighbor Graph Analysis
Clonotype Neighbor Graph Analysis (CoNGA) -- version 0.1.2
This repository contains the conga python package and associated scripts
and workflows. conga was developed to detect correlation between
T cell gene expression profile and TCR sequence in single-cell datasets.
We have since added support for gamma delta TCRs and for B cells, too.
conga currently supports:
- human TCRab, TCRgd, and Ig
- mouse TCRab, TCRgd, and Ig
- rhesus TCRab and TCRgd (NEW)
conga is in active development right now so the interface may change in
the next few months. Questions and requests can be directed to pbradley at fredhutch dot org or
stefan.schattgen at stjude dot org.
Further details on conga can be found in the Nature Biotechnology manuscript
"Integrating T cell receptor sequences and transcriptional profiles by clonotype neighbor graph analysis (CoNGA)"
by Stefan A. Schattgen, Kate Guion, Jeremy Chase Crawford, Aisha Souquette, Alvaro Martinez Barrio, Michael J.T. Stubbington,
Paul G. Thomas, and Philip Bradley, accessible here:
https://www.nature.com/articles/s41587-021-00989-2
(the original preprint is available on bioRxiv).
Table of Contents
- Running
- Installation
- Migrating Seurat data to CoNGA
- Merging multiple datasets for CoNGA analysis
- Updates
- SVG to PNG
- Testing CoNGA without going through the pain of installing it
- Examples
- The CoNGA data model: where stuff is stored
- Frequently Asked Questions
Running
Running conga on a single-cell dataset is a two- (or more) step process, as outlined below.
Python scripts are provided in the scripts/ directory, but the analysis steps can also be accessed interactively
in jupyter notebooks (for example, the simple pipeline and Seurat-to-conga notebooks in the top directory of this repo)
or in your own python scripts through the interface in the conga python package.
There's also a Google Colab notebook which you can open and run. If you want to
experiment before installing CoNGA locally, you can save a copy of that notebook
to your Google Drive and then edit and run the pipeline, either on the provided examples or on
data that you upload to the Colab instance.
The examples in the examples/ folder described below and in the jupyter notebooks feature publicly available data from 10x Genomics,
which can be downloaded in a single zip file or from the 10x Genomics datasets webpage.
- SETUP: The TCR data is converted to a form that can be read by conga, and then a matrix of TCRdist distances is computed. Kernel PCA is applied to this distance matrix to generate a PC matrix that can be used in clustering and dimensionality reduction. For 10x datasets this is accomplished with the python script scripts/setup_10x_for_conga.py. For example:
python conga/scripts/setup_10x_for_conga.py --filtered_contig_annotations_csvfile vdj_v1_hs_pbmc3_t_filtered_contig_annotations.csv --organism human
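The kernel PCA step can be sketched in plain numpy. This is the generic double-centering construction for extracting principal components from a precomputed distance matrix, not necessarily CoNGA's exact implementation, and the toy matrix below stands in for real TCRdist distances:

```python
import numpy as np

# toy symmetric distance matrix standing in for the TCRdist matrix
# (the real one comes from scripts/setup_10x_for_conga.py)
rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# kernel PCA on a precomputed distance matrix: double-center the
# squared distances to get a Gram (kernel) matrix, then eigendecompose
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
K = -0.5 * J @ (D ** 2) @ J

evals, evecs = np.linalg.eigh(K)       # eigenvalues in ascending order
order = np.argsort(evals)[::-1][:2]    # keep the top two components
pcs = evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))
print(pcs.shape)  # (6, 2)
```

The columns of pcs then play the same role on the TCR side that gene-expression PCs play on the GEX side: input for nearest-neighbor graphs, clustering, and 2D projections.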
- ANALYZE: The scripts/run_conga.py script implements the main pipeline and can be run as follows:
python conga/scripts/run_conga.py --graph_vs_graph --gex_data data/vdj_v1_hs_pbmc3_5gex_filtered_gene_bc_matrices_h5.h5 --gex_data_type 10x_h5 --clones_file vdj_v1_hs_pbmc3_t_filtered_contig_annotations_tcrdist_clones.tsv --organism human --outfile_prefix tmp_hs_pbmc3
- RE-ANALYZE: Step 2 will generate a processed .h5ad file that contains all the gene expression and TCR sequence information along with the results of clustering and dimensionality reduction. Subsequent re-analysis or downstream analysis can then be much faster if you "restart" from that file. Here we use the --all command line flag, which requests all the major analysis modes:
python conga/scripts/run_conga.py --restart tmp_hs_pbmc3_final.h5ad --all --outfile_prefix tmp_hs_pbmc3_restart
See the examples section below for more details.
Installation
We highly recommend installing CoNGA in a virtual environment, for example using the
Anaconda package manager. Linux users can check out the
Dockerfile for a minimal set of installation commands. The top of the
Google Colab jupyter notebook shows the necessary installation commands
for a notebook environment.
Details
- Create a virtual environment and install required packages.
Here are some commands that would create an anaconda python environment for
running CoNGA:
conda create -n conga_new_env ipython python=3.9
conda activate conga_new_env # or: "source activate conga_new_env" depending on your conda setup
conda install seaborn scikit-learn statsmodels numba pytables
conda install -c conda-forge python-igraph leidenalg louvain notebook
conda install -c intel tbb # optional
pip install scanpy
pip install fastcluster # optional
conda install pyyaml # optional, for yaml-formatted configuration files for scripts
- Clone the conga github repository.
Type this command wherever you want the conga/ directory to appear:
git clone https://github.com/phbradley/conga.git
If you don't have git installed, you can click the big green Code
button on the CoNGA github page and
download and unpack the software that way.
- Compile C++ programs (optional, but highly recommended)
NEW: We recently added a C++ implementation of TCRdist to speed up neighbor calculations on large datasets and to compute the background TCRdist distributions for the new 'TCR clumping' analysis. This is not required for the core functionality described in the original manuscript, but we highly recommend that you compile the C++ TCRdist code with your C++ compiler.
We've successfully used g++ from the GNU Compiler Collection (https://gcc.gnu.org/) to compile on
Linux and MacOS, and MinGW (http://www.mingw.org/) on Windows.
Using make on Linux or MacOS (you can edit conga/tcrdist_cpp/Makefile to
point to a C++ compiler other than g++):
cd conga/tcrdist_cpp
make
Or without make (for Windows):
cd conga/tcrdist_cpp
g++ -O3 -std=c++11 -Wall -I ./include/ -o ./bin/find_neighbors ./src/find_neighbors.cc
g++ -O3 -std=c++11 -Wall -I ./include/ -o ./bin/calc_distributions ./src/calc_distributions.cc
g++ -O3 -std=c++11 -Wall -I ./include/ -o ./bin/find_paired_matches ./src/find_paired_matches.cc
- Install conga into your virtual environment.
cd to the top-most conga directory and make sure your virtual environment is activated.
Then install conga into the environment with pip:
pip install -e .
- Ensure you have a tool for SVG to PNG conversion available.
See the section below on SVG to PNG conversion for more details.
Even more details
The calculations in the
conga manuscript were conducted with the following package versions:
scanpy==1.4.3 anndata==0.6.18 umap-learn==0.3.9 numpy==1.16.2 scipy==1.2.1 pandas==0.24.1 scikit-learn==0.20.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1
which can likely be installed with the following conda command:
conda create -n conga_classic_env ipython python=3.6 scanpy=1.4.3 umap-learn=0.3.9 louvain=0.6.1
Migrating Seurat data to CoNGA
We recommend using the write10XCounts function from the DropletUtils package for converting Seurat objects into 10x format for importing into CoNGA/scanpy.
require(Seurat)
require(DropletUtils)
hs1 <- readRDS('~/vdj_v1_hs_V1_sc_5gex.rds')
If the object contains only gene expression:
write10xCounts(x = hs1@assays$RNA@counts, path = './hs1_mtx/')
# import the hs1_mtx directory into CoNGA using the '10x_mtx' option
If the object contains both gene expression and antibody labeling:
# Concatenate the GEX and antibody labeling count matrices
# Here, ADT is the antibody labeling assay slot.
count_matrix <- rbind(hs1@assays$RNA@counts, hs1@assays$ADT@counts)
# create vector of feature type labels
features <- c(
rep("Gene Expression", nrow(hs1@assays$RNA@counts)),
rep("Antibody Capture", nrow(hs1@assays$ADT@counts))
)
# write out
write10xCounts( count_matrix,
path = './hs1_mtx/',
gene.id = rownames(count_matrix),
gene.symbol = rownames(count_matrix),
barcodes = colnames(count_matrix),
gene.type = features,
version = "3")
# import the hs1_mtx directory into CoNGA using the '10x_mtx' option
Merging multiple datasets into a single object for CoNGA analysis
This can be done in two easy steps using the setup_10x_clones.py and merge_samples.py scripts.
