Comparison of Adaptive Immune Receptor Repertoires
CompAIRR
CompAIRR (compairr) is a command line tool to compare two sets of
adaptive immune receptor repertoires and compute their overlap. It can
also identify which sequences are present in which repertoires.
Furthermore, CompAIRR can cluster the sequences in a repertoire
set. Sequence comparisons can be exact or approximate. CompAIRR has
been shown to be very fast and to have a small memory footprint
compared to similar tools, when up to 2 differences are allowed.
Installation
The code is C++11 standard compliant and should compile easily using
make and a modern C++ compiler (e.g. GNU GCC or LLVM Clang). Run
make clean, make, make test and make install in the main
folder to clean, build, test and install the tool. There are no
dependencies except for the C and C++ standard libraries.
Binaries for Linux (x86_64) and macOS (x86_64 and Arm64) are also distributed with each release.
A Dockerfile is included for building a Docker image. The
image may be built with the following command:
docker build -t compairr .
Ready-made Docker images for CompAIRR can be found on the Docker Hub.
CompAIRR can be installed on macOS using Homebrew with
brew install torognes/bioinf/compairr.
Tutorial
For an introduction to how to use CompAIRR, please have a look at the CompAIRR tutorial.
General options
Use the -h or --help option to show some help information.
Run the program with -v or --version for version information.
The type of operation that should be performed is specified with one
of the options -m, -x, -c or -z (or the corresponding long option
forms --matrix, --existence, --cluster, or --deduplicate).
The code is multi-threaded. The number of threads may be specified
with the -t or --threads option.
The results will be written to standard output (stdout) unless a file
name has been specified with the -o or --output-file option.
While the program is running it will print some status and progress
information to standard error (stderr) unless a log file has been
specified with the -l or --log option. Error messages and warnings
will also be written here.
The default is to compare amino acid sequences, but nucleotide
sequences are compared if the -n or --nucleotides option is given.
The accepted amino acid symbols are ACDEFGHIKLMNPQRSTVWY, while the
accepted nucleotide symbols are ACGTU. Lower case letters are also
accepted. The program will abort with an error message if any other
symbol is encountered in a sequence, unless one specifies the -u or
--ignore-unknown option, in which case CompAIRR will simply ignore
that sequence. If the program encounters an empty sequence it will
also abort with an error message, unless the -e or --ignore-empty
option is given.
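The symbol rules above can be illustrated with a short Python sketch. This is not CompAIRR's own code; the function name `check_sequence` and its return labels are made up for illustration. It only mirrors the documented behaviour: the two accepted alphabets, acceptance of lower case letters, and the separate handling of empty sequences and unknown symbols.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDES = set("ACGTU")


def check_sequence(seq, nucleotides=False):
    """Return 'ok', 'empty' or 'unknown' for one input sequence.

    Hypothetical helper: the labels are illustrative, not CompAIRR output.
    """
    if not seq:
        return "empty"
    alphabet = NUCLEOTIDES if nucleotides else AMINO_ACIDS
    # Lower case letters are accepted, so compare in upper case.
    if any(symbol not in alphabet for symbol in seq.upper()):
        return "unknown"
    return "ok"
```

A sequence classified as `unknown` corresponds to the case where the program aborts unless -u is given, and `empty` to the case where it aborts unless -e is given.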
By default, the sequences should be given in the junction or
junction_aa column of the input file, for nucleotide and amino acid
sequences, respectively. Alternatively, the sequences may be present
in the cdr3 or cdr3_aa column, if the --cdr3 option is given.
The user can specify how many differences are allowed when comparing
sequences, using the option -d or --differences. To allow indels
(insertions or deletions) the option -i or --indels may be
specified, otherwise only substitutions are allowed. By default, no
differences are allowed. The -i option is allowed only when d=1. The
number of differences allowed strongly influences the speed of
CompAIRR. The program will be slower as more differences
are allowed. When d=0 or d=1 it is very fast, but it will be relatively
slow with d=2 and even slower when d>2. See the section on performance
below for an example.
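As a rough illustration of what the -d and -i options mean, the following Python sketch tests whether two sequences match under a given number of allowed differences. It is a naive pairwise check written for clarity, not the algorithm CompAIRR uses internally, and the function names are hypothetical.

```python
def hamming_within(a, b, d):
    """True if a and b have equal length and differ in at most d positions."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= d


def one_edit_within(a, b):
    """True if a and b differ by at most one substitution or one indel."""
    if len(a) == len(b):
        return hamming_within(a, b, 1)
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Deleting one character from the longer sequence must give the shorter.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def sequences_match(a, b, d=0, indels=False):
    """Mirror the documented rules: substitutions only, or indels when d=1."""
    if indels:
        if d != 1:
            raise ValueError("indels are only allowed when d=1")
        return one_edit_within(a, b)
    return hamming_within(a, b, d)
```

For example, `sequences_match("CASSL", "CASL", d=1)` is false because only substitutions are allowed by default, while the same call with `indels=True` is true.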
The V and J gene alleles specified for each sequence must also match,
unless the -g or --ignore-genes option is in effect.
Computing overlap between two repertoire sets
To compute the overlap between two repertoire sets, use the -m or
--matrix option.
For each of the two repertoire sets there must be an input file of tab-separated values formatted according to the AIRR standard for rearrangements. The two input files are specified on the command line without any preceding option letter. If only one filename is specified on the command line, or the same filename is specified twice, it is assumed that the set should be compared to itself. Each file must contain the repertoire ID and either the nucleotide or the amino acid sequence of the rearrangement. If the repertoire ID column is missing, all sequences are assumed to belong to the same repertoire (with ID 1 or 2, respectively, for the two sets). A sequence ID may also be included. Unless they should be ignored, the V gene, the J gene, and the duplicate count are also needed.
Each set can contain many repertoires and each repertoire can contain many sequences. The tool will find the sequences in the two sets that are similar and output a matrix with results.
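To make the expected input layout concrete, here is a minimal Python sketch that reads such a tab-separated file with the standard csv module. The column names are the ones listed elsewhere in this document; the function `read_repertoire_set`, the example gene names, and the example data are hypothetical. The sketch assumes amino acid sequences in the junction_aa column and that genes and counts are not being ignored.

```python
import csv
import io


def read_repertoire_set(handle, default_id, seq_column="junction_aa"):
    """Read AIRR-style TSV rows into a list of dicts (illustrative only)."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            # Fall back to the default ID when the column is missing.
            "repertoire_id": record.get("repertoire_id") or default_id,
            "sequence": record[seq_column],
            "v_call": record["v_call"],
            "j_call": record["j_call"],
            "duplicate_count": int(record["duplicate_count"]),
        })
    return rows


# Made-up example without a repertoire_id column.
EXAMPLE_TSV = (
    "junction_aa\tv_call\tj_call\tduplicate_count\n"
    "CASSLGQ\tTRBV5-1\tTRBJ2-1\t3\n"
    "CASSFGQ\tTRBV7-2\tTRBJ1-2\t1\n"
)

rows = read_repertoire_set(io.StringIO(EXAMPLE_TSV), default_id="1")
```

Because the example has no repertoire_id column, every row ends up in repertoire "1", matching the default described above for the first set.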
CompAIRR assumes that all sequences within each repertoire are
distinct, and that the abundance of each sequence is indicated in the
duplicate_count field in the input file. Duplicated sequences,
i.e. identical sequences (with the same V and J genes) within the same
repertoire, may lead to unexpected results. CompAIRR will warn if it
detects duplicates. Duplicates may be merged with the --deduplicate
command.
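The notion of a duplicate used here can be sketched in Python as follows. The key is the combination of repertoire, sequence, V gene and J gene, as described above. Summing the duplicate counts when merging is an assumption made for this illustration; this document does not state how --deduplicate combines the counts.

```python
from collections import Counter


def merge_duplicates(rows):
    """Merge rows sharing (repertoire_id, sequence, v_call, j_call).

    Assumption for illustration: counts of merged rows are summed.
    """
    merged = Counter()
    for repertoire_id, sequence, v_call, j_call, count in rows:
        merged[(repertoire_id, sequence, v_call, j_call)] += count
    return dict(merged)


# Made-up rows: the first two are duplicates, the third differs in J gene.
example_rows = [
    ("R1", "CASSLGQ", "TRBV5-1", "TRBJ2-1", 3),
    ("R1", "CASSLGQ", "TRBV5-1", "TRBJ2-1", 2),
    ("R1", "CASSLGQ", "TRBV5-1", "TRBJ1-2", 4),
]
merged_rows = merge_duplicates(example_rows)
```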
The similar sequences of each repertoire in each set are found by
comparing the sequences and their V and J genes. The duplicate count
of each sequence is taken into account and a matrix is output
containing a value for each combination of repertoires in the two
sets. The value is usually the sum of the products of the duplicate
counts of all pairs of sequences in the two repertoires that match. If
the option -f or --ignore-counts is specified, the duplicate count
information is ignored and all counts are treated as 1. Instead of
summing the product of the counts, the ratio, min, max, or mean may be
used if specified with the -s or --score option. The Morisita-Horn
index or Jaccard index will be calculated if MH or Jaccard is
specified with the -s option. These indices can only be computed
when d=0.
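The default score, the sum of the products of the duplicate counts of matching pairs, can be written out as a small Python sketch for the exact-match case (d=0). The data layout, function names and example values are hypothetical; each repertoire is represented as a dict from (sequence, V gene, J gene) to duplicate count, which relies on sequences within a repertoire being distinct.

```python
def overlap_product(rep_a, rep_b, ignore_counts=False):
    """Sum of count products over entries present in both repertoires."""
    total = 0
    for key, count_a in rep_a.items():
        if key in rep_b:
            # With counts ignored, every matching pair contributes 1.
            total += 1 if ignore_counts else count_a * rep_b[key]
    return total


def overlap_matrix(set_1, set_2, ignore_counts=False):
    """One value per combination of repertoires from the two sets."""
    return {
        (id_1, id_2): overlap_product(rep_1, rep_2, ignore_counts)
        for id_1, rep_1 in set_1.items()
        for id_2, rep_2 in set_2.items()
    }


# Made-up example data.
set_1 = {"A": {("CASSL", "TRBV1", "TRBJ1"): 3, ("CASSF", "TRBV2", "TRBJ1"): 2}}
set_2 = {
    "X": {("CASSL", "TRBV1", "TRBJ1"): 4},
    "Y": {("CASSF", "TRBV2", "TRBJ1"): 5, ("CASSL", "TRBV1", "TRBJ2"): 7},
}
matrix = overlap_matrix(set_1, set_2)
```

Here the value for (A, X) is 3 × 4 = 12 and the value for (A, Y) is 2 × 5 = 10; the entry with count 7 does not contribute because its J gene differs.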
The output will be a matrix of values in a tab-separated plain text
file. Two different formats can be selected. In the default format,
the first line contains the hash character (#) followed by the
repertoire IDs from the second set. The following lines contain the
repertoire ID from the first set, followed by the values corresponding
to the comparison of this repertoire with each of the repertoires in
the second set.
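A file in the default format could be read back with a few lines of Python, as sketched below. The sketch assumes that the hash character occupies the first tab-separated field of the header line; the exact spacing is not specified in this document, so check it against real output. The function name and the example text are made up.

```python
def parse_matrix(text):
    """Parse the default matrix layout into {(set1_id, set2_id): value}."""
    lines = [line for line in text.splitlines() if line]
    header = lines[0].split("\t")
    # Assumption: header[0] is the "#" marker, the rest are set 2 IDs.
    column_ids = header[1:]
    values = {}
    for line in lines[1:]:
        fields = line.split("\t")
        row_id = fields[0]
        for column_id, field in zip(column_ids, fields[1:]):
            values[(row_id, column_id)] = float(field)
    return values


# Made-up example with two repertoires in each set.
EXAMPLE_MATRIX = "#\tX\tY\nA\t12\t10\nB\t0\t3\n"
parsed = parse_matrix(EXAMPLE_MATRIX)
```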
An alternative output format is used when the -a or --alternative
option is specified. It will write the results in a three column
format with the repertoire ID from set 1 and set 2 in the two first
columns, respectively, and the value in the third column. There will
be one line for each combination of repertoires in the sets. The very
first line will contain a hash character (#) followed by the field
names separated by tabs.
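The three-column layout can be sketched in Python as follows. The field names used in the header line are placeholders chosen for this example, not the names CompAIRR writes, and placing the hash character directly before the first name is likewise an assumption.

```python
def to_three_column(values, field_names=("id_1", "id_2", "value")):
    """Format {(set1_id, set2_id): value} as tab-separated three-column text.

    The default field names are hypothetical placeholders.
    """
    lines = ["#" + "\t".join(field_names)]
    for (id_1, id_2), value in values.items():
        lines.append(f"{id_1}\t{id_2}\t{value}")
    return "\n".join(lines) + "\n"


# Made-up values: one line per combination of repertoires.
table = to_three_column({("A", "X"): 12, ("A", "Y"): 10})
```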
If the -p or --pairs option is specified, CompAIRR will write
information about all pairs of matching sequences to a specified TSV
file. Please note that such files may grow very large when there are
many matches. Multithreading may be of little benefit in this
case. The order of the lines in the file is unspecified. The following
columns from both input files will be included in the output:
repertoire_id, sequence_id, duplicate_count, v_call, j_call,
and junction. The term junction will be replaced with
junction_aa, cdr3, or cdr3_aa as appropriate. Additional columns
from the input files may be copied to the pairs file using the -k or
--keep-columns option. Multiple columns, separated by commas (but no
spaces), may be given. A warning will be given if any of the specified
columns are missing. In the header, columns from the first and second
input file will be suffixed by _1 and _2, respectively. The
distance between the sequences will be included if the --distance
option is included. This is usually the Hamming distance (minimum
number of substitutions), unless the --indels (or -i) option is
specified, in which case the distance is the Levenshtein distance
(minimum number of substitutions or indels). If only the information
in the pairs file is required, and not the information in the matrix,
the storage and output of the matrix can be avoided with the
--no-matrix option. This may save some memory and time if there are
many repertoires in the sets.
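The two distance measures mentioned above can be computed with the textbook definitions sketched below in Python. This illustrates the definitions only and is not how CompAIRR computes the value internally; `pair_distance` is a hypothetical name.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion from a
                current[j - 1] + 1,          # insertion into a
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def pair_distance(a, b, indels=False):
    """Hamming distance by default, Levenshtein when indels are allowed."""
    return levenshtein(a, b) if indels else hamming(a, b)
```

For example, `pair_distance("CASSLGQ", "CASTLGQ")` is 1 (one substitution) and `pair_distance("CASSLGQ", "CASLGQ", indels=True)` is 1 (one deletion).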
Analysing in which repertoires a set of sequences are present
Use the option -x or --existence to analyse in which repertoires a
set of sequences are present, and create a sequence presence matrix.
Two input files with repertoire sets in standard format must be
specified on the command line. The first file should contain the
different sequences to analyse. The sequence_id column must be
present in this file. If the optional repertoire_id column is
present, all those identifiers must be identical. The second file must
contain the repertoires to match. The repertoire_id column should be
present in the second file; if it is missing, the ID will be set to 2
for all sequences.
CompAIRR will identify in which repertoires each sequence is present
and will output the results either as a matrix or as a three-column
table (if the -a option is specified). The options -d, -i, -g,
and -n (and the corresponding long option names --differences,
--indels, `-
