CompAIRR

CompAIRR (compairr) is a command line tool to compare two sets of adaptive immune receptor repertoires and compute their overlap. It can also identify which sequences are present in which repertoires. Furthermore, CompAIRR can cluster the sequences in a repertoire set. Sequence comparisons can be exact or approximate. CompAIRR has been shown to be very fast and to have a small memory footprint compared to similar tools, when up to 2 differences are allowed.

Installation

The code is C++11 standard compliant and should compile easily using make and a modern C++ compiler (e.g. GNU GCC or LLVM Clang). Run make clean, make, make test and make install in the main folder to clean, build, test and install the tool. There are no dependencies except for the C and C++ standard libraries.

Binaries for Linux (x86_64) and macOS (x86_64 and Arm64) are also distributed with each release.

A Dockerfile is included if you want to make a Docker image. A docker image may be built with the following command:

docker build -t compairr .

Ready-made Docker images for CompAIRR can be found on the Docker Hub.

CompAIRR can be installed on macOS using homebrew with brew install torognes/bioinf/compairr.

Tutorial

For an introduction to how to use CompAIRR, please have a look at the CompAIRR tutorial.

General options

Use the -h or --help option to show some help information.

Run the program with -v or --version for version information.

The type of operation that should be performed is specified with one of the options -m, -x, -c or -z (or the corresponding long option forms --matrix, --existence, --cluster, or --deduplicate).

The code is multi-threaded. The number of threads may be specified with the -t or --threads option.

The results will be written to standard out (stdout) unless a file name has been specified with the -o or --output-file option.

While the program is running it will print some status and progress information to standard error (stderr) unless a log file has been specified with the -l or --log option. Error messages and warnings will also be written here.

The default is to compare amino acid sequences, but nucleotide sequences are compared if the -n or --nucleotides option is given. The accepted amino acid symbols are ACDEFGHIKLMNPQRSTVWY, while the accepted nucleotide symbols are ACGTU. Lower case letters are also accepted. The program will abort with an error message if any other symbol is encountered in a sequence, unless one specifies the -u or --ignore-unknown option, in which case CompAIRR will simply ignore that sequence. If the program encounters an empty sequence it will also abort with an error message, unless the -e or --ignore-empty option is given.

By default, the sequences should be given in the junction or junction_aa column of the input file, for nucleotide and amino acid sequences, respectively. Alternatively, the sequences may be present in the cdr3 or cdr3_aa column, if the --cdr3 option is given.

The user can specify how many differences are allowed when comparing sequences, using the option -d or --differences. To allow indels (insertions or deletions) the option -i or --indels may be specified, otherwise only substitutions are allowed. By default, no differences are allowed. The -i option is allowed only when d=1. The number of differences allowed strongly influences the speed of CompAIRR. The program will be slower as more differences are allowed. When d=0 or d=1 it is very fast, but it will be relatively slow with d=2 and even slower when d>2. See the section on performance below for an example.

The V and J gene alleles specified for each sequence must also match, unless the -g or --ignore-genes option is in effect.

Computing overlap between two repertoire sets

To compute the overlap between two repertoire sets, use the -m or --matrix option.

For each of the two repertoire sets there must an input file of tab-separated values formatted according to the AIRR standard for rearrangements. The two input files are specified on the command line without any preceding option letter. If only one filename is specified on the command line, or the same filename is specified twice, it is assumed that the set should be compared to itself. Each file must contain the repertoire ID and either the nucleotide or the amino acid sequence of the rearrangement. If the repertoire ID column is missing, all sequences are assumed to belong to the same repertoire (with ID 1 or 2, respectively, for the two sets). A sequence ID may also be included. Unless they should be ignored, the V gene, the J gene, and the duplicate count is also needed.

Each set can contain many repertoires and each repertoire can contain many sequences. The tool will find the sequences in the two sets that are similar and output a matrix with results.

CompAIRR assumes that all sequences within each repertoire are distinct, and that the abundance of each sequence is indicated in the duplicate_count field in the input file. Duplicated sequences, i.e. identical sequences (with the same V and J genes) within the same repertoire, may lead to unexpected results. CompAIRR will warn if it detects duplicates. Duplicates may be merged with the --deduplicate command.

The similar sequences of each repertoire in each set are found by comparing the sequences and their V and J genes. The duplicate count of each sequence is taken into account and a matrix is output containing a value for each combination of repertoires in the two sets. The value is usually the sum of the products of the duplicate counts of all pairs of sequences in the two repertoires that match. If the option -f or --ignore-counts is specified, the duplicate count information is ignored and all counts are treated as 1. Instead of summing the product of the counts, the ratio, min, max, or mean may be used if specified with the -s or --score option. The Morisita-Horn index or Jaccard index will be calculated if MH or Jaccard is specified with the -s option. These indices can only be computed when d=0.

The output will be a matrix of values in a tab-separated plain text file. Two different formats can be selected. In the default format, the first line contains the hash character (#) followed by the repertoire ID's from the second set. The following lines contains the repertoire ID from the first set, followed by the values corresponding to the comparison of this repertoire with each of the repertoires in the second set.

An alternative output format is used when the -a or --alternative option is specified. It will write the results in a three column format with the repertoire ID from set 1 and set 2 in the two first columns, respectively, and the value in the third column. There will be one line for each combination of repertoires in the sets. The very first line will contain a hash character (#) followed by the field names separated by tabs.

If the -p or --pairs option is specified, CompAIRR will write information about all pairs of matching sequences to a specified TSV file. Please note that such files may grow very large when there are many matches. Use of multithreading may be of little use in this case. The order of the lines in the file is unspecified. The following columns from both input files will be included in the output: repertoire_id, sequence_id, duplicate_count, v_call, j_call, and junction. The term junction will be replaced with junction_aa, cdr3, or cdr3_aa as appropriate. Additional columns from the input files may be copied to the pairs file using the -k or --keep-columns option. Multiple columns, separated by commas (but no spaces), may be given. A warning will be given if any of the specified columns are missing. In the header, columns from the first and second input file will be suffixed by _1 and _2, respectively. The distance between the sequences will be included if the --distance option is included. This is usually the Hamming distance (minimum number of substitutions), unless the --indel (or -i) option is specified, in which case the distance is the Levenshtein distance (minimum number of substitutions or indels). If only the information in the pairs file is required, and not the information in the matrix, the storage and output of the matrix can be avoided with the --no-matrix option. This may save some memory and time if there are many repertoires in the sets.

Analysing in which repertoires a set of sequences are present

Use the option -x or --existence to analyse in which repertoires a set of sequences are present, and create a sequence presence matrix.

Two input files with repertoire sets in standard format must be specified on the command line. The first file should contain the different sequences to analyse. The sequence_id column must be present in this file. If the optional repertoire_id column is present, all those identifiers must be identical. The second file must contain the repertoires to match. The repertoire_id column must be present in the second file, otherwise the ID will be set to 2 for all sequences.

CompAIRR will identify in which repertoires each sequence is present and will output the results either as a matrix or as a three-column table (if the -a option is specified). The options -d, -i, -g, and -n (and the corresponding long option names --differences, --indels, `-

Compairr

Install / Use

README