VDJmatch: a software for database-guided prediction of T-cell receptor antigen specificity

VDJmatch is a command-line tool designed for matching T-cell receptor (TCR) repertoires against a database of TCR sequences with known antigen specificity. VDJmatch implements an API for interacting with and querying the VDJdb database, and serves as a backend for the VDJdb web browser. VDJmatch will automatically download and use the latest version of the VDJdb database; however, it is also possible to use a custom database provided by the user if it matches the VDJdb format specification.

VDJmatch accepts TCR clonotype table(s) as input and relies on the VDJtools framework to parse the output of commonly used immune repertoire sequencing (RepSeq) processing tools. See the format section of the VDJtools docs for the list of supported formats. Note that VDJmatch can be used with the metadata semantics introduced by VDJtools to facilitate annotation of multi-sample datasets.

Installing and running

VDJmatch is distributed as an executable JAR that can be downloaded from the releases section. The software is cross-platform and requires Java v1.8 or higher to run.

To run the executable JAR, use the java -jar path/to/vdjmatch-version.jar match [options] command as described below. Running without any [options] or with the -h option will display the help message.

The latest version of VDJdb will be downloaded the first time you run VDJmatch. Note that in order to update to the most recent version later on, you will need to run the java -jar path/to/vdjmatch-version.jar Update command.

VDJmatch command line options

The following syntax should be used to run VDJmatch for RepSeq sample(s):

java -Xmx4G -jar path/to/vdjmatch-version.jar match \
      [options] [sample1 sample2 sample3 ... if -m is not specified] output_prefix

The first part of the command runs the JAR file, sets the JVM memory limit to 4 GB (increase it if the JVM fails with a heap size error), and points to the VDJmatch executable JAR (version should be replaced with the software version). The second part includes options, input samples and the prefix of output files.

General

| Option name | Argument example | Description |
|-----------------------------|---------------------------------|-----------------------------------------------------------|
| ‑h | | Display help message. |
| ‑m, ‑‑metadata | /path/to/metadata.txt | A metadata file, holding paths to samples and user-provided information. |
| ‑‑software | MITCR,MIGEC,etc | Input RepSeq data format, see formats supported for conversion. By default expects input in VDJtools format. |
| ‑c, ‑‑compress | | Compress sample-level summary output with GZIP. |

If the -m option is specified, the list of sample file names should be omitted and the list of options should be followed by output_prefix.
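When VDJmatch is launched from a script, the argument order described above can be assembled programmatically. The following is a minimal Python sketch and not part of VDJmatch itself: the helper name build_match_command and its parameters are hypothetical, and it uses only options documented in this README (-S and -R, the two required options, plus -m).

```python
from typing import List, Optional, Sequence


def build_match_command(
    jar_path: str,
    species: str,
    gene: str,
    output_prefix: str,
    samples: Sequence[str] = (),
    metadata: Optional[str] = None,
    memory: str = "4G",
) -> List[str]:
    """Assemble the argument list for a `match` run (hypothetical helper)."""
    # With -m the sample list must be omitted; without it, samples are required.
    if metadata is not None and samples:
        raise ValueError("sample files must be omitted when a metadata file is given")
    if metadata is None and not samples:
        raise ValueError("provide either sample files or a metadata file")

    cmd = ["java", f"-Xmx{memory}", "-jar", jar_path, "match", "-S", species, "-R", gene]
    if metadata is not None:
        cmd += ["-m", metadata]
    else:
        cmd += list(samples)
    # The output prefix always comes last.
    cmd.append(output_prefix)
    return cmd


if __name__ == "__main__":
    example = build_match_command(
        "path/to/vdjmatch-version.jar", "human", "TRB", "output_prefix",
        samples=["sample1", "sample2"],
    )
    print(" ".join(example))
```

The returned list can be passed to subprocess.run(cmd, check=True); keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.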

Database pre-filtering

| Option name | Argument example | Description |
|-----------------------------|----------------------------------|--------------|
| ‑S, ‑‑species | human,mouse,etc | (Required) Species name. All samples should belong to the same species, only one species is allowed. |
| ‑R, ‑‑gene | TRA,TRB,etc | (Required) Name of the receptor gene. All samples should contain the same receptor gene, only one gene is allowed. |
| ‑‑filter | "__antigen.species__=~'EBV'" | (Advanced) Logical filter expression that will be evaluated for database columns. |
| ‑‑vdjdb‑conf | 1 | VDJdb confidence level threshold, from 0 (lowest) to 3 (highest). Default is 0. |
| ‑‑min‑epi‑size | 10 | Minimal number of unique CDR3 sequences per epitope in VDJdb, filters underrepresented epitopes. Default is 10. |

The --filter option supports Java/Groovy syntax, Regex, .contains(), .startsWith(), etc. Parts of the expression marked with double underscore (__, e.g. __antigen.epitope__) will be substituted with corresponding values from database rows. Those parts should be named exactly as columns in the database, see VDJdb specification for the list of column names.
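To make the double-underscore substitution concrete, here is a minimal Python sketch of that step alone. It is an illustration, not VDJmatch's implementation (which evaluates the resulting expression as Java/Groovy): the function name substitute_placeholders is hypothetical, and inserting each value as a single-quoted string literal is an assumption.

```python
import re
from typing import Mapping

# Matches __column.name__; column names are assumed to contain only
# letters, digits and dots.
PLACEHOLDER = re.compile(r"__([A-Za-z0-9.]+)__")


def substitute_placeholders(expression: str, row: Mapping[str, str]) -> str:
    """Replace every __column__ marker with the value from one database row."""

    def replace(match: "re.Match[str]") -> str:
        column = match.group(1)
        if column not in row:
            raise KeyError(f"unknown database column: {column}")
        # Assumption: values are inserted as single-quoted string literals.
        return "'" + row[column].replace("'", "\\'") + "'"

    return PLACEHOLDER.sub(replace, expression)


if __name__ == "__main__":
    row = {"antigen.species": "EBV", "antigen.epitope": "GLCTLVAML"}
    print(substitute_placeholders("__antigen.species__=~'EBV'", row))
    # prints: 'EBV'=~'EBV'
```

A marker that does not correspond to a column raises an error in this sketch, which mirrors the requirement that marked parts be named exactly as the database columns.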

VDJdb confidence level used by --vdjdb-conf is assigned based on the details of TCR specificity assay for each VDJdb record, see VDJdb confidence scoring for details on this procedure.
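The combined effect of --vdjdb-conf and --min-epi-size can be pictured with a small Python sketch. This is an illustration under assumptions, not VDJmatch's code: the function prefilter is hypothetical, records are plain dictionaries keyed by the VDJdb-style column names cdr3, antigen.epitope and vdjdb.score, the confidence threshold is treated as inclusive, and applying it before counting unique CDR3 sequences per epitope is an assumed order.

```python
from collections import defaultdict
from typing import Dict, List, Set

Record = Dict[str, object]


def prefilter(records: List[Record], min_conf: int = 0, min_epi_size: int = 10) -> List[Record]:
    """Keep confident records whose epitope has enough unique CDR3 sequences."""
    # Step 1 (assumed order): drop records below the confidence threshold.
    confident = [r for r in records if int(r["vdjdb.score"]) >= min_conf]

    # Step 2: count unique CDR3 sequences per epitope among the remaining records.
    cdr3_by_epitope: Dict[object, Set[object]] = defaultdict(set)
    for r in confident:
        cdr3_by_epitope[r["antigen.epitope"]].add(r["cdr3"])

    # Step 3: drop records of underrepresented epitopes.
    return [
        r for r in confident
        if len(cdr3_by_epitope[r["antigen.epitope"]]) >= min_epi_size
    ]


if __name__ == "__main__":
    # Toy records with made-up sequences and epitope labels.
    toy = [
        {"cdr3": "CASSA", "antigen.epitope": "EPI_A", "vdjdb.score": 1},
        {"cdr3": "CASSB", "antigen.epitope": "EPI_A", "vdjdb.score": 2},
        {"cdr3": "CASSD", "antigen.epitope": "EPI_B", "vdjdb.score": 3},
    ]
    print(len(prefilter(toy, min_conf=1, min_epi_size=2)))  # prints: 2
```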

Using external database (advanced)

| Option name | Argument example | Description |
|-----------------------------|----------------------------------|--------------|
| ‑‑database | /path/to/my_db | Path and prefix of an external database. Should point to files with '.txt' and '.meta.txt' suffixes (the database itself and database metadata). |
| ‑‑use‑fat‑db | | In case running with a built-in database, will use full database version instead of slim one. |

The full database contains extended information on the method used to identify a given specific TCR and on the sample source, but has a higher degree of redundancy (several identical TCR:pMHC pairs from different publications, etc.) that can complicate post-analysis.

Search parameters

| Option name | Argument example | Description |
|-----------------------------|----------------------------------|--------------|
| ‑‑v‑match | | Require exact Variable segment ID match (ignoring alleles) when searching the database. |
| ‑‑j‑match | | Require exact Joining segment ID match (ignoring alleles) when searching the database. |
| ‑O, ‑‑search‑scope | 2,1,2, 3,0,0,3, ... | Sets CDR3 sequence search parameters aka search scope: allowed number of substitutions (s), insertions (i), deletions (d) / or indels (id) and total number of mutations (t). Default is 0,0,0. |
| ‑‑search‑exhaustive | 0, 1 or 2 | Perform exhaustive CDR3 alignment: 0 - no (fast), 1 - check and select best alignment for smallest edit distance, 2 - select best alignment across all edit distances within search scope (slow). Default is 1. |

Search scope should be specified in either s,i,d,t or s,id,t form. While the second form is symmetric and counts the sum of insertions and deletions (indels), the first form is not symmetric - insertions and deletions are counted with respect to the query TCR sequence (i.e. clonotype records from input samples). Total number of mutations t specifies the edit distance threshold.
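The two search scope forms can be illustrated with a short Python sketch that parses a scope string and checks whether a given combination of edits fits within it. This is a sketch under assumptions rather than VDJmatch's own code: the names SearchScope, parse_scope and within_scope are hypothetical, and the edit counts are assumed to be already known from an alignment of the query against a database CDR3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SearchScope:
    max_subst: int
    max_ins: Optional[int]    # set only for the s,i,d,t form
    max_del: Optional[int]    # set only for the s,i,d,t form
    max_indel: Optional[int]  # set only for the s,id,t form
    max_total: int


def parse_scope(text: str) -> SearchScope:
    """Parse 's,i,d,t' (asymmetric) or 's,id,t' (symmetric) scope strings."""
    parts = [int(p) for p in text.split(",")]
    if len(parts) == 4:
        s, i, d, t = parts
        return SearchScope(s, i, d, None, t)
    if len(parts) == 3:
        s, indel, t = parts
        return SearchScope(s, None, None, indel, t)
    raise ValueError(f"expected 3 or 4 comma-separated integers, got: {text!r}")


def within_scope(scope: SearchScope, subst: int, ins: int, dels: int) -> bool:
    """Check edit counts (relative to the query sequence) against the scope."""
    if subst > scope.max_subst:
        return False
    if scope.max_indel is not None:
        # Symmetric form: only the sum of insertions and deletions matters.
        if ins + dels > scope.max_indel:
            return False
    elif ins > scope.max_ins or dels > scope.max_del:
        # Asymmetric form: insertions and deletions are limited separately.
        return False
    # Total number of mutations is the edit distance threshold.
    return subst + ins + dels <= scope.max_total


if __name__ == "__main__":
    scope = parse_scope("2,1,2")
    print(within_scope(scope, subst=1, ins=1, dels=0))  # prints: True
    print(within_scope(scope, subst=2, ins=1, dels=0))  # prints: False
```

In the second call each individual limit is respected, but the total of three mutations exceeds t = 2, which is why the match is rejected.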

Note that VDJmatch running time can greatly increase for large (wider than 4,2,4) search scopes.

With --search-exhaustive 2 the algorithm will compute an exact global alignment for CDR3 sequences, which is quite slow. For a small/moderate search scope (2 or fewer indels), --search-exhaustive 1 is effectively the same as --search-exhaustive 2. Exhaustive search will choose the best alignment based on the VDJAM scoring (see below); this option has no effect if full VDJMATCH scoring is not used.

Scoring parameters

| Option name | Argument example | Description |
|-----------------------------|----------------------------------|--------------|
| ‑A, ‑‑scoring‑vdjmatch | | Use full VDJMATCH algorithm that computes full alignment score as a function of CDR3 mutations (weighted with VDJAM scoring matrix) and pre-computed V/J segment match scores. |
| ‑‑scoring‑mode | 0 or 1 | Either 0: scores mismatches only (faster) or 1: computes scoring for whole sequences (slower). Default is 1. |

If --scoring-vdjmatch is not set, VDJmatch will just count the number of mismatches and ignore V/J segment alignment.

CDR3 alignment score is computed as:

--scoring-mode 0 $S(CDR3_1, CDR3_2) = \sum_s [M(s_1,s_2) - \max(M(s_1,s_1), M(s_2,s_2))] - \sum_{g} M(g_1,g_1)$, where $s: 1 \rightarrow 2$ stands for a substitution, $g: 1 \rightarrow '-'$ stands for a gap and $M(1,2)$ is the VDJAM matrix.

--scoring-mode 1 $S(CDR3_1, CDR3_2) = aln(CDR3_1, CDR3_2) - \max(aln(CDR3_1, CDR3_1), aln(CDR3_2, CDR3_2))$, where $aln(CDR3_1, CDR3_2)$ is the global alignment score without gap penalty between sequences $CDR3_1$ and $CDR3_2$ using the VDJAM matrix.
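The two formulas above can be tried out on a toy example. The Python sketch below is an illustration under several assumptions, not the VDJmatch implementation: the VDJAM matrix is not reproduced here, so a made-up toy_matrix (4 for identical residues, -1 otherwise) stands in for $M$; the two sequences are supplied already aligned, with '-' marking gaps, instead of being aligned by the tool; and for a gap the residue facing the '-' is taken as $g_1$ regardless of which sequence carries the gap.

```python
def toy_matrix(a: str, b: str) -> int:
    """Stand-in for the VDJAM matrix (made-up values, for illustration only)."""
    return 4 if a == b else -1


def score_mode0(aln1: str, aln2: str, m=toy_matrix) -> int:
    """--scoring-mode 0 formula: visit only substituted and gapped columns."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(aln1, aln2):
        if a == b:
            continue  # matching columns contribute nothing in this mode
        if a == "-" or b == "-":
            residue = b if a == "-" else a
            score -= m(residue, residue)  # gap term: -M(g_1, g_1)
        else:
            score += m(a, b) - max(m(a, a), m(b, b))  # substitution term
    return score


def score_mode1(aln1: str, aln2: str, m=toy_matrix) -> int:
    """--scoring-mode 1 formula: whole-sequence score minus the best self-score."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have equal length")
    # Score of the given alignment with zero gap penalty (gapped columns add 0).
    aligned = sum(m(a, b) for a, b in zip(aln1, aln2) if a != "-" and b != "-")
    self1 = sum(m(a, a) for a in aln1 if a != "-")
    self2 = sum(m(b, b) for b in aln2 if b != "-")
    return aligned - max(self1, self2)


if __name__ == "__main__":
    print(score_mode0("CASSLGQF", "CASSVGQF"))  # one substitution, prints: -5
    print(score_mode1("CASSLGQF", "CASSVGQF"))  # prints: -5
    print(score_mode0("CASSLGQF", "CASS-GQF"))  # one gap, prints: -4
    print(score_mode1("CASSLGQF", "CASS-GQF"))  # prints: -4
```

In these toy cases both modes give the same score, and identical sequences score 0 in either mode, so the score acts as a non-positive penalty relative to a perfect match. Agreement between the modes on this input is a property of the toy matrix and the fixed alignment, not a general guarantee.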

Full score / probabi
