Migmap

HTS-compatible wrapper for IgBlast V-(D)-J mapping tool

MiGMAP: mapper for full-length T- and B-cell repertoire sequencing

In a nutshell, this software is a smart wrapper for the IgBlast V-(D)-J mapping tool, designed to facilitate the analysis of immune receptor libraries profiled using high-throughput sequencing. The package includes additional experimental modules for contig assembly, error correction and immunoglobulin lineage tree construction.

The software is distributed as an executable JAR file and a data bundle.

NOTE: The last IgBlastWrp version is available here (source and readme are available here). MiGMAP is a completely rewritten version of the original software.

Motivation

IgBlast is an excellent V-(D)-J mapping tool, able to correctly map even severely hypermutated antibody variants. Although it is a gold standard, the following limitations of IgBlast v1.4.0 have driven MiGMAP development:

- It doesn't extract the sequence of the CDR3 region directly, nor does it provide coordinates for the CDR3 region in reads. It reports the reference Cys residue of the Variable segment and the Variable segment end in CDR3, but not the Phe/Trp residue of the J segment that marks the end of CDR3.
- Its output is not straightforward to parse and summarize into a readable clonotype abundance table containing CDR3 sequences, segment assignments and a list of somatic hypermutations.
- It doesn't account for sequence quality.
- It is somewhat hard to get it running with a custom segment reference and with species other than human and mouse.

Features

The present wrapper adds the following capabilities to IgBlast:

- Run on FASTQ data.
- Use a comprehensive V/D/J segment database for human, mouse, rat, rabbit and rhesus monkey.
- Speed up processing by piping reads to IgBlast and parsing the output in parallel, as the built-in -num_threads argument of igblastn doesn't offer much optimization.
- Assemble clonotypes and apply various filtering options, such as quality filtering for CDR3 N-regions and mutations, non-coding sequence filtering, etc.
- Report mutations (including indels) in V, D and J segments, grouped by CDR/FW region, at both the nucleotide and amino-acid level.
- Apply frequency- and parsimony-based error correction.
- Quantify somatic hypermutations and build clonotype trees with a post-analysis module; the output is compatible with the VDJtools post-analysis software.

Pre-requisites

Java v1.8 or higher is required to run MiGMAP. Users should then install the IgBlast v1.4.0 binaries appropriate for their system and make sure that igblastn and makeblastdb are on $PATH, or specify the directory that contains the binaries with the --blast-dir /path/to/bin/ argument when running MiGMAP. IgBlast v1.4.0 binaries can also be downloaded from here.

Note that MiGMAP also works with IgBlast v1.6.1, although this has not been tested extensively.

A data folder named data/ containing the binary databases required for IgBlast to work is provided in the release bundle. You can also explicitly specify its path with --data-dir /path/to/data/ for troubleshooting purposes.
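
Before the first run it can help to confirm that the required executables are actually visible to the shell. The snippet below is a minimal sketch that relies only on the standard `command -v` builtin; the helper name `missing_tools` is hypothetical and not part of MiGMAP.

```shell
#!/bin/sh
# Print the name of every listed tool that cannot be found on $PATH.
# `missing_tools` is a hypothetical helper name, not part of MiGMAP.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools java igblastn makeblastdb)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "java, igblastn and makeblastdb were found on PATH"
fi
```

If igblastn or makeblastdb is reported as missing, either extend $PATH or point MiGMAP at the binaries with --blast-dir.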

Installation

See the latest release section for the MiGMAP package. On Windows you need to both install IgBlast and download the latest release. On macOS and Linux, MiGMAP can be easily installed using Homebrew/Linuxbrew or Bioconda (no need to download anything or manually install IgBlast):

brew tap mikessh/repseq
brew install migmap-macos # or migmap-linux

Another option is to install MiGMAP using Bioconda; see the corresponding recipe.

MiGMAP can be compiled from sources using Gradle with gradle build. Note that for the tests to pass, the IgBlast binaries must be on $PATH; you may need to modify the following part of build.gradle:

test {
    environment "PATH", "$System.env.PATH:/usr/local/bin/:/usr/local/ncbi/igblast/bin/"
}

Execution

General

To see the full list of MiGMAP options run

java -jar migmap.jar -h ...

The memory limit can be raised with the -Xmx JVM argument (-Xmx8G will be sufficient in most cases). If MiGMAP was installed using Homebrew, the command to execute it is simply migmap ....

The following command will process sample.fastq.gz file containing human Immunoglobulin Heavy Chain reads, assemble clonotypes and store them in out.txt:

java -jar migmap.jar -R IGH -S human sample.fastq.gz out.txt
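
The -Xmx setting mentioned earlier is a JVM option, so it has to appear before -jar rather than among the MiGMAP options. The sketch below only assembles and prints such a command line so that it can be inspected first; `migmap_cmd` and the `MIGMAP_JAR` variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Assemble (but do not run) a MiGMAP command line with a raised heap limit.
# `migmap_cmd` and MIGMAP_JAR are hypothetical names, not part of MiGMAP.
MIGMAP_JAR=${MIGMAP_JAR:-migmap.jar}

migmap_cmd() {
    heap=$1; shift
    # JVM options such as -Xmx must precede -jar.
    printf 'java -Xmx%s -jar %s' "$heap" "$MIGMAP_JAR"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

migmap_cmd 8G -R IGH -S human sample.fastq.gz out.txt
# prints: java -Xmx8G -jar migmap.jar -R IGH -S human sample.fastq.gz out.txt
```

Paste the printed line into a terminal to run it.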

MiGMAP accepts both FASTQ and FASTA input files, either raw or GZIP-compressed. MiGMAP can also be run in per-read mode and allows piping of results, e.g.:

java -jar migmap.jar --by-read -R IGH -S human sample.fastq.gz - | grep "IGHV1-8" > out.txt

Several receptor chains can be specified, e.g. -R IGH,IGK,IGL. It is always recommended to map to the complete set of TCR or IG genes and filter contaminations (e.g. TRA<>TRB) later.

The list of possible options is the following:

| Option | Definition |
|---|---|
| --blast-dir | Path to the folder that contains the igblastn and makeblastdb binaries. By default they are assumed to be on $PATH and are executed directly. |
| --data-dir | Path to the folder that contains the data bundle (internal_data/ and optional_file/ directories). By default this is the bundle provided with the MiGMAP binaries, i.e. install_dir/data/. |
| --custom-database | Path to a custom segments database. By default the built-in database is used. See segments.txt and Using your own references for details. |
| -n | Number of reads to analyze before stopping. All reads are analyzed by default. |
| -p | Number of threads to use. All available threads are used by default. |
| --report | Path to the file that will store the MiGMAP report (extraction and filtering efficiency for the current input, etc). A report line is appended if the file exists. |
| -R | REQUIRED. Receptor gene and chain. Several chains can be specified, separated with commas. Allowed values: IGH,IGL,IGK,TRA,TRB,TRG,TRD. |
| -S | REQUIRED. Species. Allowed values: human,mouse,rat,rabbit,rhesus_monkey. |
| --all-alleles | Use all alleles provided in the antigen receptor segment database. By default only the major allele is used (*01 according to IMGT). |
| --use-kabat | Use KABAT nomenclature for FR/CDR region markup. IMGT nomenclature is used by default. |
| --allow-incomplete | Report clonotypes with partial CDR3 mapping (lacking the J germline part, etc). By default those are not included in the output. |
| --allow-no-cdr3 | Report clonotypes with no CDR3 mapping. By default those are not included in the output. |
| --allow-noncoding | Report clonotypes that have either a stop codon or a frameshift in their receptor sequence. By default those are not included in the output. |
| --allow-noncanonical | Report clonotypes that have a non-canonical CDR3 (one that does not start with a C residue or end with an F/W residue). By default those are not included in the output. |
| -q | Quality threshold, 2-40; defaults to 25. Filter out reads that have at least one error or N-nucleotide with a quality value below the corresponding threshold. |
| --details | Specifies the nucleotide and amino acid sequences for certain FR/CDR regions that will be added to the output. Allowed values: fr1X,cdr1X,fr2X,cdr2X,fr3X,cdr3X,fr4X,contigX where X stands for nt or aa. By default all of those fields are included. |
| --by-read | Output mapping results for each read (see Output format below, excluding frequency and count fields) together with read headers. |
| --unmapped | Specifies a file to store unmapped reads. |
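
Because --report appends a line when the report file already exists, several samples can share one report. The following sketch prints one command per input file without running anything; the helper name `batch_commands`, the input file names and the output naming scheme are assumptions made for illustration, while the options themselves are the ones listed above.

```shell
#!/bin/sh
# Print one MiGMAP command per FASTQ file, all appending to the same report.
# `batch_commands` and the <sample>.txt output naming are hypothetical.
batch_commands() {
    report=$1; shift
    for fq in "$@"; do
        sample=$(basename "$fq" .fastq.gz)
        printf 'java -jar migmap.jar -R IGH,IGK,IGL -S human --report %s %s %s.txt\n' \
            "$report" "$fq" "$sample"
    done
}

batch_commands report.txt a.fastq.gz b.fastq.gz
# prints:
# java -jar migmap.jar -R IGH,IGK,IGL -S human --report report.txt a.fastq.gz a.txt
# java -jar migmap.jar -R IGH,IGK,IGL -S human --report report.txt b.fastq.gz b.txt
```

Printing the commands first makes it easy to check them before running.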

Additional routines

There are several built-in routines implementing common result processing and analysis tasks. Note that you should use -cp instead of -jar when executing a module and specify its full package name, as shown below. When using -cp (classpath), always make sure that the path to the executable JAR file is set correctly; otherwise the JVM will throw an uninformative error message.

Merging contigs

The MergeContigs routine of MiGMAP merges clonotypes that are represented by embedded contigs. MiGMAP uses the CDR3 nucleotide sequence, V and J segment names and the mutation list to define a unique clonotype signature. Therefore, when IG/TCR sequence coverage is variable (which can result from intrinsic library properties and/or read trimming), clonotypes can be ambiguous: e.g. if two reads come from a clonotype with a mutation in the FR1 region and one of the reads doesn't cover FR1, MiGMAP will report two clonotypes. To resolve these ambiguities, run the MergeContigs utility as follows:

java -cp migmap.jar com.antigenomics.migmap.MergeContigs out.txt out_merged.txt

This routine will generate a list of clonotypes represented by the set of longest completely overlapping contigs. Note: make sure that you haven't manually excluded the contignt field from the output, as in that case the routine will fail.
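
To make the notion of an embedded contig concrete, here is a toy sketch that keeps only those sequences that are not contained, as a substring, in another sequence from the same list. It illustrates the embedding idea only. It is not the MergeContigs algorithm, which also takes the clonotype signature into account, and the helper name `longest_contigs` and the example sequences are made up.

```shell
#!/bin/sh
# Toy illustration: print sequences that are not embedded (as a substring)
# in a different sequence of the list. Identical sequences are all kept.
# `longest_contigs` is a hypothetical helper, not MiGMAP code.
longest_contigs() {
    for a in "$@"; do
        embedded=0
        for b in "$@"; do
            if [ "$a" != "$b" ]; then
                case "$b" in
                    *"$a"*) embedded=1 ;;
                esac
            fi
        done
        if [ "$embedded" -eq 0 ]; then
            printf '%s\n' "$a"
        fi
    done
}

longest_contigs ACGTTGCA GTTG TTTT
# prints:
# ACGTTGCA
# TTTT
```

Here GTTG is dropped because it is fully contained in ACGTTGCA, which is how a shorter read from the same clonotype ends up embedded in a longer contig.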

Error correction

PCR and sequencing errors, as well as hot-spot PCR errors in the case of UMI-corrected data, can generate a great deal of artificial (erroneous) clonotypes, especially in full-length IG sequence analysis. To filter erroneous sub-variants and append their read count to corresponding parent clonotypes, ex
