<a name="started"></a>Getting Started

ALERT: minimap2.com is a phishing site. Please don't use anything from that website.

git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long sequences against a reference genome
./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -x map-ont -d MT-human-ont.mmi test/MT-human.fa
./minimap2 -a MT-human-ont.mmi test/MT-orang.fa > test.sam
# use presets (no test data)
./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam       # PacBio CLR genomic reads
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam         # Oxford Nanopore genomic reads
./minimap2 -ax map-hifi ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.19+)
./minimap2 -ax lr:hq ref.fa ont-Q20.fq.gz > aln.sam       # Nanopore Q20 genomic reads (v2.27+)
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam      # short genomic paired-end reads
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam       # spliced long reads (strand unknown)
./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam  # noisy Nanopore direct RNA-seq
./minimap2 -ax splice:hq -uf ref.fa query.fa > aln.sam    # PacBio Kinnex/Iso-seq (RNA-seq)
./minimap2 -ax splice --junc-bed=anno.bed12 ref.fa query.fa > aln.sam  # use annotated junctions
./minimap2 -ax splice:sr ref.fa r1.fq r2.fq > aln.sam     # short-read RNA-seq (v2.29+)
./minimap2 -ax splice:sr -j anno.bed12 ref.fa r1.fq r2.fq > aln.sam
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf             # intra-species asm-to-asm alignment
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf     # PacBio read overlap
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf    # Nanopore read overlap
# man page for detailed command line options
man ./minimap2.1

Getting Started
Users' Guide
Developers' Guide
Limitations

<a name="uguide"></a>Users' Guide

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy reads sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the [minimap2 paper][doi] or the [preprint][preprint].

<a name="install"></a>Installation

Minimap2 is optimized for x86-64 CPUs. You can acquire precompiled binaries from the [release page][release] with:

curl -L https://github.com/lh3/minimap2/releases/download/v2.30/minimap2-2.30_x64-linux.tar.bz2 | tar -jxvf -
./minimap2-2.30_x64-linux/minimap2

If you want to compile from the source, you need to have a C compiler, GNU make and zlib development files installed. Then type make in the source code directory to compile. If you see compilation errors, try make sse2only=1 to disable SSE4 code, which will make minimap2 slightly slower.

Minimap2 also works with ARM CPUs supporting the NEON instruction sets. To compile for 32 bit ARM architectures (such as ARMv7), use make arm_neon=1. To compile for for 64 bit ARM architectures (such as ARMv8), use make arm_neon=1 aarch64=1.

Minimap2 can use [SIMD Everywhere (SIMDe)][simde] library for porting implementation to the different SIMD instruction sets. To compile using SIMDe, use make -f Makefile.simde. To compile for ARM CPUs, use Makefile.simde with the ARM related command lines given above.

<a name="general"></a>General usage

Without any options, minimap2 takes a reference database and a query sequence file as input and produce approximate mapping, without base-level alignment (i.e. coordinates are only approximate and no CIGAR in output), in the [PAF format][paf]:

minimap2 ref.fa query.fq > approx-mapping.paf

You can ask minimap2 to generate CIGAR at the cg tag of PAF with:

minimap2 -c ref.fa query.fq > alignment.paf

or to output alignments in the [SAM format][sam]:

minimap2 -a ref.fa query.fq > alignment.sam

Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You don't need to convert between FASTA and FASTQ or decompress gzip'd files first.

For the human reference genome, minimap2 takes a few minutes to generate a minimizer index for the reference before mapping. To reduce indexing time, you can optionally save the index with option -d and replace the reference sequence file with the index file on the minimap2 command line:

minimap2 -d ref.mmi ref.fa                     # indexing
minimap2 -a ref.mmi reads.fq > alignment.sam   # alignment

Importantly, it should be noted that once you build the index, indexing parameters such as -k, -w, -H and -I can't be changed during mapping. If you are running minimap2 for different data types, you will probably need to keep multiple indexes generated with different parameters. This makes minimap2 different from BWA which always uses the same index regardless of query data types.

<a name="cases"></a>Use cases

Minimap2 uses the same base algorithm for all applications. However, due to the different data types it supports (e.g. short vs long reads; DNA vs mRNA reads), minimap2 needs to be tuned for optimal performance and accuracy. It is usually recommended to choose a preset with option -x, which sets multiple parameters at the same time. The default setting is the same as map-ont.

<a name="map-long-genomic"></a>Map long noisy genomic reads

minimap2 -ax map-pb  ref.fa pacbio-reads.fq > aln.sam   # for PacBio CLR reads
minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam      # for Oxford Nanopore reads
minimap2 -ax map-iclr ref.fa iclr-reads.fq > aln.sam    # for Illumina Complete Long Reads

The difference between map-pb and map-ont is that map-pb uses homopolymer-compressed (HPC) minimizers as seeds, while map-ont uses ordinary minimizers as seeds. Empirical evaluation suggests HPC minimizers improve performance and sensitivity when aligning PacBio CLR reads, but hurt when aligning Nanopore reads. map-iclr uses an adjusted alignment scoring matrix that accounts for the low overall error rate in the reads, with transversion errors being less frequent than transitions.

<a name="map-long-splice"></a>Map long mRNA/cDNA reads

minimap2 -ax splice:hq -uf ref.fa iso-seq.fq > aln.sam       # PacBio Iso-seq/traditional cDNA
minimap2 -ax splice ref.fa nanopore-cdna.fa > aln.sam        # Nanopore 2D cDNA-seq
minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam  # Nanopore Direct RNA-seq
minimap2 -ax splice --splice-flank=no SIRV.fa SIRV-seq.fa    # mapping against SIRV control

There are different long-read RNA-seq technologies, including tranditional full-length cDNA, EST, PacBio Iso-seq, Nanopore 2D cDNA-seq and Direct RNA-seq. They produce data of varying quality and properties. By default, -x splice assumes the read orientation relative to the transcript strand is unknown. It tries two rounds of alignment to infer the orientation and write the strand to the ts SAM/PAF tag if possible. For Iso-seq, Direct RNA-seq and tranditional full-length cDNAs, it would be desired to apply -u f to force minimap2 to consider the forward transcript strand only. This speeds up alignment with slight improvement to accuracy. For noisy Nanopore Direct RNA-seq reads, it is recommended to use a smaller k-mer size for increased sensitivity to the first or the last exons.

Minimap2 rates an alignment by the score of the max-scoring sub-segment, excluding introns, and marks the best alignment as primary in SAM. When a spliced gene also has unspliced pseudogenes, minimap2 slightly prefers the spliced alignment. By default, minimap2 outputs up to five secondary alignments (i.e. likely pseudogenes in the context of RNA-seq mapping). This can be tuned with option -N.

For long RNA-seq reads, minimap2 may produce chimeric alignments potentially caused by gene fusions/structural variations or by an intron longer than the max intr

Minimap2

Install / Use

README