<h1 align="center"><img width="200px" src="img/pbmm2.png"/></h1> <h1 align="center">pbmm2</h1> <p align="center">A minimap2 SMRT wrapper for PacBio data: native PacBio data in ⇨ native PacBio BAM out.</p>

pbmm2 is a SMRT C++ wrapper for minimap2's C API. Its purpose is to support native PacBio in- and output, provide sets of recommended parameters, generate sorted output on-the-fly, and postprocess alignments. Sorted output can be used directly for polishing using GenomicConsensus, if BAM has been used as input to pbmm2. Benchmarks show that pbmm2 outperforms BLASR in sequence identity, number of mapped bases, and especially runtime. pbmm2 is the official replacement for BLASR.

Binary Availability

The latest version can be installed via the bioconda package pbmm2.

Please refer to our official pbbioconda page for information on Installation, Support, License, Copyright, and Disclaimer.

Latest Version

Version 26.1.99: Full changelog here

Usage

pbmm2 offers the following tools:

Tools:
    index      Index reference and store as .mmi file
    align      Align PacBio reads to reference sequences

Typical workflows

A. Generate index file for reference and reuse it to align reads
  $ pbmm2 index ref.fasta ref.mmi --preset SUBREAD
  $ pbmm2 align ref.mmi movie.subreads.bam ref.movie.bam --preset SUBREAD

B. Align reads and sort on-the-fly, with 4 alignment and 2 sort threads
  $ pbmm2 align ref.fasta movie.subreads.bam ref.movie.bam --preset SUBREAD --sort -j 4 -J 2

C. Align reads, sort on-the-fly, and create PBI
  $ pbmm2 align ref.fasta movie.subreadset.xml ref.movie.alignmentset.xml --preset SUBREAD --sort

D. Omit output file and stream BAM output to stdout
  $ pbmm2 align hg38.mmi movie1.subreadset.xml --preset SUBREAD | samtools sort > hg38.movie1.sorted.bam

E. Align CCS fastq input and sort output
  $ pbmm2 align ref.fasta movie.Q20.fastq ref.movie.bam --sort --rg '@RG\tID:myid\tSM:mysample'

Index

Indexing is optional, but recommended if you use the same reference with the same --preset multiple times.

Usage: pbmm2 index [options] <ref.fa|xml> <out.mmi>

Notes:

If you use an index file, you cannot override parameters -k, -w, or -u in pbmm2 align!
Minimap2 parameter -H (homopolymer-compressed k-mer) is always on for SUBREAD and UNROLLED presets and can be disabled with -u.
You can also use existing minimap2 .mmi files in pbmm2 align.

Align

The output argument is optional. If not provided, BAM output is streamed to stdout.

Usage: pbmm2 align [options] <ref.fa|xml|mmi> <in.bam|xml|fa|fq> [out.aligned.bam|xml]

Alignment Parallelization

The number of alignment threads can be specified with -j,--num-threads. If not specified, the maximum number of threads will be used, minus one thread for BAM IO and minus the number of threads specified for sorting.

Sorting

Sorted output can be generated using --sort.

Percentage: By default, 25% of threads specified with -j, maximum 8, are used for sorting. Example: --sort -j 12, 9 threads for alignment, 3 threads for sorting.

Manual override: To override the default percentage, -J,--sort-threads defines the explicit number of threads used for on-the-fly sorting. Example: --sort -j 12 -J 4, 12 threads for alignment, 4 threads for sorting.
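The two thread-splitting rules above can be sketched as a small shell helper. This is only an illustration of the arithmetic in the text, not pbmm2's implementation: the function name `split_threads` is hypothetical, rounding down is an assumption, and the behavior for `-j` values below 4 is not described in this README.

```shell
# Hypothetical helper: print "<alignment threads> <sort threads>"
# usage: split_threads <-j value> [<-J value>]
split_threads() {
  total=$1
  sort_override=${2:-}
  if [ -n "$sort_override" ]; then
    # Manual override: -j threads align, -J threads sort
    align_threads=$total
    sort_threads=$sort_override
  else
    # Default: 25% of -j for sorting (rounded down, an assumption), at most 8
    sort_threads=$(( total / 4 ))
    if [ "$sort_threads" -gt 8 ]; then
      sort_threads=8
    fi
    align_threads=$(( total - sort_threads ))
  fi
  echo "$align_threads $sort_threads"
}

split_threads 12     # prints: 9 3
split_threads 12 4   # prints: 12 4
```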

The memory allocated per sort thread can be defined with -m,--sort-memory, accepting suffixes M,G.

Temporary files during sorting are stored in the current working directory, unless explicitly defined with environment variable TMPDIR. The path used for temporary files is also printed if --log-level DEBUG is set.

Benchmarks on human data have shown that 4 sort threads are recommended, and that no more than 8 threads can be effectively leveraged, even with 70 cores used for alignment. To avoid disk IO pressure, it is better to give more memory to each of a few sort threads than less memory to each of many sort threads.

Input file types

The following compatibility table shows allowed input file types, output file types, compatibility with GenomicConsensus (GC), and the recommended --preset choice. More info about our dataset XML specification.

| Input                              | Output                              | GC  | Preset |
| ---------------------------------- | ----------------------------------- | :-: | :----: |
| .bam (aligned or unaligned)        | .bam                                |  Y  |        |
| .fasta / .fa / .fasta.gz / .fa.gz  | .bam                                |  N  |        |
| .fastq / .fq / .fastq.gz / .fq.gz  | .bam                                |  N  |        |
| .Q20.fastq / Q20.fastq.gz          | .bam                                |  N  | CCS    |
| bam.fofn                           | .bam                                |  N  |        |
| fasta.fofn                         | .bam                                |  N  |        |
| fastq.fofn                         | .bam                                |  N  |        |
| .subreadset.xml                    | .bam \| .alignmentset.xml           |  Y  |        |
| .consensusreadset.xml              | .bam \| .consensusalignmentset.xml  |  Y  | CCS    |
| .transcriptset.xml                 | .bam \| .transcriptalignmentset.xml |  Y  | ISOSEQ |
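The preset column of the table can be expressed as a simple filename lookup. The sketch below is illustrative only: `recommend_preset` is a hypothetical helper, not part of pbmm2, and it prints `unspecified` for inputs where the table lists no preset.

```shell
# Hypothetical helper: map an input filename to the preset listed in the table
recommend_preset() {
  case "$1" in
    *.Q20.fastq|*.Q20.fastq.gz|*.consensusreadset.xml)
      echo CCS ;;
    *.transcriptset.xml)
      echo ISOSEQ ;;
    *)
      # The table leaves the preset blank for all other input types
      echo unspecified ;;
  esac
}

recommend_preset movie.Q20.fastq.gz        # prints: CCS
recommend_preset movie.transcriptset.xml   # prints: ISOSEQ
```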

FASTA/Q input

In addition to native PacBio BAM input, reads can also be provided in FASTA and FASTQ formats, as shown above.

With FASTA/Q input, option --rg sets the read group. Example call:

pbmm2 align hg38.fasta movie.Q20.fastq hg38.movie.bam --rg '@RG\tID:myid\tSM:mysample'

All three reference file formats .fasta, .referenceset.xml, and .mmi can be combined with FASTA/Q input.

Multiple input files

pbmm2 supports the .fofn file type (File Of File Names). All files listed in one .fofn must be of the same data type. Supported are .fofn files listing FASTA, FASTQ, or BAM files.

Examples:

echo "m64001_190131_212703.Q20.fastq.gz" > myfiles.fofn
echo "m64001_190228_200412.Q20.fastq.gz" >> myfiles.fofn
pbmm2 align hg38.fasta myfiles.fofn hg38.myfiles.bam --rg '@RG\tID:myid\tSM:mysample'

ls *.subreads.bam > mymovies.fofn
pbmm2 align hg38.fasta mymovies.fofn hg38.mymovies.bam
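Because a .fofn must list files of a single data type, it can help to check a list before aligning. The following is a minimal sketch under that assumption. `fofn_type` is a hypothetical helper, not a pbmm2 command, and it classifies entries by file extension only.

```shell
# Hypothetical helper: print the common type (bam, fasta, fastq) of a .fofn,
# or fail if the list is empty, mixed, or contains an unsupported extension.
fofn_type() {
  kind=""
  while IFS= read -r path || [ -n "$path" ]; do
    # Skip blank lines
    if [ -z "$path" ]; then
      continue
    fi
    case "$path" in
      *.bam) t=bam ;;
      *.fasta|*.fa|*.fasta.gz|*.fa.gz) t=fasta ;;
      *.fastq|*.fq|*.fastq.gz|*.fq.gz) t=fastq ;;
      *) echo "unsupported entry: $path" >&2; return 1 ;;
    esac
    if [ -n "$kind" ] && [ "$kind" != "$t" ]; then
      echo "mixed types in $1: $kind and $t" >&2
      return 1
    fi
    kind=$t
  done < "$1"
  if [ -z "$kind" ]; then
    echo "empty list: $1" >&2
    return 1
  fi
  echo "$kind"
}
```

A call such as `fofn_type myfiles.fofn` would print `fastq` for the first example above.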

FAQ

Which minimap2 version is used?

pbmm2 ≥v1.13.0: minimap2 v2.26
pbmm2 <v1.13.0: minimap2 v2.15

When are `pbi` files created?

Whenever the output is of type xml, a pbi file is generated.

When are BAM index files created?

For sorted output via --sort, a bai file is generated by default. You can switch to csi for larger genomes with --bam-index CSI or skip index generation completely with --bam-index NONE.

What are parameter sets and how can I override them?

By default, pbmm2 uses recommended parameter sets to simplify the plethora of possible combinations. For this, we currently offer:

SUBREAD
CCS or HIFI (default)
ISOSEQ
UNROLLED

Parameter sets vary based on pbmm2 version and are explained in --help.

If you want to override any of the parameters of your chosen set, please use the respective options:

  -k   k-mer size (no larger than 28). [-1]
  -w   Minimizer window size. [-1]
  -u   Disable homopolymer-compressed k-mer (compression is active for SUBREAD & UNROLLED presets).
  -A   Matching score. [-1]
  -B   Mismatch penalty. [-1]
  -z   Z-drop score. [-1]
  -Z   Z-drop inversion score. [-1]
  -r   Bandwidth used in chaining and DP-based alignment. [-1]
  -g   Stop chain elongation if there are no minimizers in N bp. [-1]

For the piece-wise linear gap penalties, use the following overrides, where a k-long gap costs min{o+ke,O+kE}:

  -o,--gap-open-1     Gap open penalty 1. [-1]
  -O,--gap-open-2     Gap open penalty 2. [-1]
  -e,--gap-extend-1   Gap extension penalty 1. [-1]
  -E,--gap-extend-2   Gap extension penalty 2. [-1]
  -L,--lj-min-ratio   Long join flank ratio. [-1]
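To make the piece-wise linear gap cost min{o+ke,O+kE} concrete, here is a small shell sketch. The function name `gap_cost` is hypothetical, and the penalty values in the calls are illustrative only. They are not taken from any pbmm2 preset.

```shell
# Hypothetical helper: cost of a k-long gap under min{o + k*e, O + k*E}
# usage: gap_cost <k> <o> <e> <O> <E>
gap_cost() {
  c1=$(( $2 + $1 * $3 ))   # o + k*e
  c2=$(( $4 + $1 * $5 ))   # O + k*E
  if [ "$c1" -le "$c2" ]; then
    echo "$c1"
  else
    echo "$c2"
  fi
}

gap_cost 10 4 2 24 1   # prints: 24 (short gap, o + k*e is cheaper)
gap_cost 30 4 2 24 1   # prints: 54 (long gap, O + k*E is cheaper)
```

With these illustrative values, the two branches cross at k = 20. Below that length the first pair of penalties applies, above it the second.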

For ISOSEQ, you can override additional parameters:

  -G                  Max intron length (changes -r). [-1]
  -C                  Cost for a non-canonical GT-AG splicing. [-1]
  --no-splice-flank   Do not prefer splice flanks GT-AG.

If you have suggestions for our default parameters or ideas for a new parameter set, please open a GitHub issue!

What other special parameters are used implicitly?

To achieve alignment behavior similar to BLASR, we implicitly use the following minimap2 parameters:

soft clipping with -Y
long cigars for tag CG with -L
X/= cigars instead of M with --eqx
no overlapping query intervals with repeated matches trimming
no secondary alignments are produced by default (overridable with --secondary)

What sequence identity filters does pbmm2 offer?

The idea of removing spurious or low-quality alignments is straightforward, but the exact definition of a threshold is tricky and varies between tools and applications. More on sequence identity from Heng Li.
pbmm2 offers the following filters:

1. --min-concordance-perc, legacy mapped concordance filter, inherited from its predecessor BLASR (hidden option)
2. --min-id-perc, a sequence identity percentage filter defined as the BLAST identity (hidden option)
3. --min-gap-comp-id-perc, a gap-compressed sequence identity filter counting each insertion or deletion as a single event (default)
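The difference between BLAST identity and gap-compressed identity can be illustrated on a CIGAR string with X/= operations. The sketch below assumes the commonly used definitions: BLAST identity is matches divided by all alignment columns, and gap-compressed identity counts each insertion or deletion run as one event. pbmm2's exact computation may differ, and `identity_from_cigar` is a hypothetical helper.

```shell
# Hypothetical helper: print "<BLAST identity %> <gap-compressed identity %>"
# for a CIGAR string that uses = and X (as produced with --eqx).
identity_from_cigar() {
  printf '%s\n' "$1" | LC_ALL=C awk '{
    s = $0
    # Consume one "<length><op>" token at a time from the front of the string
    while (match(s, /^[0-9]+[=XIDSHNMP]/)) {
      len = substr(s, 1, RLENGTH - 1) + 0
      op = substr(s, RLENGTH, 1)
      if (op == "=") m += len                  # matching bases
      else if (op == "X") x += len             # mismatching bases
      else if (op == "I" || op == "D") {       # gap run
        g += 1                                 # one event per run
        gl += len                              # total gap bases
      }
      # Clipping and other operations are ignored in this sketch
      s = substr(s, RLENGTH + 1)
    }
    if (m + x + g == 0) exit 1
    printf "%.2f %.2f\n", 100 * m / (m + x + gl), 100 * m / (m + x + g)
  }'
}

identity_from_cigar '10=1X5=2I3=3D4='   # prints: 78.57 88.00
```

In this example there are 22 matches, 1 mismatch, and 5 gap bases in 2 gap runs, so the BLAST identity is 22/28 and the gap-compressed identity is 22/25.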

By default, (3) is set to 70%, (1) and (2) are deactivated. The problem wit
