<h1 align="center"><img width="300px" src="img/sv2.png"/></h1> <h1 align="center">pbsv</h1> <p align="center">PacBio structural variant (SV) calling and analysis tools</p>

pbsv is a suite of tools to call and analyze structural variants in diploid genomes from PacBio single molecule real-time sequencing (SMRT) reads. The tools power the Structural Variant Calling analysis workflow in PacBio's SMRT Link GUI.

pbsv calls insertions, deletions, inversions, duplications, and translocations. Both single-sample calling and joint (multi-sample) calling are provided. pbsv is most effective for:

insertions 20 bp to 10 kb
deletions 20 bp to 100 kb
inversions 200 bp to 10 kb
duplications 20 bp to 10 kb
translocations between different chromosomes or further than 100kb apart on a single chromosome

Availability

Latest version can be installed via bioconda package pbsv.

Please refer to our official pbbioconda page for information on Installation, Support, License, Copyright, and Disclaimer.

Latest Version

Version 2.10.0: Full changelog here

Workflow

The general pbsv workflow is:

Align PacBio reads to a reference genome, per movie. (.subreads.bam/.ccs.fastq to .bam)
Discover signatures of structural variation. (.bam to .svsig.gz)
Call structural variants and assign genotypes, all samples. (.svsig.gz to .vcf)

1. Align PacBio reads to a reference genome

The recommended aligner is pbmm2 that can be installed via conda install pbmm2. For each movie (.subreads.bam or .ccs.fq) align records to a reference genome (ref.fa).

Subreads BAM input:

pbmm2 align ref.fa movie1.subreads.bam ref.movie1.bam --sort --median-filter --sample sample1

CCS BAM input:

pbmm2 align ref.fa movie1.ccs.bam ref.movie1.bam --sort --preset CCS --sample sample1

CCS FASTQ input:

pbmm2 align ref.fa movie1.Q20.fastq ref.movie1.bam --sort --preset CCS --sample sample1 --rg '@RG\tID:movie1'

The sample name, stored in the SM tag of the read groups, associates aligned reads with a particular sample. It is required for downstream joint calling.

2. Discover signatures of structural variation

For each aligned BAM or set of aligned BAMs, identify signatures of structural variation. This reduces all aligned reads to those that are relevant to calling structural variants. The signatures are stored in a .svsig.gz file.

pbsv discover ref.movie1.bam ref.sample1.svsig.gz
pbsv discover ref.movie2.bam ref.sample2.svsig.gz

# optionally index svsig.gz to allow random access via `pbsv call -r`
tabix -c '#' -s 3 -b 4 -e 4 ref.sample1.svsig.gz
tabix -c '#' -s 3 -b 4 -e 4 ref.sample2.svsig.gz

It is highly recommended to provide one tandem repeat annotation .bed file of your reference to pbsv discover via --tandem-repeats. This increases sensitivity and recall. Feel free to use the following for human SV calling: GRCh38 or hs37d5/hg19.

Sample names are transferred from the RG headers to the .svsig.gz file.

3. Call structural variants and assign genotypes

Call structural variants from structural variant signatures, jointly for all samples of interest. One or more .svsig.gz files are accepted, including multiple .svsig.gz for a single sample and/or svsig.gz for multiple samples. If the input is CCS reads, please add --ccs to the following call:

pbsv call ref.fa ref.sample1.svsig.gz ref.sample2.svsig.gz ref.var.vcf

Variant calls for all samples are output in a single .vcf file.

Parallel processing per chromosome

For large genomes with high sequencing coverage, it is recommended to process chromosomes separately. After aligning each movie:

Generate separate `.svsig.gz` files per chromosome

for i in $(samtools view -H hg38.movie1.bam | grep '^@SQ' | cut -f2 | cut -d':' -f2); do
    pbsv discover --region $i hg38.movie1.bam hg38.sample1.$i.svsig.gz
done

Call SVs

# -j is number of threads
pbsv call -j 8 hg38.fa hg38.sample1.*.svsig.gz hg38.sample1.vcf

Algorithm Overview and Advanced Parameters

Deletions

Cluster options used during pbsv call:

SV Signature Cluster Options:
  --cluster-max-length-perc-diff   Do not cluster signatures with difference in length > P%. [25]
  --cluster-max-ref-pos-diff       Do not cluster signatures > N bp apart in reference. [200]

Number of flanks used for consensus generation:

Consensus Options:
  -x,--max-consensus-coverage      Limit to N reads for variant consensus. [20]

Split deletions: Deletions that are not fully aligned using the D cigar are recovered up to a size of 100kb. Deletions greater than 100kb are currently called as translocations.

Insertions

Insertion calling workflow is identical to the above described deletion workflow, except for one additional criteria, the inserted sequence similarity check during clustering:

SV Signature Cluster Options:
  --cluster-min-basepair-perc-id   Do not cluster signatures with basepair identity < P%. [10]

Split insertions: Insertions can be recovered from split split reads, if the reference overlap of those split reads is less than -k,--max-skip-split [100] and if the query distance is larger than 500 bp.

The explicit upper limit on the insertion size can be adjusted in pbsv call, but be aware that predicting larger insertions will consume more memory!

  --max-ins-length   Ignore insertions with length > N bp. ["15K"]

Inversions

An inversion signature is detected if a single read is split into three alignments with different orientations / strands, either +-+ or -+-. Gaps in the reference, as depicted as A, are not constrained. The maximum permitted reference overlap B, between consecutive alignments, is configured in pbsv discover:

  -k,--max-skip-split   Ignore alignment pairs separated by > N bp of a read or reference. ["100"]

Clustering is performed on the inverted segment and uses the same criteria as deletion clustering.

The VCF call marks the most likely position and size of the inverted segment, as shown in this IGV screenshot:

Translocations

Translocations are identified using breakends of individual split reads with a query skip of less than -k,--max-skip-split [100]. The minimum reads that support a BND (total over all samples), can be defined with --call-min-bnd-reads-all-samples [2]. All four breakend combinations are supported:

Duplications

From split reads

Duplications can be identified from the following split-read signatures:

whereas a duplication has to include on fully spanning read to be flagged as PASS; otherwise, it is filtered with NotFullySpanned.

The maximum size can be configures in pbsv call:

  --max-dup-length   Ignore duplications with length > N bp. ["1M"]

From insertion

In addition, each insertion is tested against its neighboring reference regions and labeled as a duplication if it matches the reference with 80%. Caution: Activating duplication calling has a negative impact when comparing to GIAB, as GIAB labels everything as insertion or deletion.

Calling and Genotyping

An variant is output if it passes all of the following criteria:

supported by at least -A,--call-min-reads-all-samples [3] reads total across samples,
supported by at least -O,--call-min-reads-one-samples [3] in a sample,
supported by at least -S,--call-min-read-per-one-sample [1] percent of reads in a sample,
supported by at least -B,--call-min-bnd-reads-all-samples [2] reads total across samples for BNDs,
supported by at least -P,--call-min-reads-per-strand-all-samples [20] reads per strand total across samples,
assigned a non-reference genotype in at least one sample; a sample is assigned a non-reference genotype for a variant if at least --gt-min-reads [1] reads support the variant.

For CCS input, using the --ccs mode in pbsv call, thresholds are relaxed to -A 3 -B 2 -O 3 -S 0 -P 10.

Filtering

The VCF filter column is

PASS
NearReferenceGap: variant is near (< --filter-near-reference-gap [1K]) from a gap (run of >= 50 Ns in the reference assembly)
Decoy: variant involves a decoy sequence, where the chromosome name contains decoy, hs37d5, or hs38d1
NearContigEnd: variant is near (< --filter-near-contig-end [1K]) from a contig end
InsufficientStrandEvidence: variant is not supported by at least (--call-min-reads-per-strand-all-samples [1]) reads in forward and reverse orientation
NotFullySpanned: duplication variant does not have any fully spanning reads

Performance benchmarks

Using the publicly available HG002 15kb CCS dataset, we are tracking pbsv performance with respect to the genome in a bottle annotation version 0.6.

Step-by-step reproducibility and benchmarks can be found at github.com/PacificBiosciences/sv-benchmark.

FAQ

To where do I report bugs and ask questions about the pre-release version of `pbsv`?

Please refer to our official pbbioconda page to report bugs and ask questions.

Pbsv

Install / Use

README

Availability

Latest Version

Workflow

1. Align PacBio reads to a reference genome

2. Discover signatures of structural variation

3. Call structural variants and assign genotypes

Parallel processing per chromosome

Generate separate `.svsig.gz` files per chromosome

Call SVs

Algorithm Overview and Advanced Parameters

Deletions

Insertions

Inversions

Translocations

Duplications

From split reads

From insertion

Calling and Genotyping

Filtering

Performance benchmarks

FAQ

To where do I report bugs and ask questions about the pre-release version of `pbsv`?

Pbsv

Install / Use

README

Availability

Latest Version

Workflow

1. Align PacBio reads to a reference genome

2. Discover signatures of structural variation

3. Call structural variants and assign genotypes

Parallel processing per chromosome

Generate separate .svsig.gz files per chromosome

Call SVs

Algorithm Overview and Advanced Parameters

Deletions

Insertions

Inversions

Translocations

Duplications

From split reads

From insertion

Calling and Genotyping

Filtering

Performance benchmarks

FAQ

To where do I report bugs and ask questions about the pre-release version of pbsv?

Generate separate `.svsig.gz` files per chromosome

To where do I report bugs and ask questions about the pre-release version of `pbsv`?