Longshot
diploid SNV caller for error-prone reads
Longshot is a variant calling tool for diploid genomes that uses long, error-prone reads such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore Technologies (ONT) reads. It takes an aligned BAM/CRAM file as input and outputs a phased VCF file with variants and haplotype information. It can also genotype and phase input VCF files, and it can write haplotype-separated BAM files for downstream analysis. Currently, it only calls single nucleotide variants (SNVs), but it can genotype indels if they are given in an input VCF.
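For example, to genotype and phase a set of known variants using the -v option described under command line options below, the input VCF must be bgzipped and tabix-indexed first. A minimal sketch, assuming htslib's bgzip and tabix are installed; the file names are placeholders:
bgzip known_variants.vcf                  # compress with bgzip (from htslib)
tabix -p vcf known_variants.vcf.gz        # create a tabix index
longshot --bam sample.bam --ref genome.fa --out genotyped.vcf -v known_variants.vcf.gz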
citation
If you use Longshot, please cite the publication:
Edge, P. and Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nature Communications 10, 4660 (2019). https://doi.org/10.1038/s41467-019-12493-y
supported operating systems
Longshot has been tested on Ubuntu 16.04 and 18.04, CentOS 6.6, Manjaro Linux 17.1.11, and Mac OS 10.14.2 Mojave. It should work on any Linux-based system that has Rust and Cargo installed.
dependencies
- rust >= 1.40.0
- zlib >= 1.2.11
- xz >= 5.2.3
- clangdev >= 7.0.1
- gcc >= 7.3.0
- libc-dev
- make
- various rust dependencies (automatically managed by cargo)
(older versions may work but have not been tested)
installation
installation using Bioconda
It is recommended to install Longshot using Bioconda:
conda install longshot
This method supports Linux and Mac. If you do not have Bioconda set up, first install Miniconda (or Anaconda) using the installers from the conda documentation.
The Bioconda channel can then be added using these commands:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
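Once the channels are configured, one option is to install Longshot into a dedicated conda environment and verify the installation; a sketch, with an arbitrary environment name:
conda create -n longshot-env longshot   # install into a new environment
conda activate longshot-env
longshot --version                      # verify the installation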
manual installation using apt for dependencies (Ubuntu 18.04)
If you are using Ubuntu 18.04, you can install the dependencies using apt. Then, the Rust cargo package manager is used to compile Longshot.
sudo apt-get install cargo zlib1g-dev xz-utils \
libclang-dev clang cmake build-essential curl git # install dependencies
git clone https://github.com/pjedge/longshot # clone the Longshot repository
cd longshot # change directory
cargo install --path . # install Longshot
export PATH=$PATH:/home/$USER/.cargo/bin # add cargo binaries to path
Installation should take around 4 minutes on a typical desktop machine and will use between 400 MB (counting cargo) and 1.2 GB (counting all dependencies) of disk space.
It is recommended to add the line export PATH=$PATH:/home/$USER/.cargo/bin to the end of your ~/.bashrc file so that the longshot binary is in the PATH for future shell sessions.
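For example:
echo 'export PATH=$PATH:/home/$USER/.cargo/bin' >> ~/.bashrc   # persist the PATH change
source ~/.bashrc                                               # reload for the current shell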
usage
After installation, run the longshot binary as follows:
$ longshot [FLAGS] [OPTIONS] --bam <BAM/CRAM> --ref <FASTA> --out <VCF>
execution on an example dataset
The directory example_data contains a simulated toy dataset that can be used to test out Longshot:
- Reference genome containing 3 contigs, each 200 kb in length (example_data/genome.fa)
- 30x coverage simulated PacBio reads generated using SimLoRD (example_data/pacbio_reads_30x.bam)
- The 714 "true" variants for validation (example_data/ground_truth_variants.vcf)
Run Longshot on the example data as follows:
longshot --bam example_data/pacbio_reads_30x.bam --ref example_data/genome.fa --out example_data/longshot_output.vcf
Execution should take around 5 to 10 seconds on a typical desktop machine. The output can be compared to ground_truth_variants.vcf for accuracy.
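As a quick sanity check, assuming standard Unix tools are available, the number of called variant records can be counted and should be close to the 714 true variants:
grep -vc '^#' example_data/longshot_output.vcf   # count non-header (variant) lines
For a more rigorous comparison against ground_truth_variants.vcf, a dedicated benchmarking tool such as rtg vcfeval or hap.py can be used.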
command line options
$ longshot --help
Longshot: variant caller (SNVs) for long-read sequencing data
USAGE:
longshot [FLAGS] [OPTIONS] --bam <BAM/CRAM> --ref <FASTA> --out <VCF>
FLAGS:
-A, --auto_max_cov Automatically calculate mean coverage for region and set max coverage to mean_coverage +
5*sqrt(mean_coverage). (SLOWER)
-S, --stable_alignment Use numerically-stable (logspace) pair HMM forward algorithm. Is significantly slower but
may be more accurate. Tests have shown this not to be necessary for highly error prone
reads (PacBio CLR).
-F, --force_overwrite If output files (VCF or variant debug directory) exist, delete and overwrite them.
-x, --max_alignment Use max scoring alignment algorithm rather than pair HMM forward algorithm.
-n, --no_haps Don't call HapCUT2 to phase variants.
--output-ref print reference genotypes (non-variant), use this option only in combination with -v
option.
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-b, --bam <BAM> sorted, indexed BAM file with error-prone reads (CRAM files also supported)
-f, --ref <FASTA> indexed FASTA reference that BAM file is aligned to
-o, --out <VCF> output VCF file with called variants.
-r, --region <string> Region in format <chrom> or <chrom:start-stop> in which to call variants
(1-based, inclusive).
-v, --potential_variants <VCF> Genotype and phase the variants in this VCF instead of using pileup
method to find variants. NOTES: VCF must be gzipped and tabix indexed or
contain contig information. Use with caution because excessive false
potential variants can lead to inaccurate results. Every variant is used
and only the allele fields are considered -- Genotypes, filters,
qualities etc are ignored. Indel variants will be genotyped but not
phased. Structural variants (length > 50 bp) are currently not supported.
-O, --out_bam <BAM> Write new bam file with haplotype tags (HP:i:1 and HP:i:2) for reads
assigned to each haplotype, any existing HP and PS tags are removed
-c, --min_cov <int> Minimum coverage (of reads passing filters) to consider position as a
potential SNV. [default: 6]
-C, --max_cov <int> Maximum coverage (of reads passing filters) to consider position as a
potential SNV. [default: 8000]
-q, --min_mapq <int> Minimum mapping quality to use a read. [default: 20]
-a, --min_allele_qual <float> Minimum estimated quality (Phred-scaled) of allele observation on read to
use for genotyping/haplotyping. [default: 7.0]
-y, --hap_assignment_qual <float> Minimum quality (Phred-scaled) of read->haplotype assignment (for read
separation). [default: 20.0]
-Q, --potential_snv_cutoff <float> Consider a site as a potential SNV if the original PHRED-scaled QUAL
score for 0/0 genotype is below this amount (a larger value considers
more potential SNV sites). [default: 20.0]
-e, --min_alt_count <int> Require a potential SNV to have at least this many alternate allele
observations. [default: 3]
-E, --min_alt_frac <float> Require a potential SNV to have at least this fraction of alternate
allele observations. [default: 0.125]
-L, --hap_converge_delta <float> Terminate the haplotype/genotype iteration when the relative change in
log-likelihood falls below this amount. Setting a larger value results in
faster termination but potentially less accurate results. [default:
0.0001]
-l, --anchor_length <int> Length of indel-free anchor sequence on the left and right side of read
realignment window. [default: 6]
-m, --max_snvs <int> Cut off variant clusters after this many variants. 2^m haplotypes must be
aligned against per read for a variant cluster of size m. [default: 3]
-W, --max_window <int> Maximum "padding" bases on either side of variant realignment window
[default: 50]
-I, --max_cigar_indel <int> Throw away a read-variant during allelotyping if there is a CIGAR indel
(I/D/N) longer than this amount in its window. [default: 20]
-B, --band_width <Band width> Minimum width of alignment band. Band will increase in size if sequences
are different lengths. [default: 20]
-D, --density_params <string> Parameters to flag a variant as part of a "dense cluster". Format
<n>:<l>:<gq>. If there are at least n variants within l base pairs with
genotype quality >=gq, then these variants are flagged as "dn". [default:
10:500:50]
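As an illustrative invocation combining several of these options (not taken from the Longshot documentation; the file names, contig name, and region are placeholders), the following calls and phases variants in a region and writes a haplotype-tagged BAM, which can then be split by haplotype using the -d tag filter of samtools (available in samtools 1.12 and newer):
longshot --bam sample.bam --ref genome.fa --out sample.vcf \
    --region chr1:1000000-2000000 --out_bam sample.tagged.bam  # haplotype-tagged output
samtools index sample.tagged.bam
samtools view -b -d HP:1 sample.tagged.bam > sample.hap1.bam   # reads assigned to haplotype 1
samtools view -b -d HP:2 sample.tagged.bam > sample.hap2.bam   # reads assigned to haplotype 2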