GBSapp
Automated Pipeline for Variant/Haplotype Calling and Filtering
Install / Use
/learn @bodeolukolu/GBSappREADME
Introduction
GBSapp v2.2 is an automated pipeline for variant calling and filtering. The pipeline intuitively integrates existing/novel best practices, some of which can be controlled by user-defined parameters. It optimizes memory and speed at various steps of the pipeline, for example, a novel approach performs compression and decompression of unique reads before and after read alignment, respectively. Summary reports and visualizations allow for QC at each step of the pipeline.
For questions, bugs, and suggestions, please contact bolukolu@utk.edu.
Features
- Easy use and designed for biologist.
- Dosage-based variant calling and filtering.
- Fully-automated: “walk-away” and “walk-through” mode.
- Allows for use of haplomes (haplotype-resolved), subgenomes (haploid), and pan-genomes (haploid or haplotype-resolved) references.
- Minimizes excess heterozygosity and allele dropout.
- Variant calling implemented for up to ploidy of 8.
- Input data: shotgun WGS, reduced representation sequence (e.g., OmeSeq-qRRS, GBS, ddRADseq), RNAseq and multiplexed-PCR data.
- Can subsample shotgun whole genome data for variant calling, i.e. in silico reduced representation sequencing (RRS).
- Parallelization of job on multiple compute cluster nodes (spark cluster infrastructure not required).
- Splice-aware aligner (STAR) allows for RNAseq data as input (recommended only for haploid or diploid genomes).
- Generates variant sequence context (useful for applications such as oligo/primer design & sequenced-based phylogenetic analysis).
- Variant calling delineates SNP from uniquely mapped and paralogs.
- Increase speed of variant calling based on dynamic downsampling (avoids allele dropout due to biased downsampling).
- Fast alignment due to joint-alignment method.
- Visualizations for report and QC.
- calling micro- and macro-haplotypes
- estimating ploidy level (plots ploidy profile along chromosomes).
Contents
- Installation
- Usage
- Related Software
- Select Article Referencing GBSapp
- Acknowledgment
- Troubleshooting
- Versioning
- License
Installation
- Currently, GBSapp is only available for unix-based systems.
- Clone or download the Git repository to your desired folder.
git clone https://github.com/bodeolukolu/GBSapp.git
- Installation occurs automatically the first time you run the pipeline.
- To install dependencies without running a job:
GBSapp_dir/GBSapp install
- With the exception of R, and Python (v2.6 or greater) all dependencies are installed to a local directory within GBSapp.
Dependencies:<br />
Installed on first run of pipeline:
-----------------------------------
minnimap2, STAR, bwa, samtools, picard, bcftools, GATK, java, R-ggplot2, CMplot, and R-AGHmatrix
Pre-install before running GBSapp:
----------------------------------
- R
- python v3.7 or greater
Usage
Basic Usage
The project directory should contain the following files and directories:
- config file: specifies run parameters (for details: GBSapp_vx.x/examples/config).
- samples directory: contains “se” and/or “pe” (“paired” and “single” name format acceptable) fastq file(s). Paired-end (pe) sample filenames might require formatting so that they end in “_R1.fastq” or “.R1.fastq” and “_R2.fastq” or “.R2.fastq” (for details: GBSapp_vx.x/misc/format_fastq_filenames.txt).
- refgenomes directory: contains fasta file(s) of the reference genome.<br />* <u>Genomes with subgenome assemblies in single fasta file</u>: such as allopolyploids and segmental allopolyploids might require formatting to split fasta file into multiple file containing each subgenome (for details: GBSapp_vx.x/misc/split_subgenomes_format_fasta_headers.txt).<br />* <u>Haploids, diploids and autopolyploids with single reference genomes</u>: splitting fasta file not required.<br />* <u>Note</u>:formatting of fasta headers to contain minimal text (e.g. >Chr05) might be required (for details: GBSapp_v0.1/misc/format_fasta_headers.txt)
- for help: --help or -h
- for version: --version or -v
From command line, run GBSapp with options shown below (absolute or relative path)
$ bash <path-to-GBSapp-directory/GBSapp> <path-to-project-directory>
Project directory setup
A project directory should contain the following sub-directories:
- samples folder: this contains your quality filtered sequence data.
- refgenomes folder: this contains your reference genome fasta file.
- config.sh file: a template of the configuration file is provided in the examples folder of the GBSapp download.
Overview of workflow
The figure below outlines the order of steps in the GBSapp pipeline
- In Progress
Configuration
Using a text editor, save a file containing any of the following variables as 'config.sh' file and include it in your project directory.
General parameters
|Variable |Default |Usage |Input |required/Optional| |:-------------|:-------------|:-------------|:-------------|:----------------| |threads|na|number of cores/processors|integer|Optional| |walkaway|true|run in walk-away or walk-through mode|true or false|Optional| |cluster|false|run on compute cluster node (default: slurm) or workstation|true or false|Optional| |nodes|1|number of nodes|integer|Optional| |RNA|false|RNA-seq reads as input (STAR aligner)|true or false|Optional| |variant_caller|gatk|gatk (recommended) or bcftools(diploid only, might be better for some genomes)|string|Optional| |samples_alt_dir|false|links samples in separate directory to project directory|true or false|Optional| |lib_type|RRS|RRS (reduced representation sequence e.g. GBS, ddRADseq, qRRS) or WGS (shotgun whole genome sequence)|string|Optional| |subsample_WGS_in_silico_qRRS|false|Fast alternative to variant calling on whole genome data|false, medium, low|Optional|
Variant calling parameters
|Variable |Default |Usage |Input |required/Optional| |:-------------|:-------------|:-------------|:-------------|:----------------| |ploidy|na|value = 1,2,4,6, or 8|integer|Required| |ref1|na|reference subgenome as .fasta file. Anchor-genome when other pangenomes/subgenomes are provided |integer|Optional| |ploidy_ref1|na|ploidy-level|integer|Optional| |Get_Chromosome|na|variant calling on specific chromosomes, scaffolds,and contigs|comma delimited string(s)|optional| |Exclude_Chromosome|na|variant calling to exclude specific chromosomes, scaffolds,and contigs|comma delimited string(s)|optional|
*note: haploid assemblies of pangenomes and subgenomes should be in individual fasta files *note: short prefix for pangenome/subgenome pseudomolecules should be unique (i.e. >TF_Chr01 and >TL_Chr01 fasta sequence header for Ipomoea trifida and I. triloba, respectively) *note: designate chromosomes of haplotype-resolved/subgenome assemblies with single character suffix (alphabets: A-Z and a-z) e.g. Chr01A
Variant filtering parameters
|Variable |Default |Usage |Input |required/Optional| |:-------------|:-------------|:-------------|:-------------|:----------------| |p1|na|maternal parent (specified only for biparental populations)|string|Optional| |p2|na|paternal parent (specified only for biparental populations)|string|Optional| |biallelic|false|filter to output only biallelic variants|string|Optional| |genotype_missingness|1|maximum proportion of missing genotypes allowed per sample|comma delimited decimal number(s)|Optional| |sample_missingness|1|maximum proportion of missing samples allowed per variant|comma delimited decimal number(s)|Optional| |exclude_samples|na|sample IDs to be exclude from filtered variant data set|comma delimited string(s)|Optional| |select_samples|na|limit variant filtering to samples IDs in file delimited by newline|filename|Optional| |minRD_1x|2|minimum read depth threshold|integer|Optional| |minRD_2x|6|minimum read depth threshold|integer|Optional| |minRD_4x|25|minimum read depth threshold|integer|Optional| |minRD_6x|45|minimum read depth threshold|integer|Optional| |minRD_8x|100|minimum read depth threshold|integer|Optional| |pseg|0.001|p-value threshold for chi-square test of segregation distortion|decimal number|Optional| |maf|0.02|minor allele frequency threshold|decimal number|Optional| |filtered_vcf|false|generate filtered vcf file|string|Optional|
Advanced parameters |Variable |Default |Usage |Input |required/Optional| |:-------------|:-------------|:-------------|:-------------|:----------------| |max_pseudoMol|5000|maximum # of pseudomolecules (scaffold/contig) before stitching into non-contiguous pseudo-chromosomes|integer|Optional| |uniquely_mapped|true|include uniquely mapped for variant calling |string|Optional| |paralogs|false|include paralogs for variant calling |string|Optional| |downsample_2x|50|value for unbiased downsampling for 2x ploidy|integer|Optional| |downsample_4x|100|value for unbiased downsampling for 4x ploidy|integer|Optional| |downsample_6x|150|value for unbiased downsampling for 6x ploidy|integer|Optional| |downsample_8x|200|value for unbiased downsampling for 8x ploidy|integer|Optional| |maxHaplotype|128|maximum # of haplotypes per haploid genome across population(increase for polyploids/high heterozygosity/high background mutational load)|integer|Optional| |use_softclip|false|use soft-clipped bases for
