BaGPipe
A bacterial GWAS pipeline written in Nextflow that uses Pyseer; Part III Project for Systems Biology at the University of Cambridge
Overview
Running a bacterial GWAS isn't that hard, but the pre- and post-processing of data can be tedious. In particular, a user must contend with data conversions, estimate computational resource requirements, and wait for one process to finish before another can start. These tasks take precious time, and sometimes introduce small errors that impact the reliability of results.
BaGPipe is a Nextflow pipeline that integrates a series of bioinformatic tools into a standard, tested workflow for performing bacterial GWAS on large datasets (see Figure below).
What does it do?
The most comprehensive way of running BaGPipe starts with only genome assemblies. In this mode, it generates all other data required as input for bacterial GWAS by:
- Creating k-mers (short DNA sequences of length k) or unitigs (longer, non-overlapping sequences assembled from k-mers)
- Annotating assemblies
- Performing a pangenome analysis and creating a core phylogenetic tree
- Generating a pairwise distance matrix
Users have the flexibility to enter the workflow at alternative starting points in this pre-processing stage, e.g. supplying their own tree.
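To illustrate what the k-mer step produces, here is a minimal Python sketch of overlapping k-mer counting (illustrative only; BaGPipe uses dedicated tools for k-mer/unitig generation, not this function):

```python
def count_kmers(sequence: str, k: int) -> dict:
    """Count all overlapping k-mers of length k in a DNA sequence."""
    counts = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

# Example: 4-mers of a short sequence; ATGC occurs twice
print(count_kmers("ATGCGATGC", 4))
```

Unitigs extend this idea by collapsing overlapping k-mers into longer, non-redundant sequences, which reduces the number of variants tested downstream.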
For the association analysis, Pyseer (Lees et al., 2018) is used for its speed and its design, which addresses common problems of bacterial GWAS such as population structure, recombination rate, and multiple testing. By default, BaGPipe uses a linear mixed model and unitigs as the input genotype (the options recommended by the Pyseer authors).
In the post-processing stage, BaGPipe automatically performs significant unitig analysis. If the user provides reference files, it can conveniently produce Manhattan plots, annotate the significant unitigs, and eventually produce a gene-hit plot. If the user provides an additional genome dataset, it can screen for significant unitigs from the discovery analysis in the provided dataset.
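The significance threshold used to call a unitig "significant" is typically a Bonferroni correction over the number of unique variant patterns tested, as recommended in the Pyseer documentation. A minimal sketch of that calculation (function name is illustrative, not BaGPipe's API):

```python
def bonferroni_threshold(n_patterns: int, alpha: float = 0.05) -> float:
    """Genome-wide significance threshold, Bonferroni-corrected for the
    number of unique variant patterns tested."""
    return alpha / n_patterns

# e.g. ~1 million unique unitig patterns
print(f"{bonferroni_threshold(1_000_000):.1e}")  # prints 5.0e-08
```

Unitigs that pass this threshold are carried forward to annotation and the Manhattan and gene-hit plots.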
Users can also run AMR prediction on the input genomes. This provides a comparison of GWAS hits with known AMR genes from the AMRFinder database, allowing users to explore novelty within their analyses.
BaGPipe facilitates GWAS analysis on large datasets by easing the computational burden. It optimises the use of requested memory and CPU, which can be customised if necessary. Additionally, an automated resource escalation strategy ensures that a process will be re-run with higher resource requests if the process failed due to lack of memory or runtime on an HPC cluster node.
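A resource-escalation strategy of this kind is usually expressed in Nextflow as a retry `errorStrategy` with attempt-scaled resources. A hedged sketch (the process name, exit codes, and values here are illustrative, not BaGPipe's actual settings):

```groovy
process {
    withName: 'PANGENOME_ANALYSIS' {
        // Retry on typical HPC out-of-memory / timeout kill signals
        errorStrategy = { task.exitStatus in [130, 140] ? 'retry' : 'finish' }
        maxRetries    = 3
        // Scale memory and runtime with each retry attempt
        memory = { 8.GB * task.attempt }
        time   = { 12.h * task.attempt }
    }
}
```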
Requirements
Software
- A Unix-like operating system environment (e.g. Linux, macOS, Windows with WSL) with Bash 3.2 or later.
- Java 11 or later (OpenJDK), as a Nextflow dependency
- Nextflow as the workflow management system
- Docker or Singularity for pipeline dependency management
NOTE: Java and Nextflow can be installed following the instructions at https://www.nextflow.io/docs/latest/install.html.
Hardware (recommended)
- >= 16GB RAM
- >= 2 CPUs/cores
- Enough disk space for container images (< 10 GB) and intermediate data
Scalability
BaGPipe is scalable: it can be run on HPC by exploiting Nextflow parallelism for per-sample steps. For more than 5,000 genomes, however, users should expect cohort-wide steps (pangenome/phylogeny/kinship) to dominate runtime, and may prefer (i) providing pre-computed inputs (e.g., phylogeny/kinship/pangenome outputs) or (ii) splitting analyses by lineage/clade and combining results downstream, consistent with current Panaroo recommendations.
Getting started
Example command
Running BaGPipe is easy! You just need to specify a few input files (see Inputs below). Some example input is provided in the example_input folder. These example inputs will work with the below commands, provided assemblies available in the Pyseer tutorial are downloaded and paths in the inputs updated accordingly. Note: If you try BaGPipe on this Streptococcus pneumoniae dataset you can replicate some analysis I did in my paper (bioRxiv)!
Different configuration profiles can be specified depending on the compute environment in which BaGPipe is run. In most cases, BaGPipe should be run in an HPC environment, but these environments are specific to host institutions. To run BaGPipe with an nf-core profile appropriate for your institution, find a config file here. If a config file exists for your institution, run it using -profile <institution>. Different profiles can be combined in a list, with later profiles overriding previous ones.
You can also enable different methods for dependency management using profiles. Currently supported profiles for this purpose include: singularity and docker.
For instance, to run on an HPC using Singularity to handle pipeline dependencies (remember to update the input file paths according to your customised directory structure):
nextflow run sanger-pathogens/BaGPipe \
-c example_input/cambridge.config \
-profile singularity \
--manifest example_input/genome_manifest.csv \
--genotype_method unitig \
--reference example_input/reference_manifest.tsv \
--genus Streptococcus \
--phenotypes example_input/pheno.tab \
--chosen_phenotype penicillin \
--bakta_db rds/databases/bakta_db_v6/db \
-resume
I test-ran this successfully on the Cambridge HPC; you can find the profile config that I used in example_input.
If you are running on the Sanger HPC, you can run like this:
bsub -q oversubscribed -M 4000 -R "rusage[mem=4000] select[mem>4000]" -o test.o -e test.e \
nextflow run sanger-pathogens/BaGPipe \
-profile sanger,singularity \
--manifest genome_manifest.csv \
--genotype_method unitig \
--reference reference_manifest.tsv \
--genus Streptococcus \
--phenotypes pheno.tab \
--chosen_phenotype penicillin \
--bakta_db rds/databases/bakta_db_v6/db \
-resume
If you would like to make code changes it may be better to clone this repository and run the pipeline using nextflow run main.nf, instead of pulling directly from GitHub (as in the example command above).
For more options, please explore the help message (using --help).
Execution and Caching
Users can supply parameters to the pipeline in two ways, via:
- Command Line Interface (CLI), as shown in the example command.
- Nextflow configuration file (similar to nextflow.config)
If a configuration file is specified, the user must establish the paths for all input files and define the specific analysis along with its related parameters.
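As a sketch of the configuration-file route, the CLI options from the example command map directly to Nextflow `params` (the file name and values below are illustrative; check --help for the full parameter list):

```groovy
// my_run.config -- parameter names mirror the CLI options above
params {
    manifest         = "example_input/genome_manifest.csv"
    genotype_method  = "unitig"
    phenotypes       = "example_input/pheno.tab"
    chosen_phenotype = "penicillin"
}
```

This file would then be supplied with `-c my_run.config` in place of the corresponding command-line flags.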
In case of any error during a run, the user can rectify the error and restart the pipeline using the -resume Nextflow option. Doing so utilises cached files from the last run, without re-running everything from the top.
Inputs
The user must specify, through the option --genotype_method, one of three variant genotype methods:
- k-mers/unitigs (unitig)
- gene presence or absence (pa)
- SNPs and INDELs (insertions and deletions) (snp)
If unsure, I recommend the unitig approach, as it is the one recommended by the Pyseer authors. Currently, the unitig option is fully tested; the others are incomplete.
| Input Type | Format | Use Case |
|-------------------|-----------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| Genomes | A CSV file listing all sample IDs (first column) and the paths to their FASTA files (second column) | Compulsory |
| Phenotypes | A TSV file listing all sample IDs (first column) and whether the isolate belongs to each phenotype (0 or 1, each of the other columns); Choose the phenotype to test with the flag --chosen_phenotype followed by the column header name of that phenotype | Compulsory |
| References | A TSV file listing the paths to reference assemblies (first column) and the paths to their corresponding GFF files (second column) | Significant unitig analysis |
| Annotated genomes | A CSV file listing all sample IDs (first column) and the paths to their GFF files (second column) | If the user prefers their own GFFs in “unitig” mode |
| Phylogenetic tree | A phylogenetic tree file for the pangenome in NEWICK format | If the user prefers their own tree in “unitig” mode |
| Variant file | A CSV file listing all sample IDs (first column) and the paths to their VCF files (second column) | If the user prefers their own VCFs in “snp” mode |
| Merged variant file | A merged VCF file for all samples | If the user prefers their own merged VCFs in “snp” mode |
| Additional genomes for screening | A CSV file listing all sample IDs (first column) and the paths to their FASTA files (second column) | Screening for significant unitigs from the discovery analysis |
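The genome manifest can be assembled from a directory of FASTA files with a few lines of Python. This is a convenience sketch, not part of BaGPipe; the header names are illustrative, so check --help for the exact expected columns:

```python
import csv
from pathlib import Path

def write_genome_manifest(assembly_dir: str, out_csv: str) -> None:
    """Write a two-column manifest: sample ID, then path to its FASTA file."""
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample_id", "assembly_path"])
        for fasta in sorted(Path(assembly_dir).glob("*.fa*")):
            # Use the file name (without extension) as the sample ID
            writer.writerow([fasta.stem, str(fasta.resolve())])
```

The phenotype TSV can be built the same way, keeping the first column's sample IDs identical to those in the genome manifest.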
How does it work?
1. Annotation
By default, Bakta (v.1.11.3) (Schwengers,