CSP2

A Nextflow pipeline for fast and accurate SNP distance estimation from WGS read data or genome assemblies

<p align="center"> <img src="img/Temp_Logo.jpg" alt="drawing" width="400"/> </p>

CFSAN SNP Pipeline 2 (CSP2)

Dr. Robert Literman

Office of Analytics and Outreach
Center for Food Safety and Applied Nutrition
US Food and Drug Administration

Current Release: v.0.9.5 (Oct-17-2024)
Last Push: Oct-17-2024

Important Note: CSP2 is currently under development and has not been validated for non-research purposes. Current workflows and data processing parameters may change before the full release.

CSP2 is a Nextflow pipeline for rapid, accurate SNP distance estimation from assembly data

CSP2 runs on Unix and needs only the handful of dependencies listed in the Software Dependencies section. CSP2 was developed to (1) improve on the speed of the CFSAN SNP Pipeline (CSP), (2) reduce the computational burden of analyzing larger isolate clusters, and (3) remove the dependency on raw Illumina data. CSP2 relies on the accurate and rapid alignment of genome assemblies provided by MUMmer; these alignments typically complete within seconds, which significantly reduces runtime compared to methods that rely on read mapping. The use of assemblies in place of sequencing data also means that:

  • the amount of storage needed can be substantially reduced,
  • significantly fewer computational resources are required,
  • as long as assemblies are available, isolates can be compared regardless of sequencing platform or whether publicly available sequence data even exists.

CSP2 runs are managed via Nextflow, providing the user with an array of customizations while also facilitating module development and additions in future releases.

Important Note: The software continues to be focused on the analysis of groups of bacterial genomes with limited evolutionary differences (<1000 SNPs). Testing is underway to determine how the underlying cluster diversity impacts distance estimates.

CSP2 has two main run modes (See Examples):

1) "Screening Mode" (--runmode screen): Used to determine whether query isolates are close to a set of reference isolates (e.g., lab control strains, strains related to an outbreak, etc.)

Given one or more user-provided reference isolates (--ref_reads; --ref_fasta), get alignment statistics and SNP distances between all reference and query isolates (--reads; --fasta)

2) "SNP Pipeline Mode" (--runmode snp): Used to generate pairwise distances and alignments for a set of query isolates

Generate pairwise SNP distances and alignments for 2+ isolates (--reads; --fasta) based on comparisons to:

  • One or more user-provided references (--ref_reads; --ref_fasta), or
  • One or more reference isolates selected by RefChooser (--n_ref)
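
As a rough illustration of how the two run modes map onto a command line, the sketch below assembles `nextflow run CSP2.nf` invocations as argument lists. The helper `build_csp2_command` and the input paths are hypothetical and exist only for this example; the flags (`--runmode`, `--ref_fasta`, `--fasta`, `--n_ref`) and the `-profile` option are the ones named in this README. Which input forms each flag accepts is not covered here, so treat the paths as placeholders.

```python
import shlex


def build_csp2_command(runmode, profile="standard", **params):
    """Assemble a `nextflow run CSP2.nf` command as an argument list.

    Hypothetical helper for illustration; it is not part of CSP2.
    """
    if runmode not in ("screen", "snp"):
        raise ValueError(f"unknown run mode: {runmode}")
    # Nextflow's own options take one hyphen; CSP2 parameters take two.
    cmd = ["nextflow", "run", "CSP2.nf", "-profile", profile,
           "--runmode", runmode]
    for name, value in params.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Screening mode: compare query assemblies to user-provided references.
screen_cmd = build_csp2_command(
    "screen",
    ref_fasta="assemblies/references",  # hypothetical path
    fasta="assemblies/queries",         # hypothetical path
)

# SNP Pipeline mode: ask RefChooser to select two reference isolates.
snp_cmd = build_csp2_command("snp", fasta="assemblies/queries", n_ref=2)

print(shlex.join(screen_cmd))
print(shlex.join(snp_cmd))
```

Building the command as a list (rather than a single string) keeps it safe to pass to `subprocess.run` without shell quoting issues.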

All CSP2 sequence comparisons happen at the assembly level, but if reads are provided CSP2 will perform a genome assembly using SKESA. In either case, CSP2 then calls MUMmer for alignment. If a sufficient portion of the reference genome is aligned (--min_cov), that data is passed through a set of filters that largely mimic those from the CFSAN SNP Pipeline, including the automated removal of:

  • Sites from short alignments (--min_len)
  • Sites from poorly aligned contigs (--min_iden)
  • Sites close to the contig edge (--query_edge/--ref_edge)
  • Sites from regions of high SNP density (--dwin/--wsnps)
  • Multiply aligned sites
  • Non-base sites (e.g., 'N' or '?')
  • Heterozygous sites
  • Indels (for now)
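
The order of operations described above (a coverage gate, followed by per-alignment filters) can be sketched in a few lines of Python. This is a simplified illustration and not CSP2's actual implementation: the `Alignment` record, the function names, and the choice to measure alignment length on reference coordinates are all assumptions made for the example. Only the threshold defaults (85% coverage, 99% identity, 500 bp) come from this README.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for one reference/query alignment."""
    ref_contig: str
    ref_start: int            # 1-based, inclusive
    ref_end: int              # 1-based, inclusive
    percent_identity: float


def percent_reference_covered(alignments, ref_length):
    """Percent of reference bases covered by at least one alignment."""
    covered = 0
    current = None  # (contig, start, end) of the interval being merged
    for aln in sorted(alignments, key=lambda a: (a.ref_contig, a.ref_start)):
        if current and current[0] == aln.ref_contig and aln.ref_start <= current[2] + 1:
            # Overlapping or adjacent: extend the current interval.
            current = (current[0], current[1], max(current[2], aln.ref_end))
        else:
            if current:
                covered += current[2] - current[1] + 1
            current = (aln.ref_contig, aln.ref_start, aln.ref_end)
    if current:
        covered += current[2] - current[1] + 1
    return 100.0 * covered / ref_length


def filter_alignments(alignments, min_len=500, min_iden=99.0):
    """Keep alignments that are long enough and similar enough."""
    return [
        a for a in alignments
        if (a.ref_end - a.ref_start + 1) >= min_len
        and a.percent_identity >= min_iden
    ]


alignments = [
    Alignment("ctg1", 1, 6000, 99.9),
    Alignment("ctg1", 5001, 9000, 99.5),
    Alignment("ctg1", 9100, 9400, 99.8),   # too short for min_len=500
    Alignment("ctg1", 9500, 10000, 97.0),  # below min_iden=99
]

coverage = percent_reference_covered(alignments, ref_length=10000)
kept = filter_alignments(alignments) if coverage >= 85 else []
print(f"coverage={coverage:.2f}%, alignments kept={len(kept)}")
```

The remaining filters in the list (edge proximity, multiply aligned sites, non-base sites, and so on) operate on individual sites rather than whole alignments and are omitted here.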

This final dataset is summarized into a .snpdiffs file, which contains:

  1. A one-line header with alignment statistics
  2. A BED file of contig mappings that pass QC
  3. Information about SNPs (if present)

To avoid unnecessary realignment, a .snpdiffs file generated under a particular set of QC parameters (which are recorded in the file as the "QC_String") can be reused in other CSP2 runs via the --snpdiffs argument, provided those runs use the same QC parameters.
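
The reuse rule amounts to comparing a stored parameter fingerprint against the current run's settings. The sketch below illustrates that idea only. The actual layout of the QC_String is not described in this section, so the `key=value` format, the set of keys, and both function names are hypothetical.

```python
def make_qc_string(params):
    """Serialize QC parameters in a fixed key order.

    Hypothetical format for illustration; the real QC_String layout
    written by CSP2 may differ.
    """
    keys = ("min_cov", "min_iden", "min_len", "dwin", "wsnps")
    return "_".join(f"{key}={params[key]}" for key in keys)


def can_reuse_snpdiffs(cached_qc_string, current_params):
    """A cached result is reusable only if every QC parameter matches."""
    return cached_qc_string == make_qc_string(current_params)


defaults = {
    "min_cov": 85, "min_iden": 99, "min_len": 500,
    "dwin": "1000,125,15", "wsnps": "3,2,1",
}
cached = make_qc_string(defaults)

stricter = dict(defaults, min_len=1000)
print(can_reuse_snpdiffs(cached, defaults))  # same parameters
print(can_reuse_snpdiffs(cached, stricter))  # min_len changed
```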


Software Dependencies

The following software is required to run CSP2. The versions used during CSP2 development are noted in parentheses.


Installing CSP2

CSP2 can be installed by cloning the GitHub repo and configuring the nextflow.config and profiles.config to suit your needs

git clone https://github.com/CFSAN-Biostatistics/CSP2.git

Tips for configuring CSP2

CSP2 options can be specified on the command line, or through the Nextflow configuration files detailed in the next section. Feel free to skip this section if you're familiar with editing Nextflow configuration files.

There are two main configuration files associated with CSP2:

  • The profiles.config file is where you add custom information about your computing environment, although you can also set parameters here. An example configuration setup (slurmHPC) is provided as a model.

  • In this example profile, access to the required programs relies on the loading of modules. However, there is no need to specify a module for Python, MUMmer, SKESA, bedtools, or MASH if those programs are already in your path.

profiles {
    standard {
        process.executor = 'local'
        params.cores = 1
        params.python_module = ""
        params.mummer_module = ""
        params.skesa_module = ""
        params.bedtools_module = ""
        params.mash_module = ""
        params.bbtools_module = ""
    }
    slurmHPC {
        process.executor = 'slurm'
        params.cores = 20
        params.python_module = "/nfs/software/modules/python/3.8.1"
        params.mummer_module = "/nfs/software/modules/mummer/4.0.0"
        params.skesa_module = "/nfs/software/modules/skesa/2.5.0"
        params.bedtools_module = "/nfs/sw/Modules/bedtools"
        params.bbtools_module = "/nfs/software/modules/bbtools/38.94"
        params.mash_module = "/nfs/software/modules/mash/2.3"
        params.trim_name = "_contigs_skesa"
    }
}
  • If you plan to run CSP2 locally, be sure to edit params.cores in the standard profile to match the available cores on your system
  • If you add your own profile, be sure to specify it on the command line with -profile (a single hyphen):
nextflow run CSP2.nf -profile myNewProfile <args>
  • The nextflow.config file is where you can change other aspects of the CSP2 run, including data location, QC parameters, and all the options listed below:

Options with defaults include:

| Parameter | Description | Default Value |
|---|---|---|
| --outroot | Base directory to create output folder | $CWD |
| --out | Name of the output folder to create (must not exist) | CSP2_(java.util.Date().getTime()) |
| --forward | Full file extension for forward/left reads of query | _1.fastq.gz |
| --reverse | Full file extension for reverse/right reads of query | _2.fastq.gz |
| --ref_forward | Full file extension for forward/left reads of reference | _1.fastq.gz |
| --ref_reverse | Full file extension for reverse/right reads of reference | _2.fastq.gz |
| --readext | Extension for single-end reads for query | fastq.gz |
| --ref_readext | Extension for single-end reads for reference | fastq.gz |
| --min_cov | Do not analyze queries that cover less than <min_cov>% of the reference assembly | 85 |
| --min_iden | Only consider alignments where the percent identity is at least <min_iden>% | 99 |
| --min_len | Only consider alignments that span at least <min_len>bp | 500 |
| --dwin | A comma-separated list of windows to check SNP densities | 1000,125,15 |
| --wsnps | The maximum number of SNPs allowed in the corresponding window from --dwin | 3,2,1 |
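
To make the pairing of --dwin and --wsnps concrete, here is a minimal sketch of one plausible reading of the density filter: for each window size, any stretch of that many bases containing more SNPs than the paired maximum has all of its SNPs flagged. This is an interpretation for illustration and may not match how CSP2 evaluates windows internally. The function name `density_filter`, the window anchoring (each window starts at a SNP position), and the example positions are assumptions.

```python
from bisect import bisect_left


def density_filter(positions, dwin=(1000, 125, 15), wsnps=(3, 2, 1)):
    """Split SNP positions on one contig into (kept, removed).

    Each window size in `dwin` is paired with the maximum SNP count
    in `wsnps` at the same index.
    """
    pos = sorted(positions)
    flagged = set()
    for window, max_snps in zip(dwin, wsnps):
        for i, start in enumerate(pos):
            # Index one past the last SNP inside [start, start + window - 1].
            j = bisect_left(pos, start + window, lo=i)
            if j - i > max_snps:
                flagged.update(pos[i:j])
    kept = [p for p in pos if p not in flagged]
    return kept, sorted(flagged)


snps = [100, 5000, 5005, 9000, 9040, 9080, 20000]
kept, removed = density_filter(snps)
print("kept:", kept)
print("removed:", removed)
```

With the default settings, the pair at 5000 and 5005 exceeds the limit of 1 SNP per 15 bp, and the three SNPs between 9000 and 9080 exceed the limit of 2 SNPs per 125 bp, so only the isolated SNPs at 100 and 20000 are kept.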
