PCRamp

Software for designing multiplex-compatible, PCR-based enrichment assays

Overview

The PCRamp program designs multiplex PCR assays for amplicon sequencing-based target enrichment.

Given one or more target nucleic acid sequences, PCRamp iteratively designs the requested number of multiplex-compatible PCR primers to amplify non-overlapping regions of the target sequences. At each iteration, the design algorithm selects a PCR primer pair that (a) satisfies all specified design constraints (based on melting temperature, hairpin formation, G+C content, length, etc.) and (b) provides the best enrichment by amplifying the largest number of target sequences. This strategy preferentially selects the most conserved regions of the target sequences for amplification.
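The iterative selection described above can be illustrated with a minimal Python sketch. It is not PCRamp's implementation: the Candidate class, the shared coordinate system used to test for overlapping amplicons, and the greedy_design function are all hypothetical names introduced for illustration, and every candidate is assumed to have already passed the design constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate primer pair (not a PCRamp data structure)."""
    name: str
    start: int          # amplicon start on an assumed shared coordinate system
    end: int            # amplicon end (exclusive)
    targets: frozenset  # identifiers of targets predicted to be amplified


def overlaps(a, b):
    """True if the two half-open amplicon intervals share any position."""
    return a.start < b.end and b.start < a.end


def greedy_design(candidates, num_assays):
    """Pick up to num_assays non-overlapping candidates, most targets first."""
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < num_assays:
        # Select the candidate predicted to amplify the most targets.
        best = max(remaining, key=lambda c: len(c.targets))
        chosen.append(best)
        # Drop the winner and every candidate whose amplicon overlaps it.
        remaining = [c for c in remaining
                     if c is not best and not overlaps(c, best)]
    return chosen


pool = [
    Candidate("A", 100, 300, frozenset({"virus1", "virus2", "virus3"})),
    Candidate("B", 250, 450, frozenset({"virus1", "virus2"})),
    Candidate("C", 500, 700, frozenset({"virus2"})),
]
print([c.name for c in greedy_design(pool, 2)])
```

In this toy pool, candidate B is skipped because its amplicon overlaps the amplicon of A, so the sketch returns A followed by C.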

Optionally, users can also specify one or more background nucleic acid sequences, which must not be amplified by a pair of PCR primers.

The output of the PCRamp program is a list of PCR primer pairs. Associated with each primer pair is a list of target sequences that are predicted to be amplified by this primer pair.

Design strategy - how does PCRamp work?

The PCRamp program attempts to find highly conserved, target-specific, multiplex-compatible PCR primers. There are many target enrichment scenarios that the PCRamp program seeks to address, including:

Gene target enrichment: Challenges include potentially large numbers (tens of thousands) of moderately diverse target sequences. For many genes, the target sequences are short (less than 5 kb) and there may be closely related nearest neighbors.
Viral target enrichment: Challenges include potentially large numbers (tens of thousands) of highly diverse target sequences. However, for many viruses, the target genome sequences are fairly short (less than 50 kb) and the nearest neighbors are usually not closely related (which often means we do not need to provide background sequences to ensure PCR specificity). Degenerate nucleotides may be needed to cope with high sequence diversity.
Bacterial target enrichment: Challenges include fairly large genome sequence lengths (around 5 Mb) and often closely related near-neighbors. However, most bacterial targets have relatively few available genome sequences (typically less than five thousand).

The following algorithmic features are intended to address these challenges within a single assay design program:

A random-sampling approach to designing assays that avoids the need to load all of the target and background genomes into main memory at the same time. The target coverage (i.e. inclusivity; number of targets amplified) and specificity (i.e. avoiding amplification of background sequences) of the randomly selected PCR assays are then improved with optional local optimization steps. The downside to this approach is that assay designs are no longer deterministic - it is expected that a different set of PCR primers for the same set of targets and background sequences will be generated every time PCRamp is run.
A combination of heuristic and physics-based rules for predicting successful PCR.
Where possible, previously designed PCR primers are "reused" to increase the number of diverse target sequences that can be enriched while minimizing the total number of primer oligos required for the multiplex PCR.
Efficient, highly parallel implementation that uses multiple processors (via MPI), multiple cores (via OpenMP) and data parallel (SIMD) instructions for x86 CPUs.
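As a rough illustration of what heuristic primer rules can look like, the Python sketch below checks length, G+C content and an estimated melting temperature. The Wallace rule (2 degrees C per A or T, 4 degrees C per G or C) is a textbook approximation for short oligos and is used here only as a stand-in; the rules and thermodynamic model that PCRamp actually applies are not described in this section. All function names and threshold values are hypothetical.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in a non-empty primer (case-insensitive)."""
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def passes_heuristics(primer, min_len=18, max_len=25,
                      min_gc=0.4, max_gc=0.6, min_tm=50, max_tm=65):
    """Apply illustrative (assumed) length, G+C and Tm thresholds."""
    # Check length first so that an empty primer is rejected before dividing.
    if not min_len <= len(primer) <= max_len:
        return False
    if not min_gc <= gc_fraction(primer) <= max_gc:
        return False
    return min_tm <= wallace_tm(primer) <= max_tm


print(passes_heuristics("ACGTACGTACGTACGTACGT"))
print(passes_heuristics("AAAAAAAAAAAAAAAAAAAA"))
```

The first primer has 20 bases, 50% G+C and an estimated melting temperature of 60 degrees C, so it passes; the second has no G or C bases and is rejected.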

Building and installing PCRamp

PCRamp is written in C++ and requires a C++ compiler and a local installation of MPI ("Message Passing Interface") to compile. PCRamp was developed and tested using OpenMPI with both the GNU (on Linux) and clang (on OS X) C++ compilers. The included Makefile is intentionally very simple (and hopefully easy to read). With the exception of zlib (included on most Unix-like systems; needed for the on-the-fly reading of compressed fasta files), there are no other software dependencies. After downloading the PCRamp source code files, running the make command should build the pcramp program. There is no formal install process, but the pcramp program can be manually copied to any desired location. Please keep in mind that when running on a cluster computer with MPI, the version of MPI used to compile PCRamp must match the version that provides the mpirun program.

Running PCRamp

PCRamp uses MPI to run on a cluster computer and will automatically use all of the available CPU cores on each compute node. As a result, MPI and any cluster scheduling software (e.g. slurm, UGE, SGE, torque, ...) should be configured to run a single instance of PCRamp on each computer in the cluster and let each instance use all of the available CPU cores.

When running on multiple computers, PCRamp is typically invoked via a batch script (submitted to a cluster scheduler) or the command line (when running in interactive mode) using some variant of mpirun <MPI options> pcramp <PCRamp options>. The particular MPI options will depend on the configuration of each particular cluster computer.

PCRamp can also run on a single computer (e.g. workstation, laptop). In this case, simply omit the mpirun and directly invoke pcramp from a script or the command line. By default, pcramp will use all available threads (unless the --thread option is provided to limit the number of threads).

Input sequence formats

Target sequences (to be enriched) and optional background sequences (that are not to be amplified) are provided to PCRamp in the fasta file format. These fasta files may optionally be compressed using the gzip program (and they will automatically be decompressed "on-the-fly" to keep filesystem space requirements to a minimum). All fasta files must have one of the following file extensions: .fna, .fasta or .fa (with an optional .gz suffix to indicate compression).
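The extension rule and on-the-fly decompression can be mimicked in a few lines of Python, which may be handy for checking input files before a run. This is a sketch, not PCRamp code: has_fasta_extension and open_fasta are hypothetical helper names, and the case-sensitive matching is an assumption.

```python
import gzip

FASTA_EXTENSIONS = (".fna", ".fasta", ".fa")


def has_fasta_extension(path):
    """True for names ending in .fna, .fasta or .fa, optionally followed by .gz."""
    name = path[:-3] if path.endswith(".gz") else path
    return name.endswith(FASTA_EXTENSIONS)


def open_fasta(path):
    """Open a plain or gzip-compressed fasta file for reading as text."""
    if not has_fasta_extension(path):
        raise ValueError(f"unrecognized fasta extension: {path}")
    if path.endswith(".gz"):
        # Decompress while reading, so no uncompressed copy is written to disk.
        return gzip.open(path, "rt")
    return open(path, "r")


for candidate in ("NC_003997.3.fna.gz", "targets.fa", "notes.txt"):
    print(candidate, has_fasta_extension(candidate))
```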

PCRamp accommodates two of the common ways that biological sequences are stored in filesystems, each with its own command line flag:

For relatively short sequences (like single-segment viruses and genes), sets of distinct target sequences are often stored in a single fasta file. A toy example of this arrangement is:

>virus1
ACTAGCGATGCGACGTAGCTAGCAGCGATGCAGCTAGCAGTCGTA
>virus2
ACTAGCGCATGCGACGTAGCTAGCAGCGAGCAGCTAGACAGTCGTAGCTA
>virus3
ATGCGATCATAGCGATGCGACGTAGTAGCAGCGATGAGCTAGCAGTCGTA

To read multiple, separate targets from a single fasta file, specify the fasta file on the command line with the lower case -t flag. For example, pcramp -t <fasta file of targets> .... If the target sequences are contained in multiple fasta files, then the -t flag may be repeated on the command line, pcramp -t <fasta file of targets 1> -t <fasta file of targets 2>. Similarly, if distinct background sequences are stored in a single fasta file, they would be specified using pcramp -b <fasta file of backgrounds>. As with the -t flag, the -b flag may also be repeated to specify additional background sequences.
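To make the "one fasta record per target" arrangement concrete, the following Python sketch parses a multi-record fasta file and treats each record as a separate target. The read_fasta function is a hypothetical helper, and using the first word of each header line as the target name is an assumption; how PCRamp names targets internally is not stated here.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) for each record in an open fasta file."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep the first word of the header line as the record name.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            # Sequence lines may be wrapped, so collect them until the next header.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


toy = io.StringIO(
    ">virus1\n"
    "ACTAGCGATGCGACGTAGCTAGCAGCGATGCAGCTAGCAGTCGTA\n"
    ">virus2\n"
    "ACTAGCGCATGCGACGTAGCTAGCAGCGAGCAGCTAGACAGTCGTAGCTA\n"
    ">virus3\n"
    "ATGCGATCATAGCGATGCGACGTAGTAGCAGCGATGAGCTAGCAGTCGTA\n"
)
targets = dict(read_fasta(toy))
print(sorted(targets))
```

Applied to the toy file above, the sketch yields three separate targets named virus1, virus2 and virus3.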

For longer sequences, as well as multi-segment viruses and multi-chromosomal bacteria, it is common to store one or more sequences/segments/chromosomes from an individual target in different files. All of the sequences found in the same directory are assumed to belong to the same target. For example, a directory listing of different Bacillus genomes might look like:

Bacillus/
	anthracis/
		Bacillus_anthracis_str._Ames_7845/
			NC_003997.3.fna.gz
		Bacillus_anthracis_str.__Ames_Ancestor__8445/
			NC_007322.2.fna.gz
			NC_007323.3.fna.gz
			NC_007530.2.fna.gz
		Bacillus_anthracis_str._Australia_94_167335/
			wgs.AAES.1.fna.gz
	cereus/
		Bacillus_cereus_G9241_167215/
			wgs.AAEK.1.fna.gz
		Bacillus_cereus_MC118_399245/
			wgs.AHEM.1.fna.gz
	thuringiensis/
		Bacillus_thuringiensis_serovar_tochigiensis_BGSC_4Y1_161475/
			wgs.ACMY.1.fna.gz
		Bacillus_thuringiensis_serovar_wuhanensis_2147375/
			wgs.NFEE.1.fna.gz
		Bacillus_thuringiensis_str._Al_Hakam_15065/
			NC_008598.1.fna.gz
			NC_008600.1.fna.gz

To read multiple, separate targets, where each target is a collection of one or more fasta files in a single directory, specify the directories (or parent directories) on the command line with the capital -T flag. For example, pcramp -T Bacillus/anthracis/Bacillus_anthracis_str.__Ames_Ancestor__8445 will load all of the data in fasta files contained in the Bacillus/anthracis/Bacillus_anthracis_str.__Ames_Ancestor__8445/ directory (i.e. the chromosome and plasmid sequences stored in NC_007322.2.fna.gz, NC_007323.3.fna.gz and NC_007530.2.fna.gz).

As another example, the command pcramp -T Bacillus/anthracis will recursively search the Bacillus/anthracis directory to load the three B. anthracis genomes Bacillus_anthracis_str._Ames_7845, Bacillus_anthracis_str.__Ames_Ancestor__8445 and Bacillus_anthracis_str._Australia_94_167335. The sequences in each subdirectory will be associated with the three respective B. anthracis genomes. The ability to load all subdirectories is useful when the directory structure mirrors genome taxonomy. However, directories can still be specified individually. For example, pcramp -T Bacillus/anthracis/Bacillus_anthracis_str._Ames_7845 -T Bacillus/anthracis/Bacillus_anthracis_str._Australia_94_167335 will load the Bacillus_anthracis_str._Ames_7845 and Bacillus_anthracis_str._Australia_94_167335 genomes, but will not include the Bacillus_anthracis_str.__Ames_Ancestor__8445 genome.
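The directory-based grouping can be previewed with a short Python sketch that walks a directory tree and treats every directory containing fasta files as one target. This only approximates the behavior described above: group_targets_by_directory is a hypothetical helper, and details such as symbolic links or directories that hold both fasta files and subdirectories may be handled differently by PCRamp.

```python
import os

FASTA_SUFFIXES = (".fna", ".fasta", ".fa", ".fna.gz", ".fasta.gz", ".fa.gz")


def group_targets_by_directory(root):
    """Map each directory under root that holds fasta files to its sorted file list."""
    targets = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        fasta_files = sorted(f for f in filenames if f.endswith(FASTA_SUFFIXES))
        if fasta_files:
            # All fasta files in one directory are treated as a single target.
            targets[dirpath] = fasta_files
    return targets


if __name__ == "__main__":
    # Example: list the targets that a directory tree would define.
    for directory, files in sorted(group_targets_by_directory("Bacillus").items()):
        print(directory, len(files))
```

Run against the Bacillus tree shown earlier, a search rooted at Bacillus/anthracis would be expected to report three targets, with the Ames Ancestor directory contributing three files to a single target.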

Similar to the loading of target sequences that are stored in separate directories, background sequences that are stored in one or more files in a directory can be specified to PCRamp with the capital -B command line flag.

When loading sequences that are grouped by directory (as in the above example of Bacillus genomes), PCRamp allows a shared target or background directory prefix to be specified using the --T.prefix <path> (for targ
