SkillAgentSearch skills...

SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data

Install / Use

/learn @chasewnelson/SNPGenie

README

<img src="https://github.com/chasewnelson/snpgenie/blob/master/logo1a.jpg?raw=true" alt="SNPGenie logo by Elizabeth Ogle" width="450" height="175" align="middle">

SNPGenie is a collection of Perl scripts for estimating π<sub>N</sub>/π<sub>S</sub>, d<sub>N</sub>/d<sub>S</sub>, and gene diversity from next-generation sequencing (NGS) single-nucleotide polymorphism (SNP) variant data. Each analysis uses its own script:

  1. WITHIN-POOL ANALYSIS. Use snpgenie.pl, the original SNPGenie. Analyzes within-sample π<sub>N</sub>/π<sub>S</sub> from pooled NGS SNP data. SNP reports (CLC, Geneious, or VCF) must each correspond to a single population/pool, with variants called relative to one reference sequence (one sequence in one FASTA file). Example:

     snpgenie.pl --vcfformat=4 --snpreport=<variants>.vcf --fastafile=<ref_seq>.fa --gtffile=<CDS_annotations>.gtf
    
  2. WITHIN-GROUP ANALYSIS. Use snpgenie\_within\_group.pl. Traditional within-group analysis of d<sub>N</sub>/d<sub>S</sub> among aligned sequences in a FASTA file, similar to <a target="_blank" href="http://www.megasoftware.net/">MEGA</a>, with automation, sliding windows, parallelism, and overlapping gene features. Example:

     snpgenie_within_group.pl --fasta_file_name=<aligned_seqs>.fa --gtf_file_name=<CDS_annotations>.gtf --num_bootstraps=10000 --procs_per_node=8
    
  3. BETWEEN-GROUP ANALYSIS. Traditional between-group analysis comparing two or more groups (FASTA files) of aligned sequences, such as possible using <a target="_blank" href="http://www.megasoftware.net/">MEGA</a>, with automation, sliding windows, parallelism, and overlapping gene features. Example:

     snpgenie_between_group.pl
    

UPDATE FOR VCF INPUT: given myriad VCF formats, it is NECESSARY TO SPECIFY THE FORMAT OF THE VCF SNP report using the --vcfformat argument. Format 4 is the most common. For details, see the section on VCF.

Contents

<a name="introduction"></a>Introduction

New applications of next-generation sequencing (NGS) use pooled samples containing DNA from multiple individuals to perform population genetic analyses. SNPGenie is a Perl program which analyzes the results of a single-nucleotide polymorphism (SNP) caller to calculate nucleotide diversity (including its nonsynonymous and synonymous partitions, π<sub>N</sub> and π<sub>S</sub>) and gene diversity. Variant calls are typically present in annotation tables and assume that the pooled DNA sample is representative of some population of interest. For example, if one is interested in determining the nucleotide diversity of a virus population within a single host, it might be appropriate to sequence the pooled nucleic acid content of the virus in a blood sample from that host. Comparing π<sub>N</sub> and π<sub>S</sub> for a gene product, or comparing gene diversity at polymorphic sites of different types, may help to identify genes or gene regions subject to positive (Darwinian) selection, negative (purifying) selection, or random genetic drift. SNPGenie also includes such features as minimum allele frequency trimming (see Options), and can be combined with upstream applications such as maximum-likelihood SNP calling techniques (e.g., see Lynch et al. 2014). For additional background, see Nelson & Hughes (2015) in the References.

<a name="snpgenie"></a>SNPGenie

<a name="snpgenie-input"></a>SNPGenie Input

SNPGenie is a command-line application written in Perl, with no additional dependencies beyond the standard Perl package (i.e., simply download and run). Input includes:

  1. One Reference Sequence file, containing one reference sequence, in FASTA format (.fa/.fasta);
  2. One file with CDS annotations in Gene Transfer Format (.gtf); and
  3. One or more tab-delimited (.txt) SNP Reports, each corresponding to a single pooled-sequencing run (i.e., sample from a population) in CLC, VCF, or Geneious formats, with variants called relative to the reference sequence.

All input files must be placed in the same directory. Then simply run the appropriate SNPGenie script from the command line (see Options if you wish for more control). To do this, first download the snpgenie.pl script and place it in your system’s $PATH, or simply in your working directory. Next, place your SNP report(s), FASTA (.fa/.fasta), and GTF (.gtf) files in your working directory. Open the command line prompt (or Terminal) and navigate to the directory containing these files using the "cd" command in your shell. Finally, execute SNPGenie by typing the name of the script and pressing the <RETURN> (<ENTER>) key:

snpgenie.pl

Further details on input and options are below.

<a name="ref-seq"></a>INPUT (1) One Reference Sequence File

Only one reference sequence must be provided in a single FASTA (.fa/.fasta) file (i.e., one file containing one sequence). Thus, all SNP positions in the SNP reports are called relative to the single reference sequence. Because of this one-sequence stipulation, a script has been provided to split a multi-sequence FASTA file into its constituent sequences for multiple or parallel analyses; see Additional Scripts below.

<a name="gtf"></a>INPUT (2) One Gene Transfer Format File

The Gene Transfer Format (.gtf) file is tab (\t)-delimited, and must include non-redundant records for all CDS elements (i.e., open reading frames, or ORFs) present in your SNP report(s). Thus, if multiple records exist with the same gene name (e.g., different transcripts), ONLY ONE may be included in your analysis at a time. The GTF should also include any ORFs which do not contain any variants (if they exist). SNPGenie expects every coding element to be labeled as type "CDS", and for its product name to follow a "gene_id" tag. In the case of CLC and Geneious SNP reports, this name must match that present in the SNP report. If a single coding element has multiple segments (e.g., exons) with different coordinates, simply enter one line for each segment, using the same product name; they will be concatenated. Finally, for sequences with genes on the reverse '-' strand, SNPGenie must be run twice, once for each strand, with the minus strand's own set of input files (i.e., the '-' strand FASTA, GTF, and SNP report); see A Note on Reverse Complement ('-' Strand) Records below. For more information about GTF, please visit <a target="_blank" href="http://mblab.wustl.edu/GTF22.html">The Brent Lab</a>. To convert a GFF file to GTF format, use a tool like <a target="_blank" href="https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread">gffread</a>. A short GTF example follows:

reference.gbk	CLC	CDS	5694	8369	.	+	0	gene_id "ORF1";
reference.gbk	CLC	CDS	8203	8772	.	+	0	gene_id "ORF2";
reference.gbk	CLC	CDS	1465	4485	.	+	0	gene_id "ORF3";
reference.gbk	CLC	CDS	5621	5687	.	+	0	gene_id "ORF4";
reference.gbk	CLC	CDS	7920	8167	.	+	0	gene_id "ORF4";
reference.gbk	CLC	CDS	5395	5687	.	+	0	gene_id "ORF5";
reference.gbk	CLC	CDS	7920	8016	.	+	0	gene_id "ORF5";
reference.gbk	CLC	CDS	4439	5080	.	+	0	gene_id "ORF6";
reference.gbk	CLC	CDS	5247	5549	.	+	0	gene_id "ORF7";
reference.gbk	CLC	CDS	4911	5246	.	+	0	gene_id "ORF8";

<a name="SNP-Reports"></a>INPUT (3) SNP Report File(s)

Each SNP report should contain variant calls for a single pooled-sequencing run (i.e., sample from population), or alternatively a summary of multiple individual sequences. Its site numbers should refer to the exact reference sequence present in the FASTA input. SNP reports must be provided in one of the following formats:

<a name="clc"></a>CLC Genomics Workbench

A <a target="_blank" href="http://www.clcbio.com/products/clc-genomics-workbench/">CLC Genomics Workbench</a> SNP report must include the following columns to be used with SNPGenie, with the unaltered CLC column headers:

  • Reference Position, which refers to the start site of the polymorphism within the reference FASTA sequence;
  • Type, which refers to the nature of the record, usually the type of polymorphism, e.g., "SNV” for single-nucleotide variants;
  • Reference, the reference nucleotide(s) at that site(s);
  • Allele, the variant nucleotide(s) at that site(s);
  • Count, the number of reads containing the variant;
  • Coverage, the total number of sequencing reads at the site(s);
  • Frequency, the frequency of the variant as a percentage, e.g., “14.6” for 14.60%; and
  • Overlapping annotations, containing the name of the protein product or open reading frame (ORF), e.g., “CDS: ORF1”.

In addition to containing the aforementioned columns, the SNP report should ideally be free of thousand separators (,) in the Reference Position, Count, and Coverage columns (default .txt output). The Frequency must remain a percentage (default .txt output). Finally, the user should verify that the reading frame in the CLC input file is correct.

<a name="geneious"></a>Geneious

A <a target="_blank" href="http://ww

View on GitHub
GitHub Stars121
CategoryDevelopment
Updated1mo ago
Forks37

Languages

Perl

Security Score

100/100

Audited on Feb 18, 2026

No findings