PERF

Introduction

PERF is a Python package developed for fast and accurate identification of microsatellites from DNA sequences. Microsatellites or Simple Sequence Repeats (SSRs) are short tandem repeats of 1-6nt motifs. They are present in all genomes, and have a wide range of uses and functional roles. The existing tools for SSR identification have one or more caveats in terms of speed, comprehensiveness, accuracy, ease-of-use, flexibility and memory usage. PERF was designed to address all these problems.

PERF is a recursive acronym that stands for "PERF is an Exhaustive Repeat Finder". It is compatible with both Python 2 (tested on Python 2.7) and 3 (tested on Python 3.5). Its key features are:

Fast run time. As an example, identification of all SSRs from the entire human genome takes less than 7 minutes. The speed can be further improved ~3 to 4 fold using PyPy (the human genome finishes in less than 2 minutes using PyPy v5.8.0).
Linear time and space complexity (O(n))
Identifies perfect SSRs
100% accurate and comprehensive - Does not miss any repeats and does not report any incorrect ones
Easy to use - The only required argument is the input DNA sequence in FASTA format
Flexible - Most of the parameters are customizable by the user at runtime
Repeat cutoffs can be specified either in terms of the total repeat length or in terms of number of repeating units
TSV output and HTML report. The default output is an easily parseable and exportable tab-separated format. Optionally, PERF also generates an interactive HTML report that depicts trends in repeat data as concise charts and tables
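
To make the terms "perfect SSR" and "total repeat length cutoff" concrete, here is a deliberately naive scan written for illustration only. It is not PERF's algorithm and makes no claim about PERF's internals or defaults: the function name `find_perfect_ssrs`, its parameters, and the 12 nt default cutoff are all assumptions made for this sketch.

```python
def find_perfect_ssrs(seq, min_motif=1, max_motif=6, min_length=12):
    """Naive illustration (NOT PERF's algorithm).

    Returns (start, stop, motif, length) tuples with 0-based, half-open
    coordinates. All names and defaults here are hypothetical.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    i = 0
    while i < n:
        found = False
        # Try the shortest motif size first so the reported motif is minimal.
        for k in range(min_motif, max_motif + 1):
            j = i + k
            # A perfect repeat of period k continues while each base
            # equals the base k positions earlier.
            while j < n and seq[j] == seq[j - k]:
                j += 1
            length = j - i
            is_dna = set(seq[i:i + k]) <= set("ACGT")
            # Require at least two motif copies and the length cutoff.
            if is_dna and length >= max(min_length, 2 * k):
                hits.append((i, j, seq[i:i + k], length))
                i = j  # continue after this repeat (overlaps are ignored)
                found = True
                break
        if not found:
            i += 1
    return hits


print(find_perfect_ssrs("GG" + "AC" * 7 + "TT"))
# [(2, 16, 'AC', 14)]
```

The sketch skips past each repeat it reports, so overlapping repeats are not found. It is meant only to show what a perfect tandem repeat and a length cutoff look like in code.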

Change log

[0.4.6] - 2021-04-22

Fixes

Fixed usage of unit options file input for fastq input.
Fixed usage of repeats input file.

[0.4.5] - 2020-05-07

Added

Annotation of repeats with respect to genomic context using a GFF or GTF file (option -g).
Multi-threading. Parallel identification of repeats in different sequences.
Identification of perfect repeats in fastq files.
Analysis report for repeats in fastq files.
Option to identify atomic repeats.

Changed

Analysis report rebuilt with Semantic UI and ApexCharts.
Visualisation of repeat annotation data in analysis report.

Fixes

Python 2 compatibility fixed.
Bug fixes for PyPi compatibility.
Import error issues.

Installation

PERF can be directly installed using pip with the package name perf_ssr.

$ pip install perf_ssr

This name was chosen for the package so as not to clash with the existing perf package.

Alternatively, it can be installed from the source code:

# Download the git repo
$ git clone https://github.com/RKMlab/perf.git

# Install
$ cd perf
$ python setup.py install

Both of the methods add a console command PERF, which can be executed from any directory. It can also be used without installation by running the core.py file in the PERF subfolder:

$ git clone https://github.com/RKMlab/perf.git
$ cd perf/PERF
$ python core.py -h # Print the help message of PERF (see below)

Usage

The help message and available options can be accessed using

$ PERF -h # Short option
$ PERF --help # Long option

which gives the following output

usage: core.py [-h] -i <FILE> [-o <FILE>] [--format <STR>] [--version]
               [-rep <FILE>] [-m <INT>] [-M <INT>] [-s <INT>] [-S <FLOAT>]
               [--include-atomic] [-l <INT> | -u INT or FILE] [-a] [--info]
               [-g <FILE>] [--anno-format <STR>] [--gene-key <STR>]
               [--up-promoter <INT>] [--down-promoter <INT>]
               [-f <FILE> | -F <FILE>] [-t <INT>]

Required arguments:
  -i <FILE>, --input <FILE>
                        Input sequence file.

Optional arguments:
  -o <FILE>, --output <FILE>
                        Output file name. Default: Input file name + _perf.tsv
  --format <STR>        Input file format. Default: fasta, Permissible: fasta,
                        fastq
  --version             show program's version number and exit
  -rep <FILE>, --repeats <FILE>
                        File with list of repeats (Not allowed with -m and/or
                        -M)
  -m <INT>, --min-motif-size <INT>
                        Minimum size of a repeat motif in bp (Not allowed with
                        -rep)
  -M <INT>, --max-motif-size <INT>
                        Maximum size of a repeat motif in bp (Not allowed with
                        -rep)
  -s <INT>, --min-seq-length <INT>
                        Minimum size of sequence length for consideration (in
                        bp)
  -S <FLOAT>, --max-seq-length <FLOAT>
                        Maximum size of sequence length for consideration (in
                        bp)
  --include-atomic      An option to include factor atomic repeats for minimum
                        motif sizes greater than 1.
  -l <INT>, --min-length <INT>
                        Minimum length cutoff of repeat
  -u INT or FILE, --min-units INT or FILE
                        Minimum number of repeating units to be considered.
                        Can be an integer or a file specifying cutoffs for
                        different motif sizes.
  -a, --analyse         Generate a summary HTML report.
  --info                Sequence file info recorded in the output.
  -f <FILE>, --filter-seq-ids <FILE>
                        List of sequence ids in fasta file which will be
                        ignored.
  -F <FILE>, --target-seq-ids <FILE>
                        List of sequence ids in fasta file which will be used.
  -t <INT>, --threads <INT>
                        Number of threads to run the process on. Default is 1.

Annotation arguments:
  -g <FILE>, --annotate <FILE>
                        Genic annotation input file for annotation, Both GFF
                        and GTF can be processed. Use --anno-format to specify
                        format.
  --anno-format <STR>   Format of genic annotation file. Valid inputs: GFF,
                        GTF. Default: GFF
  --gene-key <STR>      Attribute key for geneId. The default identifier is
                        "gene". Please check the annotation file and pick a
                        robust gene identifier from the attribute column.
  --up-promoter <INT>   Upstream distance(bp) from TSS to be considered as
                        promoter region. Default 1000
  --down-promoter <INT>
                        Downstream distance(bp) from TSS to be considered as
                        promoter region. Default 1000

The details of each option are given below:

`-i or --input`

Expects: FILE<br> Default: None<br> This is the only required argument for the program. The input file must be a valid FASTA/FASTQ file. PERF uses Biopython's FASTA parser to read the input fasta files. It accepts both single-line and multi-line sequences. Files with multiple sequences are also valid. To see more details about the FASTA format, see this page.

`-o or --output`

Expects: STRING (to be used as filename)<br> Default: Input Filename + _perf.tsv (see below)<br> If this option is not provided, the default output filename will be the same as the input filename, with its extension replaced with '_perf.tsv'. For example, if the input filename is my_seq.fa, the default output filename will be my_seq_perf.tsv. If the input filename does not have any extension, _perf.tsv will be appended to the filename. Please note that even in the case of no identified SSRs, the output file is still created (therefore overwriting any previous file of the same name) but with no content in the file.
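
The default naming rule described above can be sketched in a few lines of Python. `default_output_name` is a hypothetical helper written for this illustration, not a function exposed by PERF, and it assumes that "extension" means whatever `os.path.splitext` treats as the final extension.

```python
import os


def default_output_name(input_path):
    """Hypothetical helper mirroring the documented default output name."""
    # splitext returns an empty extension when the name has none, so
    # "_perf.tsv" is simply appended in that case.
    root, _extension = os.path.splitext(input_path)
    return root + "_perf.tsv"


print(default_output_name("my_seq.fa"))  # my_seq_perf.tsv
print(default_output_name("my_seq"))     # my_seq_perf.tsv
```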

Output for fasta

The output is a tab-delimited file, with one SSR record per line. The output columns follow the BED format. The details of the columns are given below:

| S.No | Column | Description |
|:----:| ------ | ----------- |
| 1 | Chromosome | Chromosome or Sequence Name as specified by the first word in the FASTA header |
| 2 | Repeat Start | 0-based start position of SSR in the Chromosome |
| 3 | Repeat Stop | End position of SSR in the Chromosome |
| 4 | Repeat Class | Class of repeat as grouped by their cyclical variations |
| 5 | Repeat Length | Total length of identified repeat in nt |
| 6 | Repeat Strand | Strand of SSR based on their cyclical variation |
| 7 | Motif Number | Number of times the base motif is repeated |
| 8 | Actual Repeat | Starting sequence of the SSR irrespective of Repeat class and strand |
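
The Repeat Class and Repeat Strand columns can be illustrated with a small sketch. It assumes that the class is the alphabetically smallest string among all cyclic rotations of the motif and of its reverse complement, and that the strand is "-" when that string comes from the reverse complement. This convention agrees with the example rows shown below, but the README does not state it, so treat it as an assumption. The function names are hypothetical and are not part of PERF.

```python
def reverse_complement(motif):
    """Reverse complement of an A/C/G/T motif."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(motif))


def rotations(motif):
    """All cyclic rotations (cyclical variations) of a motif."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def repeat_class(motif):
    """Return (class, strand) under the assumed smallest-rotation rule."""
    forward = min(rotations(motif))
    reverse = min(rotations(reverse_complement(motif)))
    # Assumption: ties (self-complementary motifs) are reported as "+".
    if forward <= reverse:
        return forward, "+"
    return reverse, "-"


print(repeat_class("TCCCAG"))  # ('ACTGGG', '-')
print(repeat_class("GATA"))    # ('AGAT', '+')
```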

An example output showing some of the largest repeats from Drosophila melanogaster is given below

X       22012826  22014795  ACTGGG  1969    -       328     TCCCAG
2RHet   591337    591966    AATACT  629     -       104     ATTAGT
4       1042143   1042690   AAATAT  547     +       91      AAATAT
2RHet   598244    598789    AATACT  545     -       90      AGTATT
XHet    122       663       AGAT    541     +       135     GATA
X       22422335  22422827  AGAT    492     +       123     GATA
3R      975265    975710    AAAT    445     -       111     TTAT
X       15442288  15442724  ACAGAT  436     +       72      ACAGAT
2L      22086818  22087152  AATACT  334     -       55      TATTAG
YHet    137144    137466    AAGAC   322     -       64      CTTGT
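
Because the output is plain tab-separated text, it is straightforward to load in Python. The sketch below parses one record into a named tuple using the column order from the table above. The `SSR` type, its field names, and `parse_perf_line` are hypothetical names chosen for this example. In the example rows, the stop position minus the start position equals the repeat length, and the motif number equals the length divided by the motif size, rounded down.

```python
from collections import namedtuple

# Field names are illustrative; the order follows the documented columns.
SSR = namedtuple(
    "SSR",
    "chrom start stop repeat_class length strand units actual_repeat",
)


def parse_perf_line(line):
    """Parse one tab-separated PERF record (FASTA mode) into an SSR tuple."""
    f = line.rstrip("\n").split("\t")
    return SSR(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5], int(f[6]), f[7])


# First example row from above, written with explicit tabs.
record = parse_perf_line("X\t22012826\t22014795\tACTGGG\t1969\t-\t328\tTCCCAG")

print(record.stop - record.start == record.length)                  # True
print(record.length // len(record.repeat_class) == record.units)    # True
```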

Output for fastq

The output is a tab-delimited file, with data on each repeat class per line.

| S.No | Column | Description |
|:----:| ------ | ----------- |
| 1 | Repeat Class | Class of repeat as grouped by their cyclical variations |
| 2 | Number of reads | Number of reads having an instan
