NOT MAINTAINED

Use at your own risk. I am very grateful that it is being widely used but, as I completely changed my research area I cannot give my time to maintain this project. There are other, newer packages developed, which can be used instead.

PlasFlow 1.1

PlasFlow is a set of scripts used for prediction of plasmid sequences in metagenomic contigs. It relies on the neural network models trained on full genome and plasmid sequences and is able to differentiate between plasmids and chromosomes with accuracy reaching 96%. It outperforms other available solutions for plasmids recovery from metagenomes and incorporates the thresholding which allows for exclusion of incertain predictions. PlasFlow has been published in Nucleic Acids Research (https://doi.org/10.1093/nar/gkx1321).

News
Requirements
Installation
Getting started
Output
Test dataset
Detailed information
Citation
TBD
Support

News

2018-05-25 Version 1.1 released

New version (1.1) released, which is better suited for large datasets. It can be downloaded from conda and pypi, but the simplest way to upgrade is to replace PlasFlow.py file in you previous installation with the current one. If you still encounter problems with the new version, try to use smaller numbers for the --batch_size option.

Requirements:

Python 3.5
Python packages:
- Scikit-learn 0.18.1
- Numpy
- Pandas
- TensorFlow 0.10.0
- rpy2 >= 2.8
- scipy
- biopython
- dateutil >= 2.5
R 3.25
R packages:
- Biostrings

For the perl scripts, especially filter_sequences_by_length.pl:

Perl 5 and modules:
- Bioperl (installation instructions)
- Getopt

Installation

Conda-based - recommended

Conda is recommended option for installation as it properly resolve all dependencies (including R and Biostrings) and allows for installation without messing with other packages installed. Conda can be used both as the Anaconda, and Miniconda (which is easier to install and maintain).

After the installation it is required to add bioconda channel, required for Biostrings package installation:

conda config --add channels bioconda

Sometimes it can be also required to add default conda channel (conda-forge):

conda config --add channels conda-forge

To exclude the possibility of dependencies conflicts its encouraged to create spearate conda environment for Plasflow using command:

conda create --name plasflow python=3.5

Python 3.5 is required becuase of TensorFlow requirements.

to activate created environment type:

source activate plasflow

Mac users should install Tensorflow at this step (as osx-64 package is not present in default channels). If you encounter any problems with missing TensorFlow dependency on other platforms also try to install TF from this source.

conda install -c jjhelmus tensorflow=0.10.0rc0

PlasFlow can be easily installed as an Anaconda package from my Anaconda channel using:

conda install plasflow -c smaegol

With this command all required dependencies are installed into created conda environment. When installation is finished PlasFlow can be invoked as described in the Getting started section.

When you decide to finish your work with PlasFlow, you can simply deactivate current anaconda environment with command:

source deactivate

Pip installer

There is a possibility of pip based installation. However, some requirements have to be met:

Python 3.5 is required (due to TensorFlow requirements)
TensorFlow has to be installed manually:

pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl

then install PlasFlow with

pip install plasflow

However, models used for prediction have to be downloaded separately (for example using git clone https://github.com/smaegol/PlasFlow).

Manual installation

Of course, PlasFlow repo can be cloned using

git clone https://github.com/smaegol/PlasFlow

but in that case all dependencies have to be installed manually. TensorFlow can be installed as specified above:

pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl

python dependencies can be installed using pip:

pip install numpy pandas scipy rpy2 scikit-learn biopython

to install R Biostrings go to https://bioconductor.org/packages/release/bioc/html/Biostrings.html and follow instructions therein.

Perl modules for additional scripts

Perl scripts (like filter_sequences_by_length.pl) included with PlasFlow requires few Perl modules. THey can be easily installed using conda:

conda install -c bioconda perl-bioperl perl-getopt-long

or cpan:

cpan -i Bio::Perl Getopt::longer

or any package manager included in your system (apt, brew)

Getting started

PlasFlow is designed to take a metagenomic assembly and identify contigs which may come from plasmids. It outputs several files, from which the most important is a tabular file containing all predictions (specified with --output option).

Prior to the PlasFlow invocation it is highly recommended to filter sequences by length, leaving only those longer than 1000 bp. PlasFlow, similarly to other kmer-based methods, does not perform well on short sequences, as it is hard to get proper kmer coverage from them. Hence, results for short sequences are unreliable. As metagenomic assemblies usually contain large number of short contigs additional filtering test can improve results and speed up the PlasFlow. It can also prevent too high RAM usage.

To filter sequences using provided Perl script type:

filter_sequences_by_length.pl -input input_dataset.fasta -output filtered_output.fasta -thresh sequence_length_threshold

where sequence length threshold have to be provided in base pairs. Filtered fasta file can be then used directly for PlasFlow prediction.

Options available in PlasFlow include:

--input - specifies input fasta file with assembly contigs to classify [required]
--output - a name of the tsv file with the tabular output of classification [required]
--threshold - manually specified threshold for probability filtering (default = 0.7)
--labels - manually specified custom location of labels file (used for translation from numeric output to actual class names)
--models - custom location of models used for prediction (have to be specified if PlasFlow was installed using pip)
--batch_size - how many sequences can be used in the single batch of kmers frequency calculation

Output

The most important output of PlasFlow is a tabular file containing all predictions (specified with --output option), consiting of several columns including:

contig_id | contig_name | contig_length | id | label | ... --------- | ----------- | ------------- | -- | ----- | ---

where:

contig_idis an internal id of sequence used for the classification
contig_name is a name of contig used in the classification
contig_length shows the length of a classified sequence
id is an internal id of a produced label (classification)
label is the actual classification
... represents additional columns showing probabilities of assignment to each possible class

Sequences can be classified to 26 classes including: chromosome.Acidobacteria, chromosome.Actinobacteria, chromosome.Bacteroidetes, chromosome.Chlamydiae, chromosome.Chlorobi, chromosome.Chloroflexi, chromosome.Cyanobacteria, chromosome.DeinococcusThermus, chromosome.Firmicutes, chromosome.Fusobacteria, chromosome.Nitrospirae, chromosome.other, chromosome.Planctomycetes, chromosome.Proteobacteria, chromosome.Spirochaetes, chromosome.Tenericutes, chromosome.Thermotogae, chromosome.Verrucomicrobia, plasmid.Actinobacteria, plasmid.Bacteroidetes, plasmid.Chlamydiae, plasmid.Cyanobacteria, plasmid.DeinococcusThermus, plasmid.Firmicutes, plasmid.Fusobacteria, plasmid.other, plasmid.Proteobacteria, plasmid.Spirochaetes.

If the probability of assignment to given class is lower than threshold (default = 0.7) then the sequence is treated as unclassified.

Additionaly, PlasFlow produces fasta files containing input sequences binned to plasmids, chromosomes and unclassified.

Test dataset

Test dataset is located in the test folder (file Citrobacter_freundii_strain_CAV1321_scaffolds.fasta). It is the SPAdes 3.9.1 assembly of Citrobacter freundii strain CAV1321 genome (NCBI assembly ID: GCA_001022155.1), which contains 1 chromosome and 9 plasmids. In the same folder the results of classification can be found in the form of tsv file (`Citrobacter_freun

PlasFlow

Install / Use

README