RFPlasmid

Predicting plasmid contigs from assemblies using single copy marker genes, plasmid genes, kmers - Developed by Linda van der Graaf

License: GPL v3 PIP CONDA

Web interface

A web interface for testing single FASTA files is available at http://klif.uu.nl/rfplasmid/

Abstract

Predicting plasmid contigs from assemblies using single-copy marker genes, plasmid genes, and k-mers

Linda van der Graaf-van Bloois, Jaap Wagenaar, Aldert Zomer

Introduction: Antimicrobial resistance (AMR) genes in bacteria are often carried on plasmids. Since these plasmids can spread AMR genes between bacteria, it is important to know whether the genes are located on highly transferable plasmids or on the more stable chromosome. Whole genome sequence (WGS) analysis makes it easy to determine whether a strain contains a resistance gene. However, it is not easy to determine whether the gene is located on the chromosome or on a plasmid, because genome assembly generally results in 50-300 DNA fragments (contigs). With our newly developed prediction tool, we analyze the composition of these contigs to predict their likely source: plasmid or chromosome. This information can be used to determine whether a resistance gene is located on the chromosome or on a plasmid. The tool is optimized for 19 different bacterial species, including Campylobacter, E. coli, and Salmonella, and can also be used for metagenomic assemblies.

Methods: The tool identifies the number of chromosomal marker genes, plasmid replication genes and plasmid typing genes using CheckM and DIAMOND Blast, and determines pentamer frequencies and contig sizes per contig. A prediction model was trained using Random Forest on an extensive set of plasmids and chromosomes from 19 different bacterial species and validated on separate test sets of known chromosomal and plasmid contigs of the different bacteria.

Results: We show that RFPlasmid predicts chromosomal and plasmid contigs with error rates ranging from 0.002% to 4.66%, and that taxon-specific models can be superior to a general plasmid prediction model. Single-copy chromosomal marker genes, plasmid genes, k-mer content and contig length all appear to be informative; however, k-mer content is highly taxon-specific. Prediction of small contigs remains unreliable, either because these contigs consist primarily of repeated sequences present in both plasmids and chromosomes (e.g. transposases), or because k-mer content and marker genes cannot be easily identified on them.

Conclusion: The newly developed tool is able to determine if contigs are chromosomal or plasmid with a very high specificity and sensitivity (up to 99%) and can be very useful to analyze WGS data of bacterial genomes and their antimicrobial resistance genes.
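The Methods paragraph above mentions per-contig pentamer frequencies as one of the model's inputs. The sketch below illustrates what such a feature could look like in plain Python. It is not the code RFPlasmid uses: the function names `parse_fasta` and `kmer_frequencies` are hypothetical, and the choices to count each strand as read (no reverse-complement merging) and to skip windows containing ambiguous bases are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def parse_fasta(lines):
    """Yield (contig_name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def kmer_frequencies(sequence, k=5):
    """Return the relative frequency of every A/C/G/T k-mer in one contig."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers made of A, C, G and T are reported; windows that contain
    # N or other ambiguity codes are ignored (an assumption of this sketch).
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}
```

With k=5 this gives 1024 values per contig. Contigs shorter than k give all zeros, which is one way to see why very small contigs carry little k-mer signal.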

Running RFPlasmid

$ rfplasmid --initialize # Only once after installing it. See "Getting the software" below
$ rfplasmid

Error; no arguments. Required to specificy --input and --species
usage: rfplasmid.py [-h] [--species SPECIES] [--input INPUT] [--specieslist] [--jelly] [--out OUT] [--debug] [--training] [--threads THREADS] [--version]

optional arguments:
  -h, --help         show this help message and exit
  --species SPECIES  define species (required)
  --input INPUT      directory with input fasta files (required)
  --specieslist      list of available species models
  --jelly            run jellyfish as kmer-count (faster)
  --out OUT          specify output directory
  --debug            no cleanup of intermediate files
  --training         trainings mode Random Forest
  --threads THREADS  specify number of threads to be used, default is max available threads up to 16 threads
  --version          print version number

# Example
rfplasmid --species Campylobacter --input inputfolder --jelly --threads 8 --out outputfolder

The input must be a folder containing .fasta files.

--jelly requires a working Jellyfish installation. It greatly speeds up the analysis and is strongly recommended, as our k-mer profiling method in Python is slow.

Read specieslist.txt or run rfplasmid --specieslist for the available species-specific models. We have a general Enterobacteriaceae model instead of a species model. All other models are species models, except for the "Generic" model, which can be used for unknown or metagenomic samples.
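If you want to call RFPlasmid from a Python script, for example to process several input folders, a thin wrapper around the command line is one option. The sketch below uses only flags shown in the help output above (`--species`, `--input`, `--jelly`, `--threads`, `--out`). The function names `build_rfplasmid_command` and `run_rfplasmid` are hypothetical and not part of RFPlasmid, and the wrapper assumes the `rfplasmid` script is on your PATH.

```python
import subprocess
from pathlib import Path


def build_rfplasmid_command(species, input_dir, out_dir=None, threads=None, jelly=False):
    """Build the rfplasmid argument list from the documented flags."""
    cmd = ["rfplasmid", "--species", species, "--input", str(input_dir)]
    if jelly:
        cmd.append("--jelly")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if out_dir is not None:
        cmd += ["--out", str(out_dir)]
    return cmd


def run_rfplasmid(species, input_dir, **options):
    """Check that the input folder holds .fasta files, then run rfplasmid."""
    input_dir = Path(input_dir)
    if not any(input_dir.glob("*.fasta")):
        raise FileNotFoundError(f"no .fasta files found in {input_dir}")
    cmd = build_rfplasmid_command(species, input_dir, **options)
    # check=True raises CalledProcessError if rfplasmid exits with an error.
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Prints the command equivalent to the example above without running it.
    print(" ".join(build_rfplasmid_command(
        "Campylobacter", "inputfolder",
        out_dir="outputfolder", threads=8, jelly=True)))
```

Passing the arguments as a list avoids shell quoting problems with folder names that contain spaces.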

Getting the software

Using Conda

The Conda package exists thanks to https://github.com/rpetit3 and installs the CheckM database as well. A Google Colab notebook in this repository gives an example. The rfplasmid script is placed in ~/.local/bin, which is assumed to be in your PATH, in line with the systemd file-hierarchy specification (https://www.freedesktop.org/software/systemd/man/file-hierarchy.html). If it is not, please run the export PATH line.

$ conda install -c bioconda rfplasmid 
$ # or alternatively: conda create -n rfplasmid -c conda-forge -c bioconda rfplasmid ; conda activate rfplasmid
$ rfplasmid --initialize # Bash helper script to locate rfplasmid.py and initialize the plasmid databases
$ export PATH=$PATH:~/.local/bin/ #only necessary if you have not included ~/.local/bin in your path (unusual but it has been observed). 
$ rfplasmid
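If the `rfplasmid` command is not found after installation, a quick check of whether ~/.local/bin is on your PATH can help you decide whether the export PATH line is needed. This is a minimal sketch; the helper name `local_bin_on_path` is hypothetical, and it assumes PATH entries are absolute paths (the shell expands `~` when the export line is run).

```python
import os
from pathlib import Path


def local_bin_on_path(path_value=None, home=None):
    """Return True if <home>/.local/bin is one of the PATH entries."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    home = Path.home() if home is None else Path(home)
    target = home / ".local" / "bin"
    # Path() drops trailing slashes, so ".local/bin/" and ".local/bin" match.
    entries = [Path(entry) for entry in path_value.split(os.pathsep) if entry]
    return target in entries


if __name__ == "__main__":
    if local_bin_on_path():
        print("~/.local/bin is on PATH")
    else:
        print("~/.local/bin is missing; run: export PATH=$PATH:~/.local/bin/")
```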

Using Pip

Pip installs most requirements, except DIAMOND, Jellyfish and R (see below). You also need to download HMMER, Prodigal and a database for CheckM if you have never installed CheckM. A Google Colab notebook in this repository gives an example if you want to do this systemwide. This is much more work than using Conda and requires more experience.

$  pip3 install rfplasmid
$  export PATH=$PATH:~/.local/bin # pip installs in ~/.local/bin, which should be in your PATH, but some distros do not set this (unusual but it has been observed)
$  rfplasmid --initialize # a bash helper script locates the rfplasmid.py file and downloads the plasmid databases, as they are too large for pip
$  rfplasmid

Required if you have never installed CheckM before

CheckM relies on a number of precalculated data files, which can be downloaded from https://data.ace.uq.edu.au/public/CheckM_databases/. Decompress the archive to an appropriate folder and run the following to tell CheckM where the files have been placed. The example below uses wget to download an archive of the files and installs them in your home directory.

$  sudo apt install hmmer # CheckM needs HMMER. See http://hmmer.org/documentation.html for other methods of downloading and installing it
$  cd ~
$  wget https://github.com/hyattpd/Prodigal/releases/download/v2.6.3/prodigal.linux
$  cp prodigal.linux ~/bin/prodigal #we assume you have a ~/bin/ folder and it's in your path. 
$  chmod +x ~/bin/prodigal  
$  mkdir .checkm
$  cd .checkm
$  wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
$  tar xzvf checkm_data_2015_01_16.tar.gz
$  checkm data setRoot ~/.checkm

Dependencies you need to install when installing RFPlasmid using Pip

The randomForest package for R ( https://cran.r-project.org/web/packages/randomForest/index.html ), which is likely already installed.

$  R
> install.packages("randomForest") #likely also already installed

DIAMOND ( https://github.com/bbuchfink/diamond )

$ wget http://github.com/bbuchfink/diamond/releases/download/v0.9.24/diamond-linux64.tar.gz
$ tar xzf diamond-linux64.tar.gz
$ cp diamond ~/bin/diamond

Strongly recommended: Jellyfish ( http://www.genome.umd.edu/jellyfish.html )

$ wget https://github.com/gmarcais/Jellyfish/releases/download/v2.2.10/jellyfish-linux
$ cp jellyfish-linux ~/bin/jellyfish
$ chmod +x ~/bin/jellyfish
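After a pip-based install, it can be useful to confirm that the external tools listed above are reachable before the first run. The sketch below looks each executable up on PATH with `shutil.which`. The executable names (`checkm`, `hmmsearch`, `prodigal`, `diamond`, `R`, `jellyfish`) are the usual names these packages install, but whether RFPlasmid calls them by exactly these names is an assumption, and the helper `find_missing` is hypothetical.

```python
import shutil

# Tools described as required for a pip install (names are assumptions).
REQUIRED_TOOLS = ["checkm", "hmmsearch", "prodigal", "diamond", "R"]
# Only needed when running with --jelly.
OPTIONAL_TOOLS = ["jellyfish"]


def find_missing(tools, which=shutil.which):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for label, tools in (("required", REQUIRED_TOOLS), ("optional", OPTIONAL_TOOLS)):
        missing = find_missing(tools)
        status = ", ".join(missing) if missing else "none"
        print(f"missing {label} tools: {status}")
```

The `which` parameter exists only so the lookup can be replaced in tests. In normal use the default is fine.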

For advanced users that want to install the latest version from Github

You can get the source using git and run it from the folder you cloned it to. You will need to install the requirements by hand as well.

$ git clone https://github.com/aldertzomer/RFPlasmid.git
$ cd RFPlasmid
$ bash getdb.sh # downloads and formats the plasmid DBs
$ python3 rfplasmid.py

Installing requirements. This assumes you have ~/bin/ in your PATH. Depending on your setup, you may need to follow the systemwide version (see further below).

Python 3 with pandas ( https://pandas.pydata.org/)

$  pip3 install pandas

CheckM ( https://ecogenomics.github.io/CheckM/ ). According to the GitHub page of CheckM:

$  pip3 install numpy
$  pip3 install scipy
$  pip3 install pysam
$  pip3 install checkm-genome
$  cd ~
$  mkdir checkm_data
$  cd checkm_data
$  wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
$  tar xzvf checkm_data_2015_01_16.tar.gz
$  checkm data setRoot ~/checkm_data

The randomForest package for R ( https://cran.r-project.org/web/packages/randomForest/index.html )

$  R
> install.packages("randomForest")

DIAMOND ( https://github.com/bbuchfink/diamond )

$ wget http://github.com/bbuchfink/diamond/releases/download/v0.9.24/diamond-linux64.tar.gz
$ tar xzf diamond-linux64.tar.gz
