remoVecSec

remoVecSec is a Python toolkit for detecting and removing contamination in assembled genomes, designed to facilitate pre-submission quality control for NCBI WGS and similar repositories. It provides command-line tools and Python modules to automate the screening and cleaning of genome assemblies from common sources of contamination (vectors, adaptors, E. coli, phage, mitochondrial, and more).

Features

  • Automated detection and removal of vector, adaptor, and mitochondrial contamination
  • BLAST-based screening using NCBI-recommended databases
  • Modular Python code for integration into custom pipelines
  • Command-line interface for batch processing
  • Output of cleaned genome sequences and summary reports

Installation

Clone this repository and install dependencies (requires Python 3.7+ and Biopython):

git clone https://github.com/htafer/remoVecSec.git
cd remoVecSec
pip install -r requirements.txt

You will also need BLAST+ and VecScreen installed and available on your PATH. See the NCBI BLAST+ download page and the VecScreen page.

Usage

The main entry point is the remower.py script:

python3 bin/remower.py -g genome.fa -c contam_db -v vec_db -m mito_db

Arguments:

  • --genomefile, -g: Input genome FASTA file (required)
  • --dbvec, -v: VecScreen database (optional)
  • --dbmito, -m: Organelle (mitochondrial) BLAST database (optional)
  • --dbcont, -c: Contaminant BLAST database (optional)
  • --dist, -d: Maximum distance between two intervals for them to be merged (default: 50)
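
For readers wiring the tool into their own scripts, the options above could be declared with argparse roughly as follows. This is only a sketch based on the argument list in this README: the help strings, the integer type of --dist, and the function name build_parser are assumptions, not the actual code in remower.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="remower.py")
    parser.add_argument("-g", "--genomefile", required=True,
                        help="input genome FASTA file")
    parser.add_argument("-v", "--dbvec", default=None,
                        help="VecScreen database")
    parser.add_argument("-m", "--dbmito", default=None,
                        help="organelle (mitochondrial) BLAST database")
    parser.add_argument("-c", "--dbcont", default=None,
                        help="contaminant BLAST database")
    # Assumption: the merge distance is an integer.
    parser.add_argument("-d", "--dist", type=int, default=50,
                        help="maximum distance for merging two intervals")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-g", "genome.fa", "-c", "contam_db"])
print(args.genomefile, args.dbcont, args.dbvec, args.dist)
```

Databases that are not supplied stay None in this sketch, so a caller could skip the corresponding screen.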

Example:

python3 bin/remower.py -g my_genome.fa -c contam_in_euks.fa -v adaptors_for_screening_euks.fa -m mito.nt

Module Overview

  • remoVecSec/removeVec.py: Vector/adaptor detection and removal
  • remoVecSec/removeMito.py: Mitochondrial contamination detection
  • remoVecSec/removeContaminant.py: General contaminant detection (E. coli, phage, etc.)
  • remoVecSec/removeUtils.py: Utility functions for interval merging and FASTA correction

Each module can be used independently or as part of the main workflow.

How It Works

  1. Contaminant Screening:
    • Uses BLAST or VecScreen to identify contaminant regions in the genome.
  2. Interval Merging:
    • Overlapping or adjacent contaminant intervals are merged for robust trimming.
  3. Sequence Correction:
    • Contaminated or flagged regions are trimmed or removed from the genome FASTA.
  4. Reporting:
    • Outputs cleaned sequences and a summary of modifications.
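
The merging and correction steps can be illustrated with a minimal sketch. It assumes 0-based, half-open intervals and simply cuts the merged regions out of the sequence; the function names merge_intervals and excise are hypothetical, and the real removeUtils.py may differ (for example by trimming only at sequence ends or by splitting sequences).

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def merge_intervals(intervals: List[Interval], dist: int = 50) -> List[Interval]:
    """Merge intervals that overlap or lie within `dist` bases of each other."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= dist:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def excise(sequence: str, intervals: List[Interval]) -> str:
    """Return `sequence` without the given sorted, non-overlapping intervals."""
    kept = []
    cursor = 0
    for start, end in intervals:
        kept.append(sequence[cursor:start])
        cursor = end
    kept.append(sequence[cursor:])
    return "".join(kept)


hits = [(100, 150), (10, 40), (60, 80), (400, 420)]
regions = merge_intervals(hits, dist=50)
print(regions)
```

With dist=50, the first three hits collapse into one region because each gap is 20 bases, while the last hit stays separate.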

Requirements

  • Python 3.7+
  • Biopython
  • NCBI BLAST+ suite (blastn, makeblastdb)
  • VecScreen (for adaptor screening)

License

This project is licensed under the MIT License. See the LICENSE file for details.


Appendix: NCBI Contamination Screening Protocols

The following sections summarize the NCBI protocols and command-line examples for contaminant, adaptor, mitochondrial, rRNA, and foreign chromosome screening. For full details, see the original NCBI documentation and the remainder of this README.

Common contaminant screen

Databases

  1. File to screen for the common contaminants in eukaryotic sequences:

        ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/contam_in_euks.fa.gz
    

Contains the cloning artifacts that are likely to show up as contaminants across all eukaryotic species: vector sequences, the E. coli genome, phage genomes, and bacterial insertion sequences and transposons.

  2. File to screen for the common contaminants in prokaryotic sequences:

        ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/contam_in_prok.fa
    

Contains phiX174.

Compressed (.gz) files need to be unzipped, and the resulting FASTA sequence files formatted as BLAST databases using the makeblastdb program.
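
The preparation step can be scripted. The sketch below decompresses a downloaded file with the standard library and builds a makeblastdb call for a nucleotide database; the helper names gunzip and makeblastdb_command are hypothetical, and the command is only printed here because running it requires BLAST+ on the PATH.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def gunzip(src: Path) -> Path:
    """Decompress `src` (ending in .gz) next to itself and return the new path."""
    dest = src.with_suffix("")  # contam_in_euks.fa.gz -> contam_in_euks.fa
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def makeblastdb_command(fasta: Path, db_name: str) -> List[str]:
    """Build the makeblastdb call for a nucleotide database."""
    return ["makeblastdb", "-in", str(fasta), "-dbtype", "nucl", "-out", db_name]


cmd = makeblastdb_command(Path("contam_in_euks.fa"), "contam_in_euks")
print(" ".join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```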

Programs

blastn and makeblastdb are part of the BLAST+ package, which can be installed by following the instructions in the BLAST help documents.

"BLAST Command Line Applications User Manual":

        https://www.ncbi.nlm.nih.gov/books/NBK279671/

"Standalone BLAST Setup for Windows PC":

        https://www.ncbi.nlm.nih.gov/books/NBK52637/

"Standalone BLAST Setup for Unix":

        https://www.ncbi.nlm.nih.gov/books/NBK52640/

Execution

A BLAST search is run against either the contam_in_euks or contam_in_prok database, depending on the origin of the input sequences. The common contaminant BLAST results are filtered for hits over various length and percent identity cut-offs.

Command line:

  1. for screening eukaryotic sequences:

        blastn -query _input_fasta_sequences_ -db contam_in_euks -task megablast -word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -dust yes -evalue 0.0001 -perc_identity 90.0 -outfmt "7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" | awk '($3>=98.0 && $4>=50)||($3>=94.0 && $4>=100)||($3>=90.0 && $4>=200)'
    

OR with an intermediate file, these 2 commands:

        blastn -query _input_fasta_sequences_ -db contam_in_euks -task megablast -word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -dust yes -evalue 0.0001 -perc_identity 90.0 -outfmt "7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -out _out_file_

        awk '($3>=98.0 && $4>=50)||($3>=94.0 && $4>=100)||($3>=90.0 && $4>=200)' _out_file_

  2. for screening prokaryotic sequences:

        blastn -query _input_fasta_sequences_ -db contam_in_prok -task megablast -word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -dust yes -evalue 0.0001 -perc_identity 90.0 -outfmt "7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" | awk '($3>=98.0 && $4>=50)||($3>=94.0 && $4>=100)||($3>=90.0 && $4>=200)'
    

OR with an intermediate file, these 2 commands:

        blastn -query _input_fasta_sequences_ -db contam_in_prok -task megablast -word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -dust yes -evalue 0.0001 -perc_identity 90.0 -outfmt "7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -out _out_file_

        awk '($3>=98.0 && $4>=50)||($3>=94.0 && $4>=100)||($3>=90.0 && $4>=200)' _out_file_
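
The awk filter encodes three tiers: the shorter the hit, the higher the identity it needs. For pipelines that post-process BLAST output in Python, the same cutoffs might be written as below. This is a sketch, not code from remoVecSec: the function names are hypothetical, and it assumes tab-separated lines in the column order given by the -outfmt string above (percent identity in column 3, alignment length in column 4). It skips comment lines explicitly.

```python
from typing import Iterable, Iterator, List


def passes_contaminant_cutoffs(pident: float, length: int) -> bool:
    """Same tiers as the awk filter: shorter hits need higher identity."""
    return (
        (pident >= 98.0 and length >= 50)
        or (pident >= 94.0 and length >= 100)
        or (pident >= 90.0 and length >= 200)
    )


def filter_hits(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the tab-separated fields of hits that pass the cutoffs."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the comment lines of -outfmt 7
        fields = line.rstrip("\n").split("\t")
        if passes_contaminant_cutoffs(float(fields[2]), int(fields[3])):
            yield fields


# Made-up example records, for illustration only.
example = [
    "# BLASTN output (comment line)",
    "contig1\tvec1\t99.0\t60\t0\t0\t1\t60\t1\t60\t1e-20\t100",
    "contig2\tvec1\t95.0\t60\t3\t0\t1\t60\t1\t60\t1e-10\t80",
    "contig3\tvec2\t91.0\t250\t20\t2\t5\t254\t1\t250\t1e-50\t300",
]
kept = [fields[0] for fields in filter_hits(example)]
print(kept)
```

In the example, contig2 is dropped because a 60-base hit at 95% identity satisfies none of the three tiers.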

Adaptor screen

VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/) is run against either the adaptors_for_screening_euks.fa database or adaptors_for_screening_proks.fa database, depending on the origin of the input sequences. Hits are filtered to retain only those matches that VecScreen classifies as "Strong" or "Moderate" (see: https://www.ncbi.nlm.nih.gov/tools/vecscreen/about/#Categories).

Databases

The adaptors_for_screening databases are available here:

        ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/adaptors_for_screening_euks.fa

        ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/adaptors_for_screening_proks.fa

These FASTA sequence files need to be formatted as BLAST databases using the makeblastdb program.

Programs

The VecScreen standalone program is available here:

        ftp://ftp.ncbi.nlm.nih.gov/blast/demo/vecscreen

The script to filter the VecScreen results is here:

        ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/VSlistTo1HitPerLine.awk

Execution

Command line:

  1. for screening eukaryotic sequences:

        vecscreen -d adaptors_for_screening_euks.fa -f3 -i _input_fasta_sequences_ -o _vs_output_file_
    
  2. for screening prokaryotic sequences:

        vecscreen -d adaptors_for_screening_proks.fa -f3 -i _input_fasta_sequences_ -o _vs_output_file_
    

Filter out the "Weak" and "Suspect Origin" hits:

        VSlistTo1HitPerLine.awk suspect=0 weak=0 _vs_output_file_ > _filtered_vs_output_file_

Mitochondrial genome screen

BLAST is used to screen the input sequences against a database of the mitochondrial genome sequences in the NCBI Reference Sequences (RefSeq) collection.

Database

        ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/mito.nt.gz

This file needs to be unzipped and the resulting FASTA sequence file formatted as a BLAST database using the makeblastdb program.

Programs

blastn and makeblastdb are part of the BLAST+ package (see above).

Execution

The BLAST hits to mitochondrial genomes are filtered to keep those with at least 98.6% identity that are at least 120 bases long.

        blastn -query _input_fasta_sequences_ -db mito.nt -task megablast -word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -dust yes -evalue 0.0001 -perc_identity 98.6 -soft_masking true -outfmt 7 | awk '$4>=120' > _filtered_mito_output_file_
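
Downstream code usually needs the hit coordinates rather than the raw table. The sketch below applies the 120-base length filter and groups the query coordinates by sequence. It assumes the default column order of -outfmt 7 (query ID in column 1, alignment length in column 4, query start and end in columns 7 and 8, 1-based and inclusive); the function name mito_intervals is hypothetical and not part of remoVecSec.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mito_intervals(lines: Iterable[str],
                   min_length: int = 120) -> Dict[str, List[Tuple[int, int]]]:
    """Group query coordinates of sufficiently long hits by query sequence."""
    by_query: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and -outfmt 7 comment lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[3]) < min_length:
            continue  # identity is already enforced by -perc_identity
        qstart, qend = int(fields[6]), int(fields[7])
        # Store coordinates in ascending order regardless of how they were reported.
        by_query[fields[0]].append((min(qstart, qend), max(qstart, qend)))
    return dict(by_query)


# Made-up example records, for illustration only.
example = [
    "# Fields: query acc.ver, subject acc.ver, ...",
    "contig7\tNC_000001.1\t99.2\t300\t2\t0\t1001\t1300\t50\t349\t1e-90\t540",
    "contig7\tNC_000001.1\t99.0\t80\t1\t0\t2000\t2079\t400\t479\t1e-20\t140",
    "contig9\tNC_000002.1\t100.0\t150\t0\t0\t650\t501\t1\t150\t1e-60\t278",
]
print(mito_intervals(example))
```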

Ribosomal RNA screen

Ribosomal RNA genes are the cause of many false positives because they include some segments that align to distantly related organisms. Segments that match rRNA genes are identified so that such segments are not reported as being foreign.

BLAST is used to screen the input sequences against a database of rRNA gene sequences.

Database


        ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/rrna.gz

This file needs to be unzipped and the resulting FASTA sequence file formatted as a BLAST database using the makeblastdb program.

Programs

blastn and makeblastdb are part of the BLAST+ package (see above).

Execution

The BLAST hits to rRNA genes are filtered for hits over 95% identity and at least 100 bases long.

        blastn -query _i