KIPEs

Knowledge-based Identification of Pathway Enzymes (KIPEs) performs an automatic annotation of the flavonoid biosynthesis steps in a new transcriptome or genome sequence assembly.

KIPEs (Knowledge-based Identification of Pathway Enzymes)

KIPEs is available on our webserver

Please get in touch if you need help running KIPEs on your own dataset: Boas Pucker (email)

Abstract

This tool enables the identification of candidate sequences in a collection of peptide sequences, transcript sequences, or in a genome sequence. An initial BLAST (BLASTp, tBLASTn) search is used to get putative sequences, which are then analysed in a global alignment with MAFFT and screened for the presence of conserved residues and conserved domains. As a proof of concept, this tool was applied to the identification of genes in the flavonoid biosynthesis.

Input options

Peptide sequence collection (result of an assembly annotation process)

If a collection of peptide sequences is provided, BLASTp is applied with a manually curated collection of bait sequences. Candidates are identified based on similarity cutoffs. Next, these sequences are subjected to a global alignment via MAFFT. The overall similarity of candidate and bait sequences is calculated. Additionally, the presence of conserved (functionally relevant) amino acid residues and conserved domains is inspected based on a reference sequence. If available, a peptide sequence collection should be provided instead of a transcript sequence set or a genome sequence, because the computational costs of the analysis are substantially lower.
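The similarity calculation on a global alignment can be illustrated with a minimal sketch. It assumes that similarity means percent identity over aligned columns and that columns with a gap in only one sequence count as mismatches. Both are assumptions for illustration; the exact definition used by KIPEs is not described here, and the function name is hypothetical.

```python
def pairwise_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two rows taken from the same alignment.

    Assumptions (not taken from KIPEs): columns where both rows have a gap
    are skipped, and a gap opposite a residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # column is empty in both rows
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Two rows as they might appear in a MAFFT alignment (made-up sequences).
print(pairwise_identity("MKT-A", "MKTLA"))
```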

Transcript sequences (transcriptome assembly)

If a collection of transcript sequences is provided, putative open reading frames are identified in the first step. All putative peptide sequences encoded in these transcripts are considered if their length exceeds a certain cutoff (e.g. 50 amino acids). The resulting peptides are subjected to the analysis described above.
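The idea of collecting putative peptides from a transcript can be sketched as follows. This is not the KIPEs implementation: it assumes the standard genetic code, considers all six reading frames, and treats every stretch between two stop codons as a putative peptide without requiring a start codon. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate in frame 0; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def putative_peptides(transcript: str, cutoff: int = 50) -> list:
    """Return all stop-to-stop segments longer than `cutoff` amino acids."""
    transcript = transcript.upper().replace("U", "T")
    peptides = []
    for strand in (transcript, reverse_complement(transcript)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) > cutoff:
                    peptides.append(segment)
    return peptides
```

Because no start codon is required in this sketch, a single transcript usually yields several overlapping candidates from different frames; the downstream BLAST and alignment filters would be expected to discard the spurious ones.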

Genome sequence (genome assembly)

A tBLASTn search is applied to identify regions in the genome sequence that might encode the desired peptide. As BLAST hits only indicate exons and might be fragmented, BLAST hits are grouped into putative genes. Fragments of a putative gene are extended to account for incomplete hits at exon borders. This includes the detection of splice sites (currently only canonical GT-AG combinations). If full-length peptide sequences are provided as query, the stop codon should be indicated by a * at the end of the peptide sequence. We recommend running a proper gene prediction tool like AUGUSTUS or GeMoMa if possible. These dedicated tools will outperform the very basic gene structure identification methods implemented in KIPEs in most cases.
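Grouping fragmented hits into putative genes can be pictured with the following minimal sketch. It assumes that hits on the same contig and strand belong to one gene as long as the whole group spans no more than a maximal gene size (compare the --genesize option). The actual grouping rules in KIPEs may differ, and the Hit type and function name are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    contig: str
    start: int   # 1-based start on the contig
    end: int     # inclusive end on the contig
    strand: str  # "+" or "-"


def group_hits(hits, max_gene_size=5000):
    """Group hits per contig and strand into putative genes.

    Assumption for illustration: a hit joins the current group if the span
    from the first hit's start to this hit's end fits into max_gene_size.
    """
    groups = []
    for hit in sorted(hits, key=lambda h: (h.contig, h.strand, h.start)):
        last = groups[-1] if groups else None
        if (last is not None
                and last[0].contig == hit.contig
                and last[0].strand == hit.strand
                and hit.end - last[0].start <= max_gene_size):
            last.append(hit)
        else:
            groups.append([hit])
    return groups


hits = [Hit("c1", 100, 400, "+"), Hit("c1", 900, 1200, "+"),
        Hit("c1", 9000, 9300, "+"), Hit("c2", 50, 300, "+")]
for gene in group_hits(hits):
    print(gene[0].contig, gene[0].start, gene[-1].end, len(gene))
```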

Result files

HTML summary

The final output of KIPEs is an HTML document called 'SUMMARY.html'. This table shows the best candidates for all steps in the pathway. It is possible to specify the order of genes in a pathway using the --pathway option (see below for details). Previously described amino acid residues are checked in all candidate sequences and the results are summarized in this table. Mismatches of conserved residues are highlighted in red.

Similarity matrix

One similarity matrix is generated per bait sequence file. The similarity of all candidate sequences against all bait sequences is displayed. Although each matrix is written as a text file, it can be opened as a table (e.g. with Calc).

Conserved residues

The presence of all conserved residues is analysed in all candidate sequences. Presence/absence is indicated in a table comprising all sequences and all residues. A summary of these results is displayed in an HTML file as described above.
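A residue check of this kind requires mapping a position in the ungapped reference sequence to a column of the alignment. The sketch below shows one way to do this, assuming 1-based residue positions and an aligned reference/candidate pair. It treats every listed letter literally and does not give 'X' any wildcard meaning. The function names are hypothetical and this is not the KIPEs code.

```python
def column_of_position(aligned_ref: str, position: int) -> int:
    """Return the 0-based alignment column of a 1-based reference position."""
    seen = 0
    for column, char in enumerate(aligned_ref):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"position {position} is beyond the reference sequence")


def check_residues(aligned_ref: str, aligned_candidate: str, residues):
    """Check each (allowed_amino_acids, position) pair in the candidate.

    Returns {position: (observed_residue, is_conserved)}.
    """
    results = {}
    for allowed, position in residues:
        column = column_of_position(aligned_ref, position)
        observed = aligned_candidate[column]
        results[position] = (observed, observed in allowed)
    return results


# Made-up alignment rows: the gap shifts reference position 3 to column 3.
print(check_residues("MK-RQ", "MKARE", [({"R"}, 3), ({"Q", "X"}, 4)]))
```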

Conserved regions

The output format of this analysis of conserved regions matches the output format of conserved residues. The percentage of identical amino acid residues in the domain is calculated for each candidate sequence.
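The percentage for a conserved region can be sketched in the same spirit. The example assumes that the domain is given as 1-based, inclusive start and end positions in the ungapped reference sequence, and that a gap in the candidate counts as a mismatch. Both are assumptions, and the function name is hypothetical.

```python
def domain_identity(aligned_ref: str, aligned_candidate: str,
                    start: int, end: int) -> float:
    """Percent of domain positions where the candidate matches the reference."""
    # Alignment columns that hold a reference residue, in reference order.
    ref_columns = [col for col, char in enumerate(aligned_ref) if char != "-"]
    domain_columns = ref_columns[start - 1:end]
    if not domain_columns:
        raise ValueError("domain lies outside the reference sequence")
    matches = sum(aligned_ref[col] == aligned_candidate[col]
                  for col in domain_columns)
    return 100.0 * matches / len(domain_columns)


# Made-up rows; the domain covers reference positions 3 to 6 (R, Q, G, T).
print(domain_identity("MK-RQGT", "MKARQAT", 3, 6))
```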

Installation

While some dependencies are required, this tool does not require an installation. Downloading and executing the script on a Linux system is sufficient. There is currently no support for other operating systems. Most required modules are included in the initial Python installation, but dendropy might not be available on all systems.

Python3 (sudo apt-get install python3.8). It is also possible to use other Python3 versions.

dendropy (sudo apt install python3-pip && python3 -m pip install git+https://github.com/jeetsukumaran/DendroPy.git)

MAFFT (sudo apt-get install -y mafft)

BLAST (sudo apt-get install ncbi-blast+) or HMMER (conda install -c bioconda hmmer)

FastTree (sudo apt-get install -y fasttree) and/or RAxML-NG (precompiled binaries recommended)

Usage

General recommendation

Full paths should be used to specify input and output files and folders. Sequence names should not contain white space characters like spaces and TABs. Underscores can be used to replace spaces.
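If sequence names in a FASTA file contain spaces or TABs, they can be cleaned before running KIPEs. The following helper is a minimal sketch and not part of KIPEs; the function names are hypothetical. Note that it merges the whole header line, including any description after the first space, into a single name.

```python
import re


def clean_name(header_line: str) -> str:
    """Replace each run of whitespace in a FASTA header with an underscore."""
    name = header_line[1:].strip()
    return ">" + re.sub(r"\s+", "_", name)


def clean_fasta_names(in_path: str, out_path: str) -> None:
    """Copy a FASTA file and rewrite only the header lines."""
    with open(in_path) as source, open(out_path, "w") as target:
        for line in source:
            if line.startswith(">"):
                target.write(clean_name(line) + "\n")
            else:
                target.write(line)


print(clean_name(">seq 1\tsample A\n"))
```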

Running the main function (KIPEs)

Usage:
  python3 KIPEs3.py --baits <DIR> --out <DIR> --subject <FILE>
  or
  python3 KIPEs3.py --baits <DIR> --out <DIR> --subjectdir <DIR>

Mandatory:
  Bait sequences
  --baits          STR    Directory with (multiple) FASTA files
  
  Output directory
  --out            STR    Output directory

  Input sequences
  --subject        STR    Multiple FASTA file with sequences to screen
  --subjectdir     STR    Directory containing multiple FASTA files with sequences to screen

Optional:
  --positions      STR    Directory with text files (one per step in pathway)
  --seqtype        STR    Defines type of input sequence (pep|rna|dna)[pep]
 
  --cpus           INT    Number of threads in BLAST runs [10]
  --scoreratio     FLOAT  BLAST score ratio of self vs. input sequences [0.3]
  --simcut         FLOAT  Minimal similarity of BLAST hits [40.0]
  --checks         STR    Validation of input data (on|off)[on]
   
  --genesize       INT    Maximal gene size (for tblastn hit grouping) [5000]
  --minsim         FLOAT  Minimal similarity required in global alignment [0.4]
  --minres         FLOAT  Minimal proportion of conserved residues [-1.0]
  --minreg         FLOAT  Minimal proportion of conserved regions [-1.0]
  --pathway        STR    Full path to text file with pathway enzyme names (default is alphabetical sorting)
  --possibilities  INT    Maximal number of enzyme functions to consider per sequence [3]
   
  --mafft          STR    Full path to MAFFT (if not in your $PATH)
  --blastp         STR    Full path to the BLASTp binary (if not in your $PATH)
  --tblastn        STR    Full path to the tBLASTn binary (if not in your $PATH)
  --makeblastdb    STR    Full path to the makeblastdb binary (if not in your $PATH)
  
  --fasttree       STR    Full path to the FastTree binary
  
  --forester       STR    Activates the automatic construction of gene trees (on|off)[off]
  
  --exp            STR    Gene expression file (activates heatmap construction)
  --rcut           FLOAT  Minimal correlation cutoff [0.3]
  --pcut           FLOAT  Maximal p-value cutoff [0.05]
  --minexp         INT    Minimal expression per gene [30]

--baits is the full path to a folder containing (multiple) FASTA files. The filename needs to match the gene name. Extension should be '.fasta' or '.fa'.

--out is the full path to an output folder which will be created if necessary. All temporary and result files will be stored in this folder and subfolders therein.

--subject is the full path to an input multiple FASTA file. A collection of peptide (pep), transcript (rna), or genomic (dna) sequences can serve as input. The appropriate input data type needs to be specified via --seqtype (pep|rna|dna).

--subjectdir can be used to run KIPEs on multiple data sets (to analyse multiple species). All subject files in the provided folder are analysed consecutively. It is important that all data sets are of the same sequence type (--seqtype (pep|rna|dna)).
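For scripted runs, the command line can be assembled in Python. The sketch below only uses options listed in the usage text above; the helper name and all paths are hypothetical placeholders. It converts paths to absolute paths, following the general recommendation to use full paths.

```python
import os
import subprocess


def build_kipes_command(script, baits, out, subject, seqtype="pep",
                        positions=None):
    """Assemble a KIPEs3.py call for a single subject file."""
    if seqtype not in {"pep", "rna", "dna"}:
        raise ValueError("seqtype must be one of pep, rna, dna")
    command = ["python3", os.path.abspath(script),
               "--baits", os.path.abspath(baits),
               "--out", os.path.abspath(out),
               "--subject", os.path.abspath(subject),
               "--seqtype", seqtype]
    if positions is not None:
        command += ["--positions", os.path.abspath(positions)]
    return command


# Placeholder paths; replace them with real locations before running.
command = build_kipes_command("/opt/KIPEs/KIPEs3.py", "/data/baits",
                              "/data/kipes_out", "/data/species.pep.fasta")
print(" ".join(command))
# subprocess.run(command, check=True)  # uncomment to actually start the run
```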

--positions (or --residues) is the full path to a folder containing text files matching the provided FASTA files. The filename needs to match the gene name. Example: CHS.fasta contains the bait sequences and CHS.txt contains information about relevant amino acid residues and domains. File extension should be '.txt' or '.res'. The header line starts with an exclamation mark followed by the reference sequence name. It is crucial that the name of this sequence is matched by one entry in the bait sequences FASTA file. Each of the following lines contains information about one important amino acid residue or a domain. The type of feature is indicated in the first column using R to specify residues or D to specify domains. Fields are separated by tab (not space!). The format of entries of residues and domains is slightly different as you can see in this example:

!AtCHS
R	R	13	comment1
R	Q,X	16	comment2
R	R	17	comment3
D	malonyl-CoA_binding_motif	313	329	comment4

Residues: Important residues have their amino acid in the second column (one letter code!) and the position in the third column. It is possible to specify multiple alternative amino acids for one position as indicated by the 'X' in the second entry. Columns following the third column can be used for user comments and are ignored by KIPEs.
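To check a positions file before a run, it can be read with a few lines of Python. This is a minimal sketch and not the parser used by KIPEs. The layout of domain lines (name, start, end) is inferred from the example above and is an assumption; the function name is hypothetical.

```python
def parse_positions(text: str):
    """Parse a positions file into (reference, residues, domains).

    residues: list of (set of allowed amino acids, position)
    domains:  list of (name, start, end), layout inferred from the example
    """
    reference = None
    residues = []
    domains = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("!"):
            reference = line[1:].strip()  # header: reference sequence name
            continue
        fields = line.split("\t")  # fields are TAB-separated, not spaces
        if fields[0] == "R":
            residues.append((set(fields[1].split(",")), int(fields[2])))
        elif fields[0] == "D":
            domains.append((fields[1], int(fields[2]), int(fields[3])))
    return reference, residues, domains


example = ("!AtCHS\n"
           "R\tR\t13\tcomment1\n"
           "R\tQ,X\t16\tcomment2\n"
           "D\tmalonyl-CoA_binding_motif\t313\t329\tcomment4\n")
print(parse_positions(example))
```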

Domains: The domain entry indicator (D) is followed by the name of the domain in the first c
