KIPEs
Knowledge-based Identification of Pathway Enzymes (KIPEs) performs an automatic annotation of the flavonoid biosynthesis steps in a new transcriptome or genome sequence assembly.
KIPEs (Knowledge-based Identification of Pathway Enzymes)
KIPEs is available on our webserver
Please get in touch if you need help running KIPEs on your own dataset: Boas Pucker (email)
Abstract
This tool enables the identification of candidate sequences in a collection of peptide sequences, transcript sequences, or in a genome sequence. An initial BLAST (BLASTp, tBLASTn) search is used to get putative sequences, which are then analysed in a global alignment with MAFFT and screened for the presence of conserved residues and conserved domains. As a proof of concept, this tool was applied to identify genes of the flavonoid biosynthesis.
Input options
Peptide sequence collection (result of an assembly annotation process)
If a collection of peptide sequences is provided, BLASTp is applied with a manually curated collection of bait sequences. Candidates are identified based on similarity cutoffs. Next, these sequences are subjected to a global alignment via MAFFT. The overall similarity of candidate and bait sequences is calculated. Additionally, the presence of conserved (functionally relevant) amino acid residues and conserved domains is inspected based on a reference sequence. If available, a peptide sequence collection should be provided instead of a transcript sequence set or a genome sequence. The computational costs of the analysis are substantially lower.
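The similarity-cutoff step can be illustrated with a minimal sketch. It assumes that the BLASTp results are available in the standard tabular format (-outfmt 6) with bait sequences as queries and candidates as subjects, and it treats the percent identity column as the similarity value. The function name filter_hits and these choices are assumptions for illustration and not necessarily how KIPEs handles its BLAST results internally.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6).
# Assumption: the intermediate BLAST results use this default layout.
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, simcut=40.0):
    """Return the IDs of subjects with at least one hit at or above simcut (% identity)."""
    candidates = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip empty lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        if float(hit["pident"]) >= simcut:
            candidates.add(hit["sseqid"])
    return candidates


# Two made-up hits: one above and one below the 40 % cutoff.
example = io.StringIO(
    "AtCHS\tcand1\t85.2\t390\t58\t0\t1\t390\t1\t390\t0.0\t700\n"
    "AtCHS\tcand2\t31.0\t120\t80\t3\t5\t124\t10\t129\t1e-5\t55.1\n"
)
print(sorted(filter_hits(example)))  # ['cand1']
```

The default of 40.0 mirrors the --simcut default listed in the usage section; the --scoreratio filter is not covered by this sketch.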
Transcript sequences (transcriptome assembly)
If a collection of transcript sequences is provided, putative open reading frames are identified in the first step. All putative peptide sequences encoded in these transcripts are considered if their length exceeds a certain cutoff (e.g. 50 amino acids). The resulting peptides are subjected to the analysis described above.
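The extraction of putative peptides from a transcript can be sketched as a six-frame translation. This is a simplified illustration and not the KIPEs implementation: it assumes the standard genetic code, treats every stop-to-stop segment as a putative peptide without requiring a start codon, and keeps segments of at least min_len amino acids. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumption: no alternative codon tables).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def putative_peptides(transcript, min_len=50):
    """Collect stop-to-stop peptide segments from all six reading frames."""
    seq = transcript.upper()
    peptides = []
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_len:
                    peptides.append(segment)
    return peptides


# Tiny toy transcript; a low cutoff is used only to make the example visible.
print(putative_peptides("ATGGCTGCTGCTGCTGCTTAA", min_len=6))
```

Because short random segments occur in every frame, a realistic cutoff such as the 50 amino acids mentioned above removes most spurious peptides.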
Genome sequence (genome assembly)
A tBLASTn search is applied to identify regions in the genome sequence that might encode the desired peptide. As BLAST hits only indicate exons and might be fragmented, BLAST hits are grouped into putative genes. Fragments of a putative gene are extended to account for incomplete hits at exon borders. This includes the detection of splice sites (currently only canonical GT-AG combinations). If full-length peptide sequences are provided as query, the stop codon should be indicated by a * at the end of the peptide sequence. We recommend running a proper gene prediction tool like AUGUSTUS or GeMoMa if possible. These dedicated tools will outperform the very basic gene structure identification methods implemented in KIPEs in most cases.
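The grouping of fragmented hits into putative genes can be sketched as follows. This is a deliberately simple greedy approach under stated assumptions: hits are reduced to contig, start, and end (with start <= end), strand and query identity are ignored, and a hit joins the current group as long as the group would not span more than max_gene_size bases. The Hit type and group_hits function are hypothetical; the actual grouping logic in KIPEs may differ.

```python
from collections import namedtuple

# Minimal hit representation (assumption: coordinates already normalised so start <= end).
Hit = namedtuple("Hit", "contig start end")


def group_hits(hits, max_gene_size=5000):
    """Greedily group hits on the same contig that fit within max_gene_size."""
    groups = []
    for hit in sorted(hits, key=lambda h: (h.contig, h.start)):
        current = groups[-1] if groups else None
        if (current
                and current[0].contig == hit.contig
                and hit.end - current[0].start <= max_gene_size):
            current.append(hit)  # still within the span of one putative gene
        else:
            groups.append([hit])  # start a new putative gene
    return groups


hits = [
    Hit("chr1", 100, 400),
    Hit("chr1", 900, 1200),      # close to the first hit: same putative gene
    Hit("chr1", 20000, 20300),   # too far away: separate gene
    Hit("chr2", 150, 450),       # different contig: separate gene
]
for group in group_hits(hits):
    print(group[0].contig, group[0].start, group[-1].end, len(group))
```

The default of 5000 mirrors the --genesize default listed in the usage section.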
Result files
HTML summary
The final output of KIPEs is an HTML document called 'SUMMARY.html'. This table shows the best candidates for all steps in the pathway. It is possible to specify the order of genes in a pathway using the --pathway option (see below for details). Previously described amino acid residues are checked in all candidate sequences and the results are summarized in this table. Mismatches of conserved residues are highlighted in red.
Similarity matrix
One similarity matrix is generated per bait sequence file. The similarity of all candidate sequences against all bait sequences is displayed. Although this table is generated as a text file, it is possible to open these files as tables (e.g. with Calc).
Conserved residues
The presence of all conserved residues is analysed in all candidate sequences. Presence or absence is indicated in a table comprising all sequences and all residues. A summary of these results is displayed in an HTML file as described above.
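Checking a conserved residue requires translating a position in the reference sequence into a column of the global alignment, because gaps shift the coordinates. A minimal sketch, assuming both sequences are taken from the same alignment and positions are 1-based with respect to the ungapped reference (function names are hypothetical):

```python
def alignment_column(aligned_ref, position):
    """Map a 1-based residue position of the ungapped reference to a 0-based alignment column."""
    seen = 0
    for column, char in enumerate(aligned_ref):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position lies beyond the end of the reference sequence")


def check_residue(aligned_ref, aligned_cand, position, allowed):
    """Return True if the candidate carries one of the allowed amino acids at this position."""
    column = alignment_column(aligned_ref, position)
    return aligned_cand[column] in allowed


# Toy alignment: the candidate has an insertion after the second residue.
ref = "MK-RLT"
cand = "MKARQT"
print(check_residue(ref, cand, 3, {"R"}))  # True: R is conserved
print(check_residue(ref, cand, 4, {"L"}))  # False: candidate has Q instead of L
```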
Conserved regions
The output format of this analysis of conserved regions matches the output format of conserved residues. The percentage of identical amino acid residues in the domain is calculated for each candidate sequence.
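The percentage of identical residues in a domain can be sketched in the same alignment-based way. This example assumes that domain borders are given as 1-based, inclusive positions in the ungapped reference sequence and that a gap in the candidate counts as a mismatch; the function name domain_identity is hypothetical.

```python
def domain_identity(aligned_ref, aligned_cand, start, end):
    """Percentage of reference domain positions matched identically by the candidate."""
    position = 0
    matches = 0
    for ref_char, cand_char in zip(aligned_ref, aligned_cand):
        if ref_char == "-":
            continue  # insertion in the candidate: no reference position consumed
        position += 1
        if start <= position <= end and ref_char == cand_char:
            matches += 1
    return 100.0 * matches / (end - start + 1)


# Domain covers reference positions 3 to 5 (R, L, T); the candidate matches R and T.
print(round(domain_identity("MK-RLTG", "MKARQTG", 3, 5), 1))  # 66.7
```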
Installation
While some dependencies are required, this tool does not require an installation. Downloading and executing the script on a Linux system is sufficient. There is currently no support for other operating systems. Most required modules are included in the initial Python installation, but dendropy might not be available on all systems.
Python3 (sudo apt-get install python3.8). It is also possible to use other Python3 versions.
dendropy (sudo apt install python3-pip && python3 -m pip install git+https://github.com/jeetsukumaran/DendroPy.git)
MAFFT (sudo apt-get install -y mafft)
BLAST (sudo apt-get install ncbi-blast+) or HMMER (conda install -c bioconda hmmer)
FastTree (sudo apt-get install -y fasttree) and/or RAxML-NG (precompiled binaries recommended)
Usage
General recommendation
Full paths should be used to specify input and output files and folders. Sequence names should not contain white space characters like spaces and TABs. Underscores can be used to replace spaces.
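Sequence names can be cleaned before running KIPEs. The following sketch replaces every run of white space in FASTA header lines with a single underscore. The function names are hypothetical, and the approach assumes that the full header line should become the sequence name.

```python
import re


def sanitize_header(line):
    """Replace white space (spaces, TABs) in a FASTA header line with underscores."""
    return ">" + re.sub(r"\s+", "_", line[1:].strip())


def sanitize_fasta(in_path, out_path):
    """Copy a FASTA file while cleaning all header lines; sequence lines stay unchanged."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(sanitize_header(line) + "\n")
            else:
                dst.write(line)


print(sanitize_header(">AtCHS chalcone synthase\tTAIR\n"))  # >AtCHS_chalcone_synthase_TAIR
```

Many tools treat only the text before the first space as the sequence ID, so trimming the header at the first white space character would be an alternative design.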
Running the main function (KIPEs)
Usage:
python3 KIPEs3.py --baits <DIR> --out <DIR> --subject <FILE>
or
python3 KIPEs3.py --baits <DIR> --out <DIR> --subjectdir <DIR>
Mandatory:
Bait sequences
--baits STR Directory with (multiple) FASTA files
Output directory
--out STR Output directory
Input sequences
--subject STR Multiple FASTA file with sequences to screen
--subjectdir STR Directory containing multiple FASTA files with sequences to screen
Optional:
--positions STR Directory with text files (one per step in pathway)
--seqtype STR Defines type of input sequence (pep|rna|dna)[pep]
--cpus INT Number of threads in BLAST runs [10]
--scoreratio FLOAT BLAST score ratio of self vs. input sequences [0.3]
--simcut FLOAT Minimal similarity of BLAST hits [40.0]
--checks STR Validation of input data (on|off)[on]
--genesize INT Maximal gene size (for tblastn hit grouping) [5000]
--minsim FLOAT Minimal similarity required in global alignment [0.4]
--minres FLOAT Minimal proportion of conserved residues [-1.0]
--minreg FLOAT Minimal proportion of conserved regions [-1.0]
--pathway STR Full path to text file with pathway enzyme names (default is alphabetical sorting)
--possibilities INT Maximal number of enzyme functions to consider per sequence [3]
--mafft STR Full path to MAFFT (if not in your $PATH)
--blastp STR Full path to the BLASTp binary (if not in your $PATH)
--tblastn STR Full path to the tBLASTn binary (if not in your $PATH)
--makeblastdb STR Full path to the makeblastdb binary (if not in your $PATH)
--fasttree STR Full path to the FastTree binary
--forester STR Activates the automatic construction of gene trees (on|off)[off]
--exp STR Gene expression file (activates heatmap construction)
--rcut FLOAT Minimal correlation cutoff [0.3]
--pcut FLOAT Maximal p-value cutoff [0.05]
--minexp INT Minimal expression per gene [30]
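A call with full paths can be assembled programmatically, for example when KIPEs is run from a wrapper script. The sketch below only builds and prints the command; the helper name build_kipes_command and the example file names are hypothetical, while the options are those listed above.

```python
import os
import shlex


def build_kipes_command(script, baits, out, subject, seqtype="pep", cpus=10):
    """Build a KIPEs command line in which all paths are converted to full paths."""
    return [
        "python3", os.path.abspath(script),
        "--baits", os.path.abspath(baits),
        "--out", os.path.abspath(out),
        "--subject", os.path.abspath(subject),
        "--seqtype", seqtype,
        "--cpus", str(cpus),
    ]


# Hypothetical file and folder names, relative to the current working directory.
command = build_kipes_command("KIPEs3.py", "baits", "kipes_out", "peptides.fasta")
print(shlex.join(command))
# The list could be passed to subprocess.run(command, check=True) to start the analysis.
```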
--baits is the full path to a folder containing (multiple) FASTA files. The filename needs to match the gene name. The extension should be '.fasta' or '.fa'.
--out is the full path to an output folder which will be created if necessary. All temporary and result files will be stored in this folder and subfolders therein.
--subject is the full path to an input multiple FASTA file. A collection of peptide (pep), transcript (rna), or genomic (dna) sequences can serve as input. The appropriate input data type needs to be specified via --seqtype (pep|rna|dna).
--subjectdir can be used to run KIPEs on multiple data sets (to analyse multiple species). All subject files in the provided folder are analysed consecutively. It is important that all data sets are of the same sequence type (--seqtype (pep|rna|dna)).
--positions (or --residues) is the full path to a folder containing text files matching the provided FASTA files. The filename needs to match the gene name. Example: CHS.fasta contains the bait sequences and CHS.txt contains information about relevant amino acid residues and domains. File extension should be '.txt' or '.res'. The header line starts with an exclamation mark followed by the reference sequence name. It is crucial that the name of this sequence is matched by one entry in the bait sequences FASTA file. Each of the following lines contains information about one important amino acid residue or a domain. The type of feature is indicated in the first column using R to specify residues or D to specify domains. Fields are separated by tab (not space!). The format of entries of residues and domains is slightly different as you can see in this example:
!AtCHS
R R 13 comment1
R Q,X 16 comment2
R R 17 comment3
D malonyl-CoA_binding_motif 313 329 comment4
Residues: Important residues have their amino acid in the second column (one letter code!) and the position in the third column. It is possible to specify multiple alternative amino acids for one position as indicated by the 'X' in the second entry. Columns following the third column can be used for user comments and are ignored by KIPEs.
Domains: The domain entry indicator (D) is followed by the name of the domain in the first c
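The file format can be illustrated with a small parser. This is a sketch based only on the example shown above: residue lines are read as type, amino acid(s), position, and domain lines as type, name, start, end. The column order for domains is an assumption taken from the example line, and the function name parse_positions is hypothetical.

```python
def parse_positions(text):
    """Parse the content of a positions file into reference name, residues, and domains."""
    reference = None
    residues = []  # list of (allowed amino acids, position)
    domains = []   # list of (name, start, end)
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("!"):
            reference = line[1:].strip()  # header line: name of the reference sequence
            continue
        fields = line.split("\t")  # fields are separated by TABs, not spaces
        if fields[0] == "R":
            residues.append((fields[1].split(","), int(fields[2])))
        elif fields[0] == "D":
            # Assumption from the example: name, start, end follow the D indicator.
            domains.append((fields[1], int(fields[2]), int(fields[3])))
    return reference, residues, domains


example = (
    "!AtCHS\n"
    "R\tR\t13\tcomment1\n"
    "R\tQ,X\t16\tcomment2\n"
    "R\tR\t17\tcomment3\n"
    "D\tmalonyl-CoA_binding_motif\t313\t329\tcomment4\n"
)
reference, residues, domains = parse_positions(example)
print(reference, residues, domains)
```

Trailing comment columns are ignored here, matching the description of residue entries above.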
