Metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics


MetaEuk is a modular toolkit designed for large-scale gene discovery and annotation in eukaryotic metagenomic contigs. MetaEuk combines the fast and sensitive homology search capabilities of MMseqs2 with a dynamic programming procedure to recover optimal exon sets. It reduces redundancies in multiple discoveries of the same gene and resolves conflicting gene predictions on the same strand. MetaEuk is GPLv3-licensed open source software that is implemented in C++ and available for Linux and macOS. The software is designed to run efficiently on multiple cores.

Publication

Levy Karin E, Mirdita M and Soeding J. MetaEuk – sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics. Microbiome. 2020; 8:48

<p align="center"><img src="https://github.com/soedinglab/metaeuk/blob/master/imgs/MetaEuk.png" height="250"/></p>

Installation

MetaEuk can be compiled from source (see below) or downloaded as a statically compiled binary. It requires a 64-bit system (check with uname -a | grep x86_64) with at least the SSE4.1 instruction set (check by executing cat /proc/cpuinfo | grep sse4_1 on Linux or sysctl -a | grep machdep.cpu.features | grep SSE4.1 on macOS).

# install via conda
conda install -c conda-forge -c bioconda metaeuk
# static Linux AVX2 build
wget https://mmseqs.com/metaeuk/metaeuk-linux-avx2.tar.gz; tar xzvf metaeuk-linux-avx2.tar.gz; export PATH=$(pwd)/metaeuk/bin/:$PATH
# static Linux SSE4.1 build
wget https://mmseqs.com/metaeuk/metaeuk-linux-sse41.tar.gz; tar xzvf metaeuk-linux-sse41.tar.gz; export PATH=$(pwd)/metaeuk/bin/:$PATH
# static macOS build (universal binary with SSE4.1/AVX2/M1 NEON)
wget https://mmseqs.com/metaeuk/metaeuk-osx-universal.tar.gz; tar xzvf metaeuk-osx-universal.tar.gz; export PATH=$(pwd)/metaeuk/bin/:$PATH

Precompiled binaries for other architectures (ARM64, PPC64LE) and very old AMD/Intel CPUs (SSE2 only) are available at https://mmseqs.com/metaeuk.
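
The CPU check and the choice between the two static Linux builds can be combined in a small helper. The following is a minimal sketch for Linux only, assuming the flag names avx2 and sse4_1 as they appear in /proc/cpuinfo. The function name pick_metaeuk_build is hypothetical and not part of MetaEuk, and the script only prints the download command instead of running it.

```sh
#!/bin/sh
# Hypothetical helper: choose a static Linux build from a list of CPU flags.
# Argument: a space-separated flag string, e.g. the "flags" line of /proc/cpuinfo.
pick_metaeuk_build() {
    case " $1 " in
        *" avx2 "*)   echo "metaeuk-linux-avx2.tar.gz" ;;
        *" sse4_1 "*) echo "metaeuk-linux-sse41.tar.gz" ;;
        *)            echo "unsupported" ;;
    esac
}

# On Linux, read the flags of the first CPU; elsewhere the string stays empty.
flags=""
if [ -r /proc/cpuinfo ]; then
    flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
fi

build=$(pick_metaeuk_build "$flags")
if [ "$build" = "unsupported" ]; then
    echo "No AVX2/SSE4.1 flag found; see https://mmseqs.com/metaeuk for other builds"
else
    # Only print the download command; run it yourself after checking it.
    echo "wget https://mmseqs.com/metaeuk/$build"
fi
```

The AVX2 build is preferred when both flags are present. On ARM64, PPC64LE or SSE2-only machines the sketch falls through to the pointer to the download page.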

Input

MetaEuk searches for eukaryotic protein-coding genes in contigs based on similarity to reference proteins or protein profiles. You can either run the easy-predict workflow directly on Fasta files, or first convert them to MMseqs2-formatted databases with the createdb command and then run specific MetaEuk modules on these databases. Read here about available reference databases. You can use contigs.fna and proteins.faa from the tests/two_contigs directory as a small toy example.

Terminology

A gene call is an optimal set of exons predicted based on similarity to a specific target (T) in a specific contig (C) and strand (S). Below, it is referred to as a TCS or as a call. After redundancy reduction (see details below), the representative TCS is referred to as a prediction.

Running MetaEuk

Main Modules:

  easy-predict      	Predict proteins from contigs (fasta/db) based on similarities to targets (fasta/db) and return a fasta & GFF
  predictexons      	Call optimal exon sets based on protein similarity
  reduceredundancy  	Cluster metaeuk calls which share an exon and select representative
  unitesetstofasta  	Create fasta output from optimal exon sets (and (1) a TSV map between headers and internal identifiers, (2) GFF summary)
  groupstoacc     	Create a TSV output from representative to calls
  taxtocontig     	Assign taxonomic labels to MetaEuk predictions and contigs by majority voting
  

Using MMseqs2 commands within MetaEuk:

MMseqs2 commands are available through MetaEuk and no additional MMseqs2 installation is required. For example, the MMseqs2 command mmseqs createdb can be replaced with metaeuk createdb, mmseqs databases with metaeuk databases, etc. Please see also the MMseqs2 Wiki for more info about MMseqs2 commands.
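
As an illustration, the toy inputs from tests/two_contigs could be converted to databases as sketched below. The output names contigsDB and referenceDB are placeholders chosen to match the commands in this README. The DRY_RUN switch and the run helper are illustrative additions, not MetaEuk features; by default the sketch only prints the commands.

```sh
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

# Illustrative helper: print a command, and run it unless in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# createdb is an MMseqs2 command that is called here through the metaeuk binary.
run metaeuk createdb tests/two_contigs/contigs.fna contigsDB
run metaeuk createdb tests/two_contigs/proteins.faa referenceDB
```

Run it with DRY_RUN=0 from the root of a MetaEuk checkout to actually create the two databases.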

Important parameters:

 --min-length        minimal number of codons in putative protein fragment
 -e                  maximal E-Value to retain a match between a putative protein fragment and a reference target 
 --metaeuk-eval      maximal combined E-Value to retain an optimal exon set
 --metaeuk-tcov      minimal length ratio of combined set to target 
 --exhaustive-search should be added if referenceDB is a profile database (called slice-search before version 4)
 --max-exon-sets     maximal number of exon sets on each contig and strand for a given target (from version 6)

easy-predict workflow:

This workflow combines the following MetaEuk modules into a single step: predictexons, reduceredundancy and unitesetstofasta (each of which is detailed below). Its inputs are contigs (either as a Fasta file or a previously created database) and targets (either as a Fasta file of protein sequences or a previously created database of proteins or protein profiles). It runs these modules and outputs the predictions in Fasta format, along with a GFF file.

metaeuk easy-predict contigsFasta/contigsDB proteinsFasta/referenceDB predsResults tempFolder

It will result in predsResults.fas (protein sequences), predsResults.codon.fas, predsResults.headersMap.tsv and predsResults.gff.
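
A quick way to try the workflow is to run it on the toy example and count the predicted proteins. The sketch below assumes it is started from the root of a MetaEuk checkout with metaeuk on the PATH; otherwise it does nothing. The helper count_fasta_records is a hypothetical name and simply counts Fasta header lines.

```sh
#!/bin/sh
# Count sequences in a Fasta file by counting header lines (those starting with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

contigs=tests/two_contigs/contigs.fna
proteins=tests/two_contigs/proteins.faa

# Run only if metaeuk is on PATH and the toy inputs exist (i.e. inside a MetaEuk checkout).
if command -v metaeuk >/dev/null 2>&1 && [ -f "$contigs" ] && [ -f "$proteins" ]; then
    metaeuk easy-predict "$contigs" "$proteins" predsResults tempFolder
    echo "predicted proteins: $(count_fasta_records predsResults.fas)"
else
    echo "metaeuk or the toy inputs were not found; nothing was run"
fi
```

The same count applied to predsResults.codon.fas should agree with the protein file, which makes for a simple sanity check of a finished run.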

Calling optimal exon sets:

This module will extract all putative protein fragments from each contig and strand, query them against the reference targets and use dynamic programming to retain for each T the optimal compatible exon set from each C & S (thus creating TCS calls).

metaeuk predictexons contigsDB referenceDB callsResultDB tempFolder --metaeuk-eval 0.0001 -e 100 --min-length 40

Since this step involves a search, it is the most time-consuming of all analysis steps. Upon completion, it outputs a database (contigs are keys), where each line contains information about a TCS and one of its exons (multi-exon TCSs span several lines).

OPTIONAL - calling of sub-optimal exon sets:

By default, MetaEuk calls a single, optimal compatible exon set from each C & S for each T. If you are interested in calling several matches to a certain T from each C & S (for example, to look for gene duplications), you can set --max-exon-sets to the number of sets to look for (available from version 6). A few important notes:

  • If --max-exon-sets > 1, a TCS is no longer guaranteed to be a unique identifier. Therefore, when parsing the output of such runs, it is recommended to use the TCS together with low_contig as the identifier (see details about the MetaEuk header).
  • If I run with --max-exon-sets > 1, am I guaranteed to get ALL the predictions I get when running --max-exon-sets 1? No! You will most likely see all of them, but this is not guaranteed because some complex cases can arise due to the redundancy reduction stage. You can see an example of such a case under tests/sub_opt/readme.txt.
  • Running with --max-exon-sets > 1 is mainly useful if your contigs are long enough to contain several genes (less common in metagenomic data).
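
The first note can be turned into a small parsing helper that builds the combined identifier and reports duplicates. This is only a sketch under stated assumptions: the header layout is not described in this section, so the code assumes '|'-separated header fields with target, contig and strand in fields 1 to 3 and the lowest contig coordinate in field 7, and it assumes that accessions themselves contain no '|'. The names tcs_low_keys, duplicated_keys and LOW_FIELD are hypothetical; check the header description of your MetaEuk version before relying on the field positions.

```sh
#!/bin/sh
# Assumed layout (not shown in this section): '|'-separated header fields, with
# target, contig and strand in fields 1-3 and the lowest contig coordinate in field 7.
# Adjust LOW_FIELD if the header of your MetaEuk version differs.
LOW_FIELD=${LOW_FIELD:-7}

# Print one "target|contig|strand|low" key per Fasta header read from the given file.
tcs_low_keys() {
    awk -F'|' -v low="$LOW_FIELD" '
        /^>/ {
            sub(/^>/, "", $1)              # drop the Fasta header marker
            print $1 "|" $2 "|" $3 "|" $low
        }' "$1"
}

# Report keys that occur more than once; an empty result means the keys are unique.
duplicated_keys() {
    tcs_low_keys "$1" | sort | uniq -d
}
```

For example, duplicated_keys predsResults.fas prints nothing when every prediction of a run has a distinct combined identifier.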

Reducing redundancy:

If there are homologies in referenceDB (e.g., T1 is highly similar to T2), the same optimal exon set from a C & S combination will be called more than once. This module will group together TCSs that share an exon and will choose their representative prediction. By default, it will greedily obtain a subset of the predictions, such that there is no overlap of predictions on the same contig and strand (to allow same-strand overlaps, run with --overlap 1).

metaeuk reduceredundancy callsResultDB predsResultDB predGroupsDB

Upon completion, it will output: predsResultDB and predGroupsDB. predsResultDB contains information about the predictions (same format as callsResultDB). Each line of predGroupsDB maps from a prediction to all TCSs that share an exon with it.
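
To make the default greedy behaviour more concrete, the idea of keeping a non-overlapping subset per contig and strand can be sketched on a plain table. This is an illustration of greedy selection in general, not MetaEuk's implementation: the tab-separated input layout (contig, strand, low, high, score), the ordering by a score column and the function name greedy_non_overlapping are all assumptions made for this example.

```sh
#!/bin/sh
# Illustrative greedy selection, NOT MetaEuk's code.
# Input: tab-separated lines "contig<TAB>strand<TAB>low<TAB>high<TAB>score" (assumed layout).
# Candidates are visited from highest to lowest score; one is kept only if it does
# not overlap an already kept candidate on the same contig and strand.
greedy_non_overlapping() {
    sort -t "$(printf '\t')" -k5,5nr "$1" | awk -F'\t' '
        {
            key = $1 FS $2                       # contig + strand
            ok = 1
            for (i = 1; i <= n[key]; i++) {
                # Two closed intervals overlap if each starts before the other ends.
                if ($3 + 0 <= hi[key, i] && $4 + 0 >= lo[key, i]) { ok = 0; break }
            }
            if (ok) {
                n[key]++
                lo[key, n[key]] = $3 + 0
                hi[key, n[key]] = $4 + 0
                print
            }
        }'
}
```

Candidates on opposite strands never exclude each other in this sketch, which mirrors the statement above that only same-strand overlaps are resolved by default.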

Converting to Fasta and GFF:

The callsResultDB/predsResultDB produced by the modules above, can be used to extract the sequences of the predicted protein-coding ge
