Eukaryotic Genome Annotation Pipeline - External (EGAPx)
EGAPx is the publicly accessible version of the updated NCBI Eukaryotic Genome Annotation Pipeline.
EGAPx takes an assembly FASTA file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx picks protein sets and HMM models. The pipeline runs miniprot or prosplign to align protein sequences, STAR to align short RNA-seq reads, and minimap2 to align long RNA-seq reads to the assembly. Protein alignments and RNA-seq read alignments are then passed to Gnomon for gene prediction. In the first step of Gnomon, the short alignments are chained together into putative gene models. In the second step, these predictions are supplemented by ab initio predictions based on HMM models. Functional annotation is added to the final structural annotation set based on the type and quality of the model and orthology information. Optionally, noncoding RNAs (tRNAs, rRNAs, snoRNAs and snRNAs) can be predicted using tRNAscan and cmsearch. The final output includes annotated features in ASN format, which can be used to prepare GenBank annotation submissions with the included prepare_submission script, as well as annotation in GFF3 format for pre-submission analysis and easier modification of predicted features.
We currently have protein datasets posted that are suitable for most vertebrates, arthropods, echinoderms, and some plants:
- Chordata - Mammalia, Sauropsida, Actinopterygii (ray-finned fishes), other Vertebrates
- Insecta - Hymenoptera, Diptera, Lepidoptera, Coleoptera, Hemiptera
- Arthropoda - Arachnida, other Arthropoda
- Echinodermata
- Cnidaria
- Monocots - Liliopsida
- Eudicots - Asterids, Rosids, Fabids, Caryophyllales
:warning: Fungi, protists and nematodes are out-of-scope for EGAPx. We recommend using a different annotation method for these organisms.
Security Notice: EGAPx has dependencies in and outside of its execution path that include several thousand files from the NCBI C++ toolkit, and more than a million total lines of code. Static Application Security Testing has shown a small number of verified buffer overrun security vulnerabilities. Users should consult with their organizational security team on risk and if there is concern, consider mitigating options like running via VM or cloud instance.
License: See the EGAPx license here.

Contents
<!-- TOC -->
- Prerequisites
- Installation and setup
- Input data format
- Input example
- Run EGAPx
- Test run
- Offline mode
- Output
- Interpreting Output
- Intermediate files
- Modifying default parameters
- Submitting EGAPx annotation to NCBI
- FAQ
- References
- Contact us
Prerequisites
- Docker or Singularity
- AWS Batch, UGE cluster, or an r6a.4xlarge machine (32 CPUs, 256GB RAM)
- Nextflow v.23.10.1
- Python v.3.9+
Notes:
- General configuration for AWS Batch is described in the Nextflow documentation at https://www.nextflow.io/docs/latest/aws.html
- See Nextflow installation at https://www.nextflow.io/docs/latest/getstarted.html
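
The prerequisites above can be checked before a first run. The following is a minimal sketch, not part of EGAPx: it assumes the executables are named `docker`, `singularity`, and `nextflow` on your PATH, and it does not verify the Nextflow version.

```python
import shutil
import sys

# Assumption: these are the usual command names; your site may install
# the container runtime or Nextflow under a different name.
CONTAINER_RUNTIMES = ("docker", "singularity")


def check_prerequisites(which=shutil.which, python_version=sys.version_info):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Either container runtime is enough.
    if not any(which(name) for name in CONTAINER_RUNTIMES):
        problems.append("neither docker nor singularity found on PATH")
    if which("nextflow") is None:
        problems.append("nextflow not found on PATH")
    # The prerequisites list asks for Python 3.9 or newer.
    if tuple(python_version[:2]) < (3, 9):
        problems.append("Python 3.9 or newer is required")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

The `which` and `python_version` parameters exist only so the function can be exercised without changing the real environment.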
Installation and setup

    git clone https://github.com/ncbi/egapx.git
    cd egapx
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

Input data format
Input to EGAPx is in the form of a YAML file.
- The following are the required fields:

      genome: path to assembled genome in FASTA format
      taxid: NCBI Taxonomy identifier of the target organism
      short_reads: RNA-seq short reads data

  - See the Input genome section below for genome FASTA requirements/recommendations.
  - You can obtain the taxid from the NCBI Taxonomy page.
- The following are optional metadata configuration parameters:
  - Locus tag prefix. A prefix of one to nine letters used for naming genes on this genome assembly. If an official locus tag prefix was already reserved with an INSDC organization (GenBank, ENA or DDBJ) for the given BioSample and BioProject pair, provide it here. This is helpful if you want to use the final GFF3 file for studies prior to submission. Otherwise, use the default prefix 'egapxtmp', which can be updated later when preparing annotation files for submission.

        locus_tag_prefix: egapxtmp

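
A quick sanity check of these fields can catch typos before a long run starts. This is a minimal sketch, not EGAPx's own validation: it works on an already-parsed dictionary (for example, the result of loading the YAML file with a YAML library), and the helper name `validate_config` is hypothetical. The locus tag rule is an assumption that reads "one to nine letters" as one to nine alphanumeric characters starting with a letter.

```python
REQUIRED_FIELDS = ("genome", "taxid", "short_reads")
DEFAULT_LOCUS_TAG_PREFIX = "egapxtmp"


def validate_config(config):
    """Return a list of problems found in an already-parsed input dict."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if name not in config]

    taxid = config.get("taxid")
    if taxid is not None and not str(taxid).isdigit():
        problems.append(f"taxid should be a numeric NCBI Taxonomy ID, got {taxid!r}")

    prefix = str(config.get("locus_tag_prefix", DEFAULT_LOCUS_TAG_PREFIX))
    # Assumption: 1-9 alphanumeric characters starting with a letter;
    # the rules applied by INSDC databases may be stricter.
    if not (1 <= len(prefix) <= 9 and prefix.isalnum() and prefix[0].isalpha()):
        problems.append(f"locus_tag_prefix looks invalid: {prefix!r}")
    return problems
```

For example, `validate_config({"genome": "genome.fa", "taxid": 43150, "short_reads": ["SRR8506572"]})` returns an empty list, because all three required fields are present and the default prefix is used.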
Input genome
- The assembled genome should be in FASTA format. Sequence titles and some special characters are allowed in the FASTA definition line, but shorter and simpler names are less likely to cause issues.
- The genome sequence does not need to be repeat-masked prior to annotation. EGAPx performs masking steps as part of the pipeline.
- :warning: The assembled genome should be screened for contamination prior to running EGAPx. See the NCBI Foreign Contamination Screen for a fast, user-friendly contamination screening tool.
- We recommend keeping organelle sequences in the genome FASTA to prevent inaccurate read mapping, but EGAPx does not support organelle annotation.
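
Because simpler definition lines are less likely to cause issues, it can help to scan the genome FASTA for unusual sequence IDs before annotation. The sketch below is illustrative only: the set of "simple" characters and the 50-character limit are assumptions chosen for the example, not rules enforced by EGAPx.

```python
import re

# Assumption: letters, digits, '.', '_' and '-' count as "simple";
# the length limit is likewise an illustrative threshold.
SIMPLE_ID = re.compile(r"^[A-Za-z0-9._-]+$")
MAX_ID_LENGTH = 50


def check_deflines(fasta_lines):
    """Yield (sequence_id, problem) pairs for definition lines worth a second look."""
    seen = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence data, not a definition line
        fields = line[1:].split()
        seq_id = fields[0] if fields else ""  # text after the ID is the title
        if not seq_id:
            yield seq_id, "empty sequence ID"
        elif seq_id in seen:
            yield seq_id, "duplicate sequence ID"
        elif len(seq_id) > MAX_ID_LENGTH:
            yield seq_id, "long sequence ID"
        elif not SIMPLE_ID.match(seq_id):
            yield seq_id, "unusual characters in sequence ID"
        seen.add(seq_id)
```

Since the function accepts any iterable of lines, an open file handle can be passed directly, for example `for seq_id, problem in check_deflines(open("genome.fa")): print(seq_id, problem)`, where `genome.fa` is a placeholder path.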
Running EGAPx with short RNA-seq reads
- RNA-seq short reads data can be supplied in any one of the following ways:

      short_reads: [ nested list of read set names and paths, FASTA or FASTQ files]
      short_reads: path_to_short_reads_list.txt
      short_reads: [ array of SRA run IDs or Study IDs]
      short_reads: SRA Entrez query

- If you are using local reads, the recommended input format is a nested list of read set names and paths, or a list of read set names and paths in a separate file. For smaller RNA-seq datasets, you can follow the nested list format below. The filenames for the reads can be anything, but the name of each set must be unique.

      short_reads:
        - - single_end_library_name1     # set name
          - - path/to/se1_reads.fq       # file for single-end reads
        - - single_end_library_name2
          - - path/to/se2_reads.fq
        - - paired_end_library_name1     # set name
          - - path/to/pe1_reads_R1.fq    # file for paired-end R1 reads
            - path/to/pe1_reads_R2.fq    # file for paired-end R2 reads
        - - paired_end_library_name2
          - - path/to/pe2_reads_R1.fq
            - path/to/pe2_reads_R2.fq

- For a large number of local RNA-seq runs, you can list them in a file with a set name and a file path on each line:

      seset1 path/to/se1_reads_R1.fq  # file for single-end reads
      peset1 path/to/pe1_reads_R1.fq  # file for paired-end R1 reads
      peset1 path/to/pe1_reads_R2.fq  # file for paired-end R2 reads
      peset2 path/to/pe2_reads_R1.fq
      peset2 path/to/pe2_reads_R2.fq

  Then reference that file from the input YAML:

      short_reads: path/to/reads.txt

  See examples/input_D_farinae_small_reads.txt and examples/input_D_farinae_small_readlist.yaml for an example using this strategy.

- NCBI SRA datasets can be specified as an array:

      short_reads:
        - SRR8506572
        - SRR9005248

  - If you provide an SRA Study ID, all the SRA run IDs belonging to that Study ID will be included in the EGAPx run.

- To specify an SRA Entrez query:

      short_reads: txid43150[Organism] AND 50:350[ReadLength] AND (illumina[Platform] OR bgiseq[Platform]) AND biomol_rna[Properties]

  Note: Some SRA Entrez queries can return a large number of SRA run IDs. To prevent EGAPx from using a large number of SRA runs, please run the query first at the NCBI SRA page. If there are too many SRA runs, select a few of them and list them in the input YAML.
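
The two local-read formats above carry the same information, and the four accepted forms of `short_reads` can usually be told apart by shape. The sketch below illustrates both points under stated assumptions: the function names are hypothetical, the handling of `#` as a comment marker in the list file is an assumption, and the classification is a heuristic written for this example rather than the logic EGAPx itself uses.

```python
import re

# SRA run accessions start with SRR/ERR/DRR, study accessions with SRP/ERP/DRP.
SRA_ID = re.compile(r"^[SED]R[RP]\d+$")


def parse_reads_list(text):
    """Group 'set_name path' lines into the nested [[name, [paths]], ...] form."""
    sets = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        name, path = line.split(None, 1)
        # Lines sharing a set name (e.g. R1 and R2) end up in the same set.
        sets.setdefault(name, []).append(path)
    return [[name, paths] for name, paths in sets.items()]


def classify_short_reads(value):
    """Guess which of the four documented forms a short_reads value uses."""
    if isinstance(value, list):
        if value and all(isinstance(v, str) and SRA_ID.match(v) for v in value):
            return "sra_ids"
        return "nested_list"
    if "[" in value and "]" in value:  # Entrez field tags such as [Organism]
        return "entrez_query"
    return "reads_list_file"
```

With this sketch, a list file containing two `peset1` lines yields a single `peset1` entry holding both paths, which mirrors the paired-end entries in the nested YAML form.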
Running EGAPx with short and long RNA-seq reads
- Optionally, you can also include long-read RNA-seq data from SRA or local files (FASTA or FASTQ, not BAM), using the same formatting structure as for short reads under the label long_reads:

      genome: path to assembled genome in FASTA format
      taxid: NCBI Taxonomy identifier of the target organism
      short_reads: RNA-seq short reads data
      long_reads: RNA-seq long reads data

  - See examples/input_Hirundo_rustica.yaml for an example.

- To specify an SRA Entrez query:

      short_reads: txid43150[Organism] AND 50:350[ReadLength] AND (illumina[Platform] OR bgiseq[Platform]) AND biomol_rna[Properties]
      long_reads: txid43150[Organism] AND (oxford_nanopore[Platform] OR pacbio_smrt[Platform]) AND biomol_rna[Properties]

- We have not rigorously tested EGAPx performance using clus
