Eukaryotic Genome Annotation Pipeline - External (EGAPx)

EGAPx is the publicly accessible version of the updated NCBI Eukaryotic Genome Annotation Pipeline.

EGAPx takes an assembly FASTA file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs miniprot or prosplign to align protein sequences, STAR to align short RNA-seq reads, and minimap2 to align long RNA-seq reads to the assembly. Protein alignments and RNA-seq read alignments are then passed to Gnomon for gene prediction. In the first step of Gnomon, the short alignments are chained together into putative gene models. In the second step, these predictions are further supplemented by ab-initio predictions based on HMM models. Functional annotation is added to the final structural annotation set based on the type and quality of the model and orthology information. Optionally, noncoding RNAs (tRNAs, rRNAs, snoRNAs and snRNAs) can be predicted using tRNAscan and cmsearch. The final output includes annotated features in ASN format, which can be used to prepare GenBank annotation submissions using the included prepare_submission script, as well as annotation in GFF3 format for pre-submission analysis and easier modification of predicted features.

We currently have protein datasets posted that are suitable for most vertebrates, arthropods, echinoderms, cnidarians, and some plants:

Chordata - Mammalia, Sauropsida, Actinopterygii (ray-finned fishes), other Vertebrates
Insecta - Hymenoptera, Diptera, Lepidoptera, Coleoptera, Hemiptera
Arthropoda - Arachnida, other Arthropoda
Echinodermata
Cnidaria
Monocots - Liliopsida
Eudicots - Asterids, Rosids, Fabids, Caryophyllales

:warning: Fungi, protists and nematodes are out-of-scope for EGAPx. We recommend using a different annotation method for these organisms.

Security Notice: EGAPx has dependencies in and outside of its execution path that include several thousand files from the NCBI C++ toolkit, and more than a million total lines of code. Static Application Security Testing has shown a small number of verified buffer overrun security vulnerabilities. Users should consult with their organizational security team on risk and if there is concern, consider mitigating options like running via VM or cloud instance.

License: See the EGAPx license here.


Prerequisites
Installation and setup
Input data format
Input example
Run EGAPx
Test run
Offline mode
Output
Interpreting Output
Intermediate files
Modifying default parameters
Submitting EGAPx annotation to NCBI
FAQ
References
Contact us

Prerequisites

Docker or Singularity
AWS Batch, UGE cluster, or an r6a.4xlarge machine (32 CPUs, 256GB RAM)
Nextflow v.23.10.1
Python v.3.9+

Notes:

General configuration for AWS Batch is described in the Nextflow documentation at https://www.nextflow.io/docs/latest/aws.html
See Nextflow installation at https://www.nextflow.io/docs/latest/getstarted.html

Installation and setup

```
git clone https://github.com/ncbi/egapx.git
cd egapx

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Input data format

Input to EGAPx is in the form of a YAML file.

The following are the required fields:

genome: path to assembled genome in FASTA format
taxid: NCBI Taxonomy identifier of the target organism 
short_reads: RNA-seq short reads data

See the Input genome section below for genome FASTA requirements and recommendations.
You can obtain taxid from the NCBI Taxonomy page.
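
As a quick pre-flight check, the three required fields can be verified before launching a run. The following is a minimal sketch, not part of EGAPx: it assumes the input YAML has already been loaded into a Python dict, and `missing_required_fields` is a hypothetical helper name. The path, taxid, and run ID are placeholders.

```python
# Required top-level keys, as listed above.
REQUIRED_FIELDS = ("genome", "taxid", "short_reads")


def missing_required_fields(config):
    """Return the required keys that are absent or empty in config."""
    return [key for key in REQUIRED_FIELDS if config.get(key) in (None, "", [])]


# Placeholder values for illustration only.
config = {
    "genome": "path/to/genome.fa",
    "taxid": 12345,
    "short_reads": ["SRR0000000"],
}

print(missing_required_fields(config))            # []
print(missing_required_fields({"taxid": 12345}))  # ['genome', 'short_reads']
```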

The following are optional metadata configuration parameters:
- Locus tag prefix. A 1- to 9-letter prefix used to name genes on this genome assembly. If an official locus tag prefix has already been reserved with an INSDC organization (GenBank, ENA or DDBJ) for the given BioSample and BioProject pair, provide it here. This is helpful if you want to use the final GFF3 file for studies prior to submission. Otherwise, use the default prefix 'egapxtmp', which can be updated later when preparing annotation files for submission.
```
  locus_tag_prefix: egapxtmp 
```

Input genome

The assembled genome should be in FASTA format. Sequence titles and some special characters are allowed in the FASTA definition line, but shorter and simpler names are less likely to cause issues.
The genome sequence does not need to be repeat-masked prior to annotation. EGAPx performs masking steps as part of the pipeline.
:warning: The assembled genome should be screened for contamination prior to running EGAPx. See the NCBI Foreign Contamination Screen for a fast, user-friendly contamination screening tool.
We recommend keeping organelle sequences in the genome FASTA to prevent inaccurate read mapping, but EGAPx does not support organelle annotation.

Running EGAPx with short RNA-seq reads

RNA-seq short reads data can be supplied in any one of the following ways:

short_reads: [ nested list of read set names and paths, FASTA or FASTQ files]
short_reads: path_to_short_reads_list.txt
short_reads: [ array of SRA run IDs or Study IDs]
short_reads: SRA Entrez query

If you are using local reads, the recommended input format is either a nested list of read set names and paths, or a list of read set names and paths in a separate file. For smaller RNA-seq datasets, you can follow the nested list format below. The filenames for the reads can be anything, but each set name has to be unique.

```
short_reads:
 - - single_end_library_name1   # set name
   - - path/to/se1_reads.fq     # file for single-end reads
 - - single_end_library_name2
   - - path/to/se2_reads.fq
 - - paired_end_library_name1   # set name
   - - path/to/pe1_reads_R1.fq  # file for paired-end R1 reads
     - path/to/pe1_reads_R2.fq  # file for paired-end R2 reads
 - - paired_end_library_name2
   - - path/to/pe2_reads_R1.fq
     - path/to/pe2_reads_R2.fq
```
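
Once parsed, the nested list above is a list of `[set_name, [files]]` pairs. The sketch below checks that structure and the unique-name requirement. It is not part of EGAPx: `check_read_sets` is a hypothetical helper, and the rule of one file per single-end set or two per paired-end set is an assumption drawn from the example above.

```python
def check_read_sets(read_sets):
    """Validate a nested [[set_name, [files]], ...] list and return any problems."""
    problems = []
    seen = set()
    for entry in read_sets:
        # Each entry must be a two-item list: a name and a list of files.
        if not (isinstance(entry, list) and len(entry) == 2 and isinstance(entry[1], list)):
            problems.append(f"malformed entry: {entry!r}")
            continue
        name, files = entry
        if name in seen:
            problems.append(f"duplicate set name: {name}")
        seen.add(name)
        # Assumption: 1 file means single-end, 2 files mean paired-end.
        if len(files) not in (1, 2):
            problems.append(f"{name}: expected 1 or 2 files, got {len(files)}")
    return problems


read_sets = [
    ["single_end_library_name1", ["path/to/se1_reads.fq"]],
    ["paired_end_library_name1", ["path/to/pe1_reads_R1.fq", "path/to/pe1_reads_R2.fq"]],
]

print(check_read_sets(read_sets))  # []
print(check_read_sets(read_sets + [["single_end_library_name1", ["path/to/other.fq"]]]))
# ['duplicate set name: single_end_library_name1']
```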

For a large number of local RNA-seq runs, you can list them in a file with a set name and a filepath in each line:
```
seset1 path/to/se1_reads_R1.fq # file for single-end reads
peset1 path/to/pe1_reads_R1.fq # file for paired-end R1 reads
peset1 path/to/pe1_reads_R2.fq # file for paired-end R2 reads
peset2 path/to/pe2_reads_R1.fq
peset2 path/to/pe2_reads_R2.fq
```
Then reference that file from the input YAML:
```
short_reads: path/to/reads.txt
```
See examples/input_D_farinae_small_reads.txt and examples/input_D_farinae_small_readlist.yaml for an example using this strategy.
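
For many local runs, the read list file can be generated rather than typed by hand. This is a minimal sketch, not an EGAPx utility: `read_list_lines` is a hypothetical helper that takes a dict of set names to file paths and emits one `set_name path` line per file. The whitespace check is an assumption based on the space-separated layout shown above.

```python
def read_list_lines(read_sets):
    """Turn {set_name: [paths]} into 'set_name path' lines, one file per line."""
    lines = []
    for name, paths in read_sets.items():
        # Assumption: names cannot contain whitespace, since it separates the columns.
        if any(ch.isspace() for ch in name):
            raise ValueError(f"set name must not contain whitespace: {name!r}")
        for path in paths:
            lines.append(f"{name} {path}")
    return lines


read_sets = {
    "seset1": ["path/to/se1_reads_R1.fq"],
    "peset1": ["path/to/pe1_reads_R1.fq", "path/to/pe1_reads_R2.fq"],
}

print("\n".join(read_list_lines(read_sets)))
# seset1 path/to/se1_reads_R1.fq
# peset1 path/to/pe1_reads_R1.fq
# peset1 path/to/pe1_reads_R2.fq
```

Using a dict keeps set names unique by construction. The resulting lines can be written to a text file and referenced from `short_reads:` as shown above.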
NCBI SRA datasets can be specified as an array:
```
short_reads:
  - SRR8506572
  - SRR9005248
```
- If you provide an SRA Study ID, all the SRA run IDs belonging to that Study ID will be included in the EGAPx run.
To specify an SRA Entrez query:
```
short_reads: txid43150[Organism] AND 50:350[ReadLength] AND (illumina[Platform] OR bgiseq[Platform]) AND biomol_rna[Properties]
```
Note: Some SRA Entrez queries can return a large number of SRA run IDs. To prevent EGAPx from using too many SRA runs, run the query first at the NCBI SRA page. If it returns too many runs, select a few of them and list them in the input YAML.
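
The size of a query can also be checked from a script through the NCBI E-utilities ESearch endpoint. The sketch below is not part of EGAPx, and the function names are hypothetical. It asks only for the hit count (`retmax=0`) in the `sra` database. Note that ESearch counts matching SRA records, which may not equal the number of runs, so treat the result as a rough size estimate.

```python
import json
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(query):
    """Build an ESearch URL that requests only the number of matching SRA records."""
    params = {"db": "sra", "term": query, "retmax": 0, "retmode": "json"}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_count(response_text):
    """Extract the hit count from an ESearch JSON response."""
    return int(json.loads(response_text)["esearchresult"]["count"])


def count_sra_records(query):
    """Query NCBI (network access required) and return the number of matching records."""
    with urllib.request.urlopen(esearch_count_url(query), timeout=30) as response:
        return parse_count(response.read().decode("utf-8"))


# Offline demonstration with a made-up response body, not real NCBI output.
print(parse_count('{"esearchresult": {"count": "12"}}'))  # 12
```

Calling `count_sra_records` with the query string from the input YAML returns the current count. If it is large, narrow the query or list selected run IDs instead.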

Running EGAPx with short and long RNA-seq reads

Optionally, you can also include long-read RNA-seq data from SRA or local files (FASTA or FASTQ, not BAM), using the same format as for short reads under the label long_reads:
```
genome: path to assembled genome in FASTA format
taxid: NCBI Taxonomy identifier of the target organism 
short_reads: RNA-seq short reads data
long_reads: RNA-seq long reads data
```
- See examples/input_Hirundo_rustica.yaml for an example.

To specify an SRA Entrez query:
```
short_reads: txid43150[Organism] AND 50:350[ReadLength] AND (illumina[Platform] OR bgiseq[Platform]) AND biomol_rna[Properties]
long_reads: txid43150[Organism] AND (oxford_nanopore[Platform] OR pacbio_smrt[Platform]) AND biomol_rna[Properties]
```

We have not rigorously tested EGAPx performance using clus
