Helixer is a tool for structural genome annotation. It uses deep neural networks and a hidden Markov model to directly provide primary gene models in a GFF3 file. It is performant and applicable to a wide variety of genomes. However, users should be aware that the software is still under active development and being improved.

Goal
Web tool
Installation
Network architecture
Example usage
Expert mode
Citation

Goal

Perform ab initio prediction of the gene structure for your species; that is, perform "gene calling" and identify which base pairs in a genome belong to the UTR, CDS or intron regions of genes. Trained models are available for four lineages: fungi, land_plant, vertebrate and invertebrate.

Web tool

Inference on one to a few genomes can be performed using the Helixer web tool: https://plabipd.de/helixer_main.html. You can then skip the installation instructions down below.

Submission instructions:

submit your genome/sequence in a valid FASTA format

minimum sequence length of a single record: 25 kbp

maximum file size (including all records): 1 GByte (Hint: if your genome exceeds the file size limit, you could split it by chromosome or submit a compressed file; '.gz', '.zip' and '.bz2' are supported)
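
Before uploading, the limits above can be checked locally. The following is a minimal sketch, not part of Helixer: the function names are hypothetical, and it assumes that "25 kbp" means 25,000 bp and "1 GByte" means 10^9 bytes, which the web tool may interpret differently.

```python
import io

MIN_RECORD_BP = 25_000          # assumption: "25 kbp" read as 25,000 bp
MAX_FILE_BYTES = 1_000_000_000  # assumption: "1 GByte" read as 10^9 bytes


def record_lengths(handle):
    """Return {record_id: sequence_length} for a FASTA text stream."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # the record ID is the header text up to the first whitespace
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def too_short(lengths, minimum=MIN_RECORD_BP):
    """List the IDs of records shorter than the minimum length."""
    return [name for name, n in lengths.items() if n < minimum]


def within_size_limit(n_bytes, limit=MAX_FILE_BYTES):
    """True if a file of n_bytes fits under the assumed upload limit."""
    return n_bytes <= limit


# tiny in-memory demo: one record above and one below the minimum length
demo = io.StringIO(">chr1 demo\n" + "ACGT" * 7000 + "\n>scaffold_9\nACGTACGT\n")
lengths = record_lengths(demo)
print(lengths)
print(too_short(lengths))
```

For a real file, open it in text mode (or through gzip.open(path, "rt") for a '.gz' file), pass the handle to record_lengths, and pass os.path.getsize(path) to within_size_limit.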

Installation

The installation time depends on the installation method you are using (e.g. Docker/Singularity or manual installation, which is Linux-only) and your experience with GitHub, Python and CUDA. A reasonably experienced user needs about 20-30 minutes to install Helixer.

Helixer can also be installed on macOS, which requires a few adjustments. Instructions can be found here.

GPU requirements

For realistically sized datasets, an Nvidia GPU or an Apple Silicon GPU (M1/M2/M3) using Apple Metal Performance Shaders (MPS) GPU acceleration (beta support) will be necessary for acceptable performance.

The example below and all provided models should run on an Nvidia GPU with 11 GB of memory (e.g. GTX 1080 Ti) and also with 8 GB (e.g. GTX 1080).

The driver for the GPU must also be installed. The following drivers (top-level version) have been shown to work with Helixer (you DON'T need to install one of these versions specifically; every Nvidia driver should work):

nvidia-driver-495
nvidia-driver-510
nvidia-driver-525
nvidia-driver-555

via Docker / Singularity (recommended)

See https://github.com/gglyptodon/helixer-docker

Additionally, please see notes on usage, which will differ slightly from the example below.

Manual installation

Please see full installation instructions. Manual installation is only available for Linux operating systems.

Galaxy ToolShed

There is also a Galaxy installation of Helixer which you can use for inference.

Helixer's architecture

Example usage/inference (gene calling)

If you want to use Helixer to annotate a genome with a provided model, start here. The best models are:

| Lineage (choose the lineage your species belongs to for prediction) | Model filename | Available since (year/month/date) |
|:--------------------------------------------------------------------|:----------------------------|:----------------------------------|
| fungi | fungi_v0.3_a_0100.h5 | 2022/11/21 |
| land_plant | land_plant_v0.3_a_0080.h5 | 2022/11/28 |
| vertebrate | vertebrate_v0.3_m_0080.h5 | 2022/12/30 |
| invertebrate | invertebrate_v0.3_m_0100.h5 | 2022/12/30 |

Acquire models

The best models for all lineages are most easily downloaded by running:

# by default the models will be at /home/<user>/.local/share/Helixer/models
scripts/fetch_helixer_models.py

If desired, a single --lineage (land_plant, vertebrate, invertebrate, or fungi) can be specified, or --all released models can be fetched. To download the models to another path, run fetch_helixer_models.py --custom-path <path_to_download_models_to>. If you want Helixer.py to use this custom path to check for new releases/lineage models, provide --downloaded-model-path <path_to_download_models_to> when running Helixer.py. Otherwise, the default folder will be checked.

Downloaded models (and any new releases) can also be found at https://zenodo.org/records/10836346, but we recommend simply using the autodownload.

Note: to use a non-default model, set --model-filepath <path/to/model.h5> to override the lineage default for Helixer.py.
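
To see whether the models have already been fetched, the expected file locations can be reconstructed from the default path and the model filenames listed above. This is only a sketch under assumptions: the helper name is hypothetical, the layout is the Linux default shown in this README, and the per-lineage subfolder is inferred from the land_plant path used in the 3-step example. Helixer itself resolves the folder via appdirs, so other operating systems will differ.

```python
from pathlib import Path

# Lineage -> model filename, copied from the model table in this README.
BEST_MODELS = {
    "fungi": "fungi_v0.3_a_0100.h5",
    "land_plant": "land_plant_v0.3_a_0080.h5",
    "vertebrate": "vertebrate_v0.3_m_0080.h5",
    "invertebrate": "invertebrate_v0.3_m_0100.h5",
}


def default_model_path(lineage, home=None):
    """Expected location of a fetched model under the default Linux layout.

    Assumes <home>/.local/share/Helixer/models/<lineage>/<filename>.
    """
    if lineage not in BEST_MODELS:
        raise ValueError(f"unknown lineage: {lineage!r}")
    home = Path(home) if home is not None else Path.home()
    return (home / ".local" / "share" / "Helixer" / "models"
            / lineage / BEST_MODELS[lineage])


# report which of the four models are present on this machine
for lineage in BEST_MODELS:
    path = default_model_path(lineage)
    status = "found" if path.is_file() else "missing"
    print(f"{lineage}: {path} ({status})")
```

A path returned by this helper is also the kind of value that --model-filepath expects.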

1-step inference (recommended)

The command below converts the input DNA sequence to numerical matrices, predicts base-wise class probabilities (is a base pair part of the intergenic region, UTR, CDS or intron) with a Deep Learning based model and post-processes those probabilities into primary gene models returning a gff3 output file. Explanations for the parameters used in this example can be found a little further down below. It should take around 3 minutes for the 1-step-inference demo below to run (when using a GPU).

# download an example chromosome
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-47/fasta/arabidopsis_lyrata/dna/Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz
# you can also unzip the fasta file (i.e. gunzip Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz),
# but it's not necessary as Helixer can handle zipped fasta files as well

# run all Helixer components from fa to gff3
Helixer.py --lineage land_plant --fasta-path Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz  \
  --species Arabidopsis_lyrata --gff-output-path Arabidopsis_lyrata_chromosome8_helixer.gff3
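
To make "base-wise class probabilities" concrete, the toy sketch below picks the most likely class for each base from a small, made-up probability table. It is an illustration only and is not what Helixer's post-processing does (that step uses a hidden Markov model with the thresholds described further down). The class order and the function name are assumptions.

```python
CLASSES = ("intergenic", "UTR", "CDS", "intron")  # assumed column order


def most_likely_classes(probabilities):
    """Pick the highest-probability class for each base position."""
    labels = []
    for row in probabilities:
        if len(row) != len(CLASSES):
            raise ValueError("expected one probability per class")
        best = max(range(len(CLASSES)), key=lambda i: row[i])
        labels.append(CLASSES[best])
    return labels


# made-up probabilities for five consecutive bases (one row per base)
toy = [
    [0.90, 0.05, 0.03, 0.02],
    [0.20, 0.70, 0.05, 0.05],
    [0.05, 0.10, 0.80, 0.05],
    [0.05, 0.05, 0.30, 0.60],
    [0.05, 0.05, 0.85, 0.05],
]
print(most_likely_classes(toy))
# → ['intergenic', 'UTR', 'CDS', 'intron', 'CDS']
```

A per-base maximum like this can produce biologically implausible structures (for example, a single intron base between two CDS bases), which is one reason a dedicated post-processing step is used to derive gene models.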

1-step inference parameters

| Parameter | Default | Explanation |
|:------------------|:--------|:--------------------------------------------------------------------------------------------------|
| --fasta-path | / | FASTA input file |
| --gff-output-path | / | Output GFF3 file path |
| --species | / | Species name. Will be added to the GFF3 file. |
| --lineage | / | What model to use for the annotation. Options are: vertebrate, land_plant, fungi or invertebrate. |
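
When annotating several genomes, it can help to assemble the command programmatically. The sketch below builds the argument list for Helixer.py using only the flags documented in this README (--lineage, --fasta-path, --species, --gff-output-path and, optionally, --model-filepath). The helper function is hypothetical and not part of Helixer, and it only constructs the command; it does not run it.

```python
import shlex

LINEAGES = ("vertebrate", "land_plant", "fungi", "invertebrate")


def helixer_command(lineage, fasta_path, species, gff_output_path,
                    model_filepath=None):
    """Build the argument list for a 1-step Helixer.py run."""
    if lineage not in LINEAGES:
        raise ValueError(f"lineage must be one of {LINEAGES}, got {lineage!r}")
    cmd = [
        "Helixer.py",
        "--lineage", lineage,
        "--fasta-path", fasta_path,
        "--species", species,
        "--gff-output-path", gff_output_path,
    ]
    if model_filepath is not None:
        # overrides the lineage default model
        cmd += ["--model-filepath", model_filepath]
    return cmd


cmd = helixer_command(
    "land_plant",
    "Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz",
    "Arabidopsis_lyrata",
    "Arabidopsis_lyrata_chromosome8_helixer.gff3",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Helixer is installed. Passing a list rather than a single string avoids shell-quoting problems with unusual file names.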

3-step inference

The three main steps the command above executes can also be run separately:

fasta2h5.py: conversion of the DNA sequence to numerical matrices
HybridModel.py: prediction of base-wise probabilities with the Deep Learning based model defined/programmed in this file
helixer_post_bin (part of another repository): post-processing into primary gene models

Explanations for the parameters used in this example can be found a little further down below. You can also check out the respective help functions or the Helixer options documentation for additional usage information, if necessary. It should take around 5 minutes for the 3-step-inference demo below to run (when using a GPU).

# example broken into individual steps
# ---------------------------------------
# Consider adding the --subsequence-length parameter:  This number should be large enough to contain typical gene lengths of your species
# while being divisible by at least the timestep width of the used model, which is typically 9. (Lineage dependent defaults)
# Recommendations per lineage: vertebrate: 213840, land_plant: 64152/106920, fungi: 21384, invertebrate: 213840
# Default: 21384
fasta2h5.py --species Arabidopsis_lyrata --h5-output-path Arabidopsis_lyrata.h5 --fasta-path Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz

# the exact location ($HOME/.local/share/) of the model comes from appdirs
# the model was downloaded when fetch_helixer_models.py was called above
# this example code is for _linux_ and will need to be modified for other OSs
# the command runs HybridModel.py in verbose mode with overlap (this will
# improve prediction quality at subsequence ends by creating and overlapping 
# sliding-window predictions.)
HybridModel.py --load-model-path $HOME/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5 \
     --test-data Arabidopsis_lyrata.h5 --overlap --val-test-batch-size 32 -v --predict-phase

# order of input parameters:
# helixer_post_bin <genome.h5> <predictions.h5> <window_size> <edge_threshold> <peak_threshold> <min_coding_length> <output.gff3>
helixer_post_bin Arabidopsis_lyrata.h5 predictions.h5 100 0.1 0.8 60 Arabidopsis_lyrata_chromosome8_helixer.gff3
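
The --subsequence-length guidance in the comments above can be sanity-checked with a few lines of Python. This sketch only tests the divisibility condition stated there, using a timestep width of 9 (described as typical, so treat it as an assumption for your model). The comment says "at least", so passing this check is necessary but may not be sufficient. The function names are hypothetical.

```python
TIMESTEP_WIDTH = 9  # "typically 9" according to the comments above

# per-lineage recommendations, copied from the comments above
RECOMMENDED = {
    "vertebrate": (213840,),
    "land_plant": (64152, 106920),
    "fungi": (21384,),
    "invertebrate": (213840,),
}


def is_valid_subsequence_length(length, timestep_width=TIMESTEP_WIDTH):
    """True if length is positive and divisible by the timestep width."""
    return length > 0 and length % timestep_width == 0


def round_up_to_valid(length, timestep_width=TIMESTEP_WIDTH):
    """Smallest multiple of timestep_width that is >= length."""
    return -(-length // timestep_width) * timestep_width


# every recommended value should pass the divisibility check
for lineage, values in RECOMMENDED.items():
    for value in values:
        print(lineage, value, is_valid_subsequence_length(value))

print(round_up_to_valid(50000))
# → 50004
```

If typical genes in your species are longer than the recommended value for the lineage, round_up_to_valid gives the nearest larger length that still satisfies the divisibility condition.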

Output: The main output of the above commands is the gff3 file (Arabidopsis_lyrata_chromosome8_helixer.gff3) which contains the predicted genic structure (where the exo
