oddgenes

A list of weird gene annotations or things that break bioinformatics assumptions

See also https://github.com/cmdcolin/oddbiology/ for more weird bio

Gene structures

1bp length exon

This paper gives evidence for a 1bp length exon in Arabidopsis and discusses different splicing models

http://www.nature.com/articles/srep18087

Another 1bp exon is discussed here https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177959

Microexons in general are an interesting topic and are "involved in important biological processes in brain development and human cancers" (ref https://www.cell.com/molecular-therapy-family/nucleic-acids/fulltext/S2162-2531(23)00013-6) yet are commonly misannotated (e.g. in plants https://www.nature.com/articles/s41467-022-28449-8)

See also cryptic splice sites, cryptic exons, poison exons

0bp length exon

The phenomenon of recursive splicing can remove sequences progressively inside an intron, so there can exist "0bp exons" that are just the splice-site sequences pasted together.

"To identify potential zero nucleotide exon-type ratchet points, we parsed the RNA-Seq alignments to identify novel splice junctions where the reads mapped to an annotated 5' splice site and an unannotated 3' splice site, and the genomic sequence at the 3' splice site junction was AG/GT"

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4529404/
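As a rough illustration of the sequence criterion in the quote above, the sketch below scans an intron for AG/GT motifs, i.e. places where a 3' splice site (AG) is immediately followed by a regenerated 5' splice site (GT). The function name and the toy sequence are hypothetical, and the paper identifies ratchet points from RNA-Seq junction evidence, which motif matching alone does not replace.

```python
def candidate_ratchet_points(intron: str) -> list[int]:
    """Return 0-based offsets of the junction in every AG|GT motif.

    The offset is the index of the first base after AG (the G of GT).
    """
    seq = intron.upper()
    hits = []
    start = seq.find("AGGT")
    while start != -1:
        hits.append(start + 2)
        # Step by one base so overlapping motifs are not missed.
        start = seq.find("AGGT", start + 1)
    return hits


# Made-up intron: GTAAGT ... AG|GT ... CAG
toy_intron = "GTAAGTCCCTTTAGGTAAGTCCCCTTTTTCAG"
print(candidate_ratchet_points(toy_intron))  # [14]
```

Most AGGT hits in a real intron will not be functional ratchet points, so this would only be a first filter before checking read support.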

It was found that aberrant recursive splicing could potentially contribute to disease https://www.biorxiv.org/content/10.1101/2025.08.14.666599v1?med=mas

Very large introns

Satellite DNA study uncovers megabase scale introns https://www.biorxiv.org/content/early/2018/12/11/493254

One example in this paper, kl-3, spans 4.3 million bp

In human, an example is Dystrophin. "Dystrophin is coded for by the DMD gene – the largest known human gene, covering 2.4 megabases (0.08% of the human genome) at locus Xp21. The primary transcript in muscle measures about 2,100 kilobases and takes 16 hours to transcribe; the mature mRNA measures 14.0 kilobases" https://en.wikipedia.org/wiki/Dystrophin

Note: genes with such large introns must transcribe very large amounts of DNA into RNA, only to remove most of that RNA again by intron splicing, which is sort of "wasteful" on a molecular level. The 16-hour transcription time for dystrophin means that rapidly dividing cells cannot finish transcribing it before the next cell division interrupts the process https://pmc.ncbi.nlm.nih.gov/articles/PMC2754300/
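The figures in the Dystrophin quote imply a few back-of-the-envelope numbers, worked through in the sketch below: the average transcription speed, the share of the primary transcript that survives splicing, and the genome size implied by "2.4 megabases (0.08%)". The variable names are illustrative, and the inputs are only the rounded values from the quote.

```python
# Rounded values taken from the Dystrophin quote.
primary_transcript_kb = 2100
mature_mrna_kb = 14.0
transcription_hours = 16
gene_mb = 2.4
gene_fraction_of_genome = 0.0008  # 0.08%

# Average elongation speed over the whole primary transcript.
kb_per_min = primary_transcript_kb / (transcription_hours * 60)
nt_per_sec = kb_per_min * 1000 / 60

# Share of the primary transcript that remains after splicing.
retained_fraction = mature_mrna_kb / primary_transcript_kb

# Genome size implied by "2.4 megabases (0.08% of the human genome)".
implied_genome_mb = gene_mb / gene_fraction_of_genome

print(f"{kb_per_min:.2f} kb/min (~{nt_per_sec:.0f} nt/s)")
print(f"{retained_fraction:.2%} of the primary transcript is kept")
print(f"implied genome size: {implied_genome_mb:.0f} Mb")
```

In other words, more than 99% of the transcribed RNA is discarded as intron.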

Large number of exons

In human, the TTN (titin) gene has ~364 exons, almost double the next highest, NEB (nebulin), at ~184 exons

https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=7273

Small introns

"A 2015 study suggests that the shortest known metazoan intron length is 30 base pairs (bp) belonging to the human MST1L gene (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4675715/). The shortest known introns belong to the heterotrich ciliates, such as Stentor coeruleus, in which most (> 95%) introns are 15 or 16 bp long (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5659724/)" https://en.wikipedia.org/wiki/Intron#Distribution

A novel splicing factor may be involved in small introns https://www.news-medical.net/news/20240215/Novel-splicing-mechanism-for-short-introns-discovered.aspx
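Very short exons and introns can trip minimum-length filters in annotation and alignment tools. A minimal sketch of such a check, assuming exons are given as 1-based inclusive (start, end) pairs on a single strand. The function name, the toy coordinates, and the 30 bp threshold are illustrative choices, not any tool's actual defaults.

```python
def intron_lengths(exons: list[tuple[int, int]]) -> list[int]:
    """Lengths of the gaps between consecutive exons (1-based, inclusive)."""
    ordered = sorted(exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Hypothetical transcript: a 1 bp exon at position 200 and a 15 bp final intron.
toy_exons = [(100, 150), (200, 200), (300, 350), (366, 400)]

exon_sizes = [end - start + 1 for start, end in sorted(toy_exons)]
introns = intron_lengths(toy_exons)
short_introns = [length for length in introns if length < 30]

print(exon_sizes)     # [51, 1, 51, 35]
print(introns)        # [49, 99, 15]
print(short_introns)  # [15]
```

A filter like `length < 30` would flag the 15 bp intron as suspicious, even though introns of that size are the norm in Stentor.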

Very large proteins

An alga described in 2024 encodes a protein PKZILLA-1 that has a mass of 4.7 megadaltons and contains 140 enzyme domains https://cen.acs.org/biological-chemistry/PKZILLA-proteins-smash-protein-size/102/web/2024/08

In human, the TTN gene encodes the muscle protein titin, which weighs in at almost 4 megadaltons

The DMD gene above, despite being the largest known human gene (2.4 Mb), encodes a ~427 kDa protein — large, but nowhere near megadalton scale. A shorter isoform Dp71 (~71 kDa) is expressed in non-muscle tissues. https://pmc.ncbi.nlm.nih.gov/articles/PMC49288/

Loop-out exon skipping via inverted Alu repeats

Inverted Alu pairs flanking an exon can fold into an RNA stem-loop hairpin that physically loops the exon out, causing skipping through RNA secondary structure alone rather than splicing factors. ~707 human exons are affected, including TBXT (linked to tail loss in hominoids).

https://academic.oup.com/nar/article/54/6/gkag196/8539533

Backsplicing and circRNAs

The process of "backsplicing" circularizes RNAs. There can be alternative backsplicing too

Figure from Dawoud et al. https://doi.org/10.1016/j.ncrna.2022.09.011

See https://academic.oup.com/nar/article/48/4/1779/5715065

Very large number of isoforms in Dscam

"Dscam has 24 exons; exon 4 has 12 variants, exon 6 has 48 variants, exon 9 has 33 variants, and exon 17 has two variants. The combination of exons 4, 6, and 9 leads to 19,008 possible isoforms with different extracellular domains (due to differences in Ig2, Ig3 and Ig4). With two different transmembrane domains from exon 17, the total possible protein products could reach 38,016 isoforms"

Ref https://en.wikipedia.org/wiki/DSCAM https://www.wikigenes.org/e/gene/e/35652.html
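The isoform counts in the quote are a plain product of the number of variants per alternative exon, assuming one variant is picked independently from each cluster. A quick check of the arithmetic (the dictionary layout is just for illustration):

```python
from math import prod

# Alternative variants per exon cluster, as given in the quote.
variants = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# Exons 4, 6 and 9 determine the extracellular domain.
extracellular = prod(variants[e] for e in ("exon 4", "exon 6", "exon 9"))
# Adding the two transmembrane variants from exon 17 gives the total.
total = prod(variants.values())

print(extracellular)  # 19008
print(total)          # 38016
```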

Translational frameshift/Ribosomal frameshift/Programmed ribosomal frameshift

Ref https://en.wikipedia.org/wiki/Translational_frameshift

https://www.sciencedirect.com/topics/neuroscience/ribosomal-frameshifting

SARS-CoV-2 uses ribosomal frameshifting. This video shows a 3D animation of the process, including how a 'pseudoknot' in the RNA contributes to it https://www.youtube.com/watch?v=gLcueW61QMU

Another lecture explaining frameshift in viruses https://youtu.be/b5BX5A3dGUQ?t=2980

In retroviruses like HIV, the gag and pol genes overlap in different reading frames. A -1 ribosomal frameshift at a "slippery sequence" between them produces the Gag-Pol fusion polyprotein at ~5% efficiency, while the other 95% of ribosomes terminate at the gag stop codon and produce only Gag. This ratio is critical — altering it is lethal to the virus. The Gag-Pol polyprotein is then cleaved by the viral protease (which is itself part of the polyprotein) to produce reverse transcriptase, integrase, and protease.

https://en.wikipedia.org/wiki/Gag-pol https://en.wikipedia.org/wiki/Pol_(HIV)

Ribosomal frameshift

Figure from Atkins et al. (2016) showing the ribosome encountering a slippery sequence (X XXY YYZ) and downstream RNA stimulatory element (pseudoknot) that together promote -1 programmed ribosomal frameshifting https://pmc.ncbi.nlm.nih.gov/articles/PMC7618472/
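The X XXY YYZ pattern can be searched for with a regular expression. The sketch below is a simplified candidate finder, not a frameshift predictor: restricting Y to A or U is an assumption of this example, the function name and flanking bases are made up, and no downstream stimulatory element (such as a pseudoknot) is checked.

```python
import re

# X XXY YYZ: three identical bases, then three identical A/U bases, then any base.
# Restricting Y to A or U is a simplifying assumption of this sketch.
# The lookahead lets overlapping candidates be reported.
SLIPPERY = re.compile(r"(?=(([ACGU])\2\2([AU])\3\3[ACGU]))")


def find_slippery_sites(mrna: str) -> list[tuple[int, str]]:
    """Return (0-based offset, heptamer) for every candidate slippery site."""
    seq = mrna.upper().replace("T", "U")
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(seq)]


# Toy sequence with the HIV-1 heptamer UUUUUUA embedded; the flanks are made up.
print(find_slippery_sites("GGCAAUUUUUUAGGGAAG"))  # [(5, 'UUUUUUA')]
# The SARS-CoV-2 heptamer, given as DNA.
print(find_slippery_sites("TTTAAAC"))  # [(0, 'UUUAAAC')]
```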

Ribosome hopping

"Ribosome hopping involves ribosomes skipping over large portions of an mRNA without translating them" Ref https://pubmed.ncbi.nlm.nih.gov/24711422/

The classic example is bacteriophage T4 gene 60, where the ribosome bypasses a 50-nucleotide coding gap — about half of ribosomes successfully make the hop https://pmc.ncbi.nlm.nih.gov/articles/PMC107096/
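For software that translates an mRNA linearly, a hop means the decoded sequence is no longer a contiguous slice of the transcript. A toy sketch of that idea follows. The function names, the `takeoff_end` parameter, and the placeholder sequence are hypothetical, and only the 50-nucleotide gap length comes from the text above.

```python
def hop(mrna: str, takeoff_end: int, gap: int = 50) -> str:
    """Return the sequence that is effectively decoded when the ribosome
    bypasses `gap` nucleotides starting at 0-based index `takeoff_end`."""
    return mrna[:takeoff_end] + mrna[takeoff_end + gap:]


def codons(seq: str) -> list[str]:
    """Split a sequence into complete triplets, dropping any remainder."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


# Toy message: 2 codons, a 50 nt gap of N placeholders, then 2 more codons.
toy = "AUGGGA" + "N" * 50 + "UUAGCA"

print(codons(hop(toy, takeoff_end=6)))  # ['AUG', 'GGA', 'UUA', 'GCA']
```

Since 50 is not a multiple of three, a tool that reads straight through the gap would also end up in a different reading frame downstream of it.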

Internal Ribosome Entry Sites (IRES)

An IRES allows ribosomes to initiate translation at an internal position on an mRNA without scanning from the 5' cap. This enables cap-independent translation and is used by many viruses to hijack host ribosomes. Some cellular mRNAs also contain IRES elements, allowing them to be translated under stress conditions when cap-dependent translation is shut down.

https://en.wikipedia.org/wiki/Internal_ribosome_entry_site

Stop codon readthrough/translational readthrough

"Stop codon suppression or translational readthrough occurs when in translation a stop codon is interpreted as a sense codon, that is, when a (standard) amino acid is 'encoded' by the stop codon. Mutated tRNAs can be the cause of readthrough, but also certain nucleotide motifs close to the stop codon. Translational readthrough is very common in viruses and bacteria, and has also been found as a gene regulatory principle in humans, yeasts, bacteria and drosophila.[28][29] This kind of endogenous translational readthrough constitutes a variation of the genetic code, because a stop codon codes for an amino acid. In the case of human malate dehydrogenase, the stop codon is read through with a frequency of about 4%.[30] The amino acid inserted at the stop codon depends on the identity of the stop codon itself: Gln, Tyr, and Lys have been found for the UAA and UAG codons, while Cys, Trp, and Arg for the UGA codon have been identified by mass spectrometry.[31] Extent of readthrough in mammals have widely variable extents, and can broadly diversify the proteome and affect cancer progression.[32] "

https://en.wikipedia.org/wiki/Stop_codon#Translational_readthrough

Stop codon re-assignment: selenocysteine

The amino acid Selenocysteine is coded for by an "opal" (UGA) stop codon (https://en.wikipedia.org/wiki/Selenocysteine)

Selenocysteine is present in all domains of life, including humans

As of 2021, 136 human proteins (in 37 families) are known to contain selenocysteine

A stem-loop structure in the mRNA called a SECIS element (https://en.wikipedia.org/wiki/SECIS_element) signals the ribosome to read UGA as selenocysteine instead of stop. The resulting products are called selenoproteins.
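The effect on translation software can be sketched with a toy translator that reads UGA as selenocysteine (one-letter code U) only when told a SECIS element is present. Reducing SECIS to a boolean flag is a deliberate simplification of this example. Real recoding depends on the element's structure and on dedicated factors, none of which are modeled here, and the function and variable names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; "*" marks a stop codon.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str, has_secis: bool = False) -> str:
    """Translate from the first base; read UGA as Sec ("U") if has_secis."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        aa = CODON_TABLE[codon]
        if codon == "UGA" and has_secis:
            aa = "U"  # one-letter code for selenocysteine
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


toy = "AUGGCUUGAGGAUAA"  # AUG GCU UGA GGA UAA
print(translate(toy))                  # MA
print(translate(toy, has_secis=True))  # MAUG
```

A translator without the flag silently truncates the selenoprotein at the UGA codon, which is how such genes end up misannotated.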

Stop codon re-assignment: pyrrolysine

Pyrrolysine is likewise coded for by a stop codon, in this case "amber" (UAG) (https://en.wikipedia.org/wiki/Pyrrolysine). It is not present in humans

"It is encoded in mRNA by the UAG codon, which in most organisms is the 'amber' stop codon. This requires only the presence of the pylT gene, which encodes an unusual transfer RNA (tRNA) with a CUA anticodon, and the pylS gene, which encodes a class II aminoacyl-tRNA synthetase that charges the pylT-derived tRNA with pyrrolysine. "

There are several other stop codon modifications described here https://www.nature
