# EncodeGeneOntology

Encoder for Gene Ontology terms from their definitions or positions on the GO tree.
Encode Gene Ontology terms using their definitions or positions on the GO tree.
This is our paper.
We apply the following methods to embed GO terms:

- Definition encoder
  - BiLSTM
  - ELMo
  - Transformer based on the BERT strategy
- Position encoder
  - GCN
  - Onto2vec
The key objective is to capture the relatedness of GO terms by encoding them into similar vectors.
Consider the example below. We would expect child-parent terms to have similar vector embeddings, whereas two unrelated terms should have different embeddings. Moreover, child-parent terms occupy the same neighborhood of the GO tree, so their position embeddings should also be similar.
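The relatedness of two embedded terms is typically measured with cosine similarity. A minimal sketch of this comparison, using hypothetical 4-dimensional vectors (the released embeddings are much larger):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, for illustration only.
child     = np.array([0.9, 0.1, 0.2, 0.7])
parent    = np.array([0.8, 0.2, 0.1, 0.6])   # related term: similar direction
unrelated = np.array([-0.5, 0.9, -0.3, 0.1])

print(cosine(child, parent))     # high, close to 1
print(cosine(child, unrelated))  # low
```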

## Libraries needed
pytorch, pytorch-pretrained-bert, pytorch-geometric
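A typical install sketch via pip (PyPI package names may differ from the library names above, and pytorch-geometric often needs extra platform-specific wheels; check each project's install instructions):

```shell
pip install torch
pip install pytorch-pretrained-bert
pip install torch-geometric
```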
## How to use the Definition and Position encoders
We embed the definition or position of a term. The key idea is that child-parent terms often have similar definitions or positions in the GO tree, so we can embed them into comparable vectors.
All models are already trained and ready to use. You can download the embeddings here. There are several types of embeddings; you can try any of them. For example, download these files if you want to use the BiLSTM embedding for Tasks 1 and 2 discussed in our paper.
You can also use our trained model to produce vectors for any GO definitions; see the example script here. You will need to prepare the go.obo definition input in the format shown here.
Alternatively, you can train your own embedding by following the same example script. You only need to prepare your train/dev/test datasets in the same format here.
## Applications of the Definition and Position encoders
### Compare functions of proteins
Almost every protein is annotated by a set of GO terms; for example, see the Uniprot database. Once you can express each GO term as a vector, then for any two proteins you can compare the sets of terms annotating them. We used the Best-Match Average metric to compare two sets; however, there are other options to explore. Our example comparing two proteins is here.
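A minimal sketch of the Best-Match Average idea: for each term in one set, take its best match in the other set, average those scores, and symmetrize. Cosine similarity is used here as the pairwise measure, which is an assumption; the paper's exact similarity function may differ.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_match_average(set_a, set_b):
    """Best-Match Average between two sets of GO-term vectors."""
    # For each term in A, keep only its best match in B; then average.
    a_to_b = np.mean([max(cosine(a, b) for b in set_b) for a in set_a])
    # Symmetrize by repeating in the other direction.
    b_to_a = np.mean([max(cosine(b, a) for a in set_a) for b in set_b])
    return (a_to_b + b_to_a) / 2.0
```

Comparing a protein's annotation set against itself gives a score of 1; disjoint, dissimilar sets score lower.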
### Predict GO labels based on protein sequences
We can use the Uniprot database to train a model that predicts GO labels for an unknown protein sequence. In our paper, we demonstrate that GO embeddings can be used to predict GO labels not included in the training data (zero-shot learning). There are two advantages. First, many machine learning methods exclude rare labels because they often have problems when the training data contains very rare labels. GO embeddings allow us to adopt the zero-shot learning philosophy: we train models on the labels in the training data but test them on new, unseen labels. Second, because the GO database is constantly updated with new terms, we do not need to train a brand-new model with each update.
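The zero-shot idea can be sketched as follows: if a protein-sequence embedding and the GO-term embeddings live in a shared space, any GO term, seen or unseen during training, can be scored against a sequence. Everything below is illustrative; the function name, scoring rule (dot product), and shared-space assumption are ours, not the repo's API.

```python
import numpy as np

def score_labels(seq_vec, go_vectors):
    """Rank GO terms for one protein by dot product with its sequence
    embedding. Unseen GO terms can be ranked too, as long as their
    embedding vectors are available (the zero-shot setting)."""
    return sorted(go_vectors.items(),
                  key=lambda kv: -float(np.dot(seq_vec, kv[1])))
```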