PhosphoLingo
Phosphorylation site prediction software using protein language models and convolutional neural networks
Install / Use
/learn @jasperzuallaert/PhosphoLingoREADME
############ PhosphoLingo ############
Introduction ############ PhosphoLingo is an advanced phosphorylation predictor using deep learning and Protein Language Models. It predicts likely phosphorylation sites from sequence alone. Given the used data format, it can easily be extended to other post-translational modifications (PTMs).
Supported functionality
The PhosphoLingo tool can be called for three different goals:
- training a new model (train) and evaluating it on a test set (included)
- make predictions for new sequences using an existing model (predict)
- visualize important features using an existing model (visualize)
Input data format
#################
The input sequences should be provided using a FASTA format, with the positive and negative phosphorylation (or by extension, any PTM) sites indicated. This can be done by adding a trailing # to the modified residue for positive annotations, or a @ for negative annotations.
Dummy example for Y phosphorylation, with two positive annotations, four negative annotations, and three Y residues without annotations:
.. code-block::
>dummy_protein
MKPWETDY#MGPFRKIY@Y@WIVFYESGSDMDCNY#CYNFKGEFITQSICHPNDPSKEGDQARTWCIS
ETLRSRTTFYDLSIGLTKSFFFNRY@CIWFCFIWLSDDRMTKY@
When making predictions/visualizing for protein sequences of which no annotations might be available, the sites to be predicted should be indicated by one of the two tokens (# or @) - in this scenario, labels are ignored so both will work equally.
Configuration file format
#########################
.. configs: https://github.com/jasperzuallaert/PhosphoLingo/tree/master/configs
The architecture and training hyperparameters, dataset locations, and more information is stored in a configuration file, in a .json format. Example configurations can be found in the configs directory. The different options are given below.
====================== ========================== ===
name default value
====================== ========================== ===
training_set "Scop3P/ST/PF" The location of FASTA files to train, validate and test (default) or to train and validate (if test_set != "default")
test_set "default" "default" if regular training/validation/test scheme is followed, otherwise the location of the fold[0-9].fasta files for evaluation
test_fold 0 Only used if test_set != "default". Chooses the fold[0-9].fasta file to use for evaluation.
save_model False If True, the model will be kept after training.
representation "ProtTransT5_XL_UniRef50" The representation to use. Supported: onehot, ESM1_small, ESM1b, ESM2_150M, ESM2_650M, ESM2_3B, CARP_640M, ProtTransT5_XL_UniRef50, Ankh_base, Ankh_large
freeze_representation True Whether to freeze the representation during training or fine-tune it. All models in the paper are trained with this parameter set to True
receptive_field 65 The number of residues (centered around the candidate P-site) that are given to the CNN
conv_depth 3 The number of convolutional blocks in the architecture
conv_width 9 The filter size of the convolutional layers
conv_channels 200 The number of output channels (= number of filters) per convolutional layer
final_fc_neurons 64 The number of neurons in the final fully connected layer
dropout 0.2 Only used if batch_norm == False. Dropout p for all dropout layers
max_pool_size 1 The pooling size used in the pooling layers
batch_norm False Chooses whether to use batch normalization as regularization (if True), or dropout (if False)
batch_size 4 The number of protein fragments in each batch
warm_up_epochs 1.5 The number of epochs for learning rate warm-up (linear from 0.1*LR)
learning_rate 1e-4 Learning rate for AdamW
pos_weight 1 The weight for positive samples in the loss calculation max_epochs 10 The maximum number of epochs, if no early stopping occurs ====================== ========================== ===
Usage
For hints on how to run PhosphoLingo:
.. code-block::
python phospholingo -h
.. code-block::
usage: phospholingo [-h] {train,predict,visualize} ...
positional arguments:
{train,predict,visualize}
the desired PhosphoLingo subprogram to run
train train a new prediction model
predict predict using an existing model
visualize calculate SHAP values using an existing model
optional arguments:
-h, --help show this help message and exit
Train a new model ################# Usage:
.. code-block ::
usage: phospholingo train [-h] json
positional arguments:
json the .json configuration file for training a new model
optional arguments:
-h, --help show this help message and exit
To train a new model, supply a .json file with the desired configuration. The AUPRC, AUROC, and precisions at recall of 0.8 and 0.6 will be logged in the resulting directory. If specified in the configuration file, the checkpoint of the model will also be saved.
Training can be done via two data setups:
- (default) training/validation/test sets: The default training run. This is achieved by setting
test_settodefaultin the config. In this case, training will be done on thetrain.fastafile in the specified data directory (datasetin the config), early stopping will be done usingvalid.fasta, and test metrics are computed on thetest.fastadata. - cross-dataset evaluation: Specifically to reproduce results in the paper or to check model transferability between datasets. This is achieved by setting
test_setto the desired data directory on which evaluation should be done. Additionally, specify the fold to test on by settingtest_foldto any number between 0 and 9. Thefold[0-9].fastafile will be used for evaluation, and all proteins present will be removed from the training and validation sets.
Predict using an existing model ############################### Usage:
.. code-block ::
usage: phospholingo predict [-h] model dataset out
positional arguments:
model the location of the saved model
dataset the dataset for which to make predictions
out the output file, will be written in a csv format
optional arguments:
-h, --help show this help message and exit
You can make predictions on an unseen dataset, using a pretrained prediction model, and writing results to an out csv file. As indicated before, the dataset should be in a FASTA format, and sites to be predicted should be followed by either a # or @ symbol. The actual annotations are ignored, so either symbol will work equivalently.
Mention data format again
Visualize important features using an existing model #################################################### Usage:
.. code-block ::
usage: phospholingo visualize [-h] model dataset out_values out_img
positional arguments:
model the location of the saved model
dataset the dataset for which to visualize important features
out_values the output SHAP scores file, will be written in a txt format
out_img the normalized average SHAP scores per position, as an image file
optional arguments:
-h, --help show this help message and exit
You can make visualizations for a pretrained prediction model, on an unseen dataset. Output values will be stored in out_values (.txt format), and an image will be generated to out_img (.jpg, .png, .svg, ...).
Setting the maximum system batch size
.. utils: https://github.com/jasperzuallaert/PhosphoLingo/blob/master/phospholingo/utils.py
As Protein Language Models can be very resource-heavy to use, especially when considering the larger models and when also fine-tuning them during training, the user can set their maximum batch size for specific situations. This is done in the get_gpu_max_batchsize function in utils. Users can redefine this function so that appropriate batch sizes are returned for their system. A non-optimized example for different batch sizes using different representations is implemented, though this has not been thoroughly optimized.
Extra files
Pre-trained phosphorylation models (.ckpt format) can be downloaded from following locations. The models are trained on the Scop3P-ST-PF and Scop3P-Y-PF datasets.
====================== ======= ========= Model Targets Link ====================== ======= ========= ProtT5-XL-U50 ST https://huggingface.co/JasperZ/PhosphoLingo_ST/resolve/main/PhosphoLingo_ST_new.ckpt ProtT5-XL-U50 Y https://huggingface.co/JasperZ/PhosphoLingo_Y/resolve/main/PhosphoLingo_Y_new.ckpt ====================== ======= =========
.. _data: https://github.com/jasperzuallaert/PhosphoLingo/blob/master/data/
.. predictions: https://github.com/jasperzuallaert/PhosphoLingo/blob/master/predictions/
Datasets (FASTA format with # and @ annotations) used in this study are found in data.
Configuration files (.json format) can be found in configs_. If you want to run the preset configurations, you should only change the following parameters: training_set, test_set, ``t
Related Skills
node-connect
353.3kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
111.7kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
353.3kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
353.3kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
