LucaProt
LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structural information to predict protein function.
LucaProt (DeepProtFunc) is an open-source project developed by Alibaba and licensed under the Apache License (Version 2.0).
This product contains various third-party components under other open source licenses.
See the NOTICE file for more information.
Notice:
This repository provides the Python dependency installation file, the installation commands, and the commands for running the trained LucaProt model for inference or prediction. The trained models run on Linux, macOS, and Windows, and support both CPU and GPU inference.
TimeLine
- 2025-05-01:
  Add the deployable web application (based on Flask, in src/app/).
  Start the service on the server:
      cd LucaProt/src/app
      # modify the service port in app.py
      python app.py
  Then the service can be accessed from a browser on the client: http://${server_ip}:8000
- 2025-04-17:
  Add the post-processing workflow to classify the viral RdRPs predicted by LucaProt into our 180 supergroups or novel supergroups.
  (Guidance is given in PostProcessingWorkflow.md or PostProcessingWorkflow_zh.md of this project.)
- 2024-09-01:
  Optimize the inference and prediction code to run on GPUs with small graphics memory, such as the A10.
Introduction
LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structural information to predict protein function.
1. Model
1) Model Introduction
We developed a new deep learning model, namely, Deep Sequential and Structural Information Fusion Network for Proteins Function Prediction (DeepProtFunc/LucaProt), which takes into account protein sequence and structural information to facilitate the accurate annotation of protein function.
Here, we applied LucaProt to identify viral RdRP.
2) Model Architecture
We treat protein function prediction as a classification problem. For example, viral RdRP identification is a binary-class classification task, and protein general function annotation is a multi-label classification task. The model includes five modules: Input, Tokenizer, Encoder, Pooling, and Output. Its architecture is shown in Figure 1.
<center>
<img src="pics/LucaProt.png"/>

Figure 1 The Architecture of LucaProt
</center>

3) Model Input/Output
The model takes the amino acid letter sequence of a protein as input. It outputs the function label of the input protein, which is a single tag (binary-class or multi-class classification) or a set of tags (multi-label classification).
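How raw model scores could be turned into a single tag or a set of tags can be sketched as follows. This is a minimal illustration and not LucaProt's own code: the function `decode_prediction`, the task names `multi_class` and `multi_label`, and the assumption that the model emits one logit for the binary task and one logit per label otherwise are all hypothetical. Only `binary_class` and the 0.5 threshold appear in this README.

```python
import math
from typing import List, Union


def sigmoid(x: float) -> float:
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(
    logits: List[float], task_type: str, threshold: float = 0.5
) -> Union[int, List[int]]:
    """Turn raw scores into a label (assumed output layout, see lead-in)."""
    if task_type == "binary_class":
        # Single tag: 1 if the sigmoid probability reaches the threshold.
        return int(sigmoid(logits[0]) >= threshold)
    if task_type == "multi_class":
        # Single tag: index of the most probable class, no threshold.
        probs = softmax(logits)
        return max(range(len(probs)), key=probs.__getitem__)
    if task_type == "multi_label":
        # Set of tags: every label whose probability reaches the threshold.
        return [i for i, x in enumerate(logits) if sigmoid(x) >= threshold]
    raise ValueError(f"unknown task_type: {task_type}")
```

For viral RdRP identification, the binary branch is the relevant one: a sequence is reported as positive when its probability is at or above the chosen threshold.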
2. Dependencies
System: Ubuntu 20.04.5 LTS
Python: 3.9.13
Download anaconda: <a href="https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh" title="anaconda"> anaconda </a>
Cuda: <a href="https://developer.nvidia.com/cuda-11-7-0-download-archive" title="cuda11.7 (torch==1.13.1)"> cuda11.7 (torch==1.13.1)</a>
# Select 'YES' during installation for initializing the conda environment
sh Anaconda3-2022.10-Linux-x86_64.sh
# Source the environment
source ~/.bashrc
# Verification
conda
# Install env and python 3.9.13
conda create -n lucaprot python=3.9.13
# activate env
conda activate lucaprot
# Install git
sudo apt-get update
sudo apt install git-all
# Enter the project
cd LucaProt
# Install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
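After installation, a quick check of the environment can save a failed run later. The script below is a hedged sketch, not part of the repository: `check_environment` is a hypothetical helper that reports the Python version, whether `torch` can be imported, and whether PyTorch sees a CUDA device.

```python
import importlib
import sys


def check_environment(min_python=(3, 9)):
    """Collect basic facts about the active Python environment."""
    report = {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "python_ok": sys.version_info[:2] >= min_python,
        "torch": None,            # filled in only if torch is importable
        "cuda_available": False,  # stays False on CPU-only machines
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return report
    report["torch"] = torch.__version__
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False, the CPU commands in the Inference section (with `--gpu_id -1`) are the ones to use.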
3. Inference
You can use this project directly to run inference or prediction on sequences of unknown function.
1) Prediction from one sequence
cd LucaProt/src/prediction/
sh run_predict_one_sample.sh
Note: the embedding matrix of the sample is computed on the fly at prediction time.
Or:
cd LucaProt/src/
# using GPU(cuda=0)
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python predict_one_sample.py \
--protein_id protein_1 \
--sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ \
--emb_dir ./emb/ \
--truncation_seq_length 4096 \
--dataset_name rdrp_40_extend \
--dataset_type protein \
--task_type binary_class \
--model_type sefn \
--time_str 20230201140320 \
--step 100000 \
--threshold 0.5 \
--gpu_id 0
# using CPU(gpu_id=-1)
python predict_one_sample.py \
--protein_id protein_1 \
--sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ \
--emb_dir ./emb/ \
--truncation_seq_length 4096 \
--dataset_name rdrp_40_extend \
--dataset_type protein \
--task_type binary_class \
--model_type sefn \
--time_str 20230201140320 \
--step 100000 \
--threshold 0.5 \
--gpu_id -1
- --protein_id: str, the protein id.
- --sequence: str, the protein sequence.
- --truncation_seq_length: int, truncate sequences longer than the given value. Recommended values: 4096, 2048, 1984, 1792, 1534, 1280, 1152, 1024; default: 4096.
- --emb_dir (optional): path, the directory in which the embedding matrix or vector predicted for the protein is saved during prediction.
- --dataset_name: str, the dataset name used to build our trained model (rdrp_40_extend).
- --dataset_type: str, the dataset type used to build our trained model (protein).
- --task_type: str, the task type used to build our trained model (binary_class).
- --model_type: str, the model name used to build our trained model (sefn).
- --time_str: str, the run time string (yyyymmddHHMMSS) of our trained model (20230201140320).
- --step: int, the global training step at which the model was finalized (100000).
- --threshold: float, the sigmoid threshold for binary-class or multi-label classification; None for multi-class classification. Default: 0.5.
- --gpu_id: int, the GPU id to use (-1 for CPU).
- --torch_hub_dir (optional): str, the torch hub directory in which the pretrained model is saved (default: ~/.cache/torch/hub/).
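When the prediction script has to be called from other Python code, the documented flags can be assembled programmatically. The helper below is a sketch under assumptions: `build_predict_command` and `TRAINED_MODEL_ARGS` are hypothetical names, and the values are copied from the example commands in this README rather than read from the repository.

```python
import shlex
import sys
from typing import List

# Flags that identify the trained model, copied from the README examples.
TRAINED_MODEL_ARGS = {
    "dataset_name": "rdrp_40_extend",
    "dataset_type": "protein",
    "task_type": "binary_class",
    "model_type": "sefn",
    "time_str": "20230201140320",
    "step": 100000,
}


def build_predict_command(
    protein_id: str,
    sequence: str,
    gpu_id: int = -1,  # -1 selects the CPU
    threshold: float = 0.5,
    truncation_seq_length: int = 4096,
    emb_dir: str = "./emb/",
) -> List[str]:
    """Return the argument list for predict_one_sample.py."""
    args = {
        "protein_id": protein_id,
        "sequence": sequence,
        "emb_dir": emb_dir,
        "truncation_seq_length": truncation_seq_length,
        **TRAINED_MODEL_ARGS,
        "threshold": threshold,
        "gpu_id": gpu_id,
    }
    cmd = [sys.executable, "predict_one_sample.py"]
    for name, value in args.items():
        cmd += ["--" + name, str(value)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; the sequence is a placeholder.
    print(shlex.join(build_predict_command("protein_1", "MTTSTAFTGKTLMITGG")))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="LucaProt/src", check=True)`, which mirrors the `cd LucaProt/src/` step of the shell examples.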
2) Prediction from many sequences
The samples are stored in a *.fasta file and are predicted one sample at a time.
- --fasta_file: str, the FASTA file containing the samples.
- --save_file: str, the file path to which the predicted results are saved.
- --print_per_number: int, print progress information each time this number of samples has been completed; default: 100.
cd LucaProt/src/prediction/
sh run_predict_many_samples.sh
Or:
cd LucaProt/src/
# using GPU(cuda=0)
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python predict_many_samples.py \
--fasta_file ../data/rdrp/test/test.fasta \
--save_file ../result/rdrp/test/test_result.csv \
--emb_dir ../emb/ \
--truncation_seq_length 4096 \
--dataset_name rdrp_40_extend \
--dataset_type protein \
--task_type binary_class \
--model_type sefn \
--time_str 20230201140320 \
--step 100000 \
--threshold 0.5 \
--print_per_number 10 \
--gpu_id 0
# using CPU(gpu_id=-1)
python predict_many_samples.py \
--fasta_file ../data/rdrp/test/test.fasta \
--save_file ../result/rdrp/test/test_result.csv \
--emb_dir ../emb/ \
--truncation_seq_length 4096 \
--dataset_name rdrp_40_extend \
--dataset_type protein \
--task_type binary_class \
--model_type sefn \
--time_str 20230201140320 \
--step 100000 \
--threshold 0.5 \
--print_per_number 10 \
--gpu_id -1
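The sample-by-sample loop over a FASTA file, with a progress message every `print_per_number` samples, can be sketched as below. This is an illustration only, not the code of `predict_many_samples.py`: `read_fasta` and `predict_many` are hypothetical helpers, `predict_fn` stands in for the model, and using the full header line as the sequence id is an assumption.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the whole header (without ">") is the id.
            seq_id, chunks = line[1:].strip(), []
        elif seq_id is not None:
            chunks.append(line)  # sequences may span several lines
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def predict_many(
    records: Iterable[Tuple[str, str]],
    predict_fn: Callable[[str], object],
    print_per_number: int = 100,
) -> List[Tuple[str, object]]:
    """Apply predict_fn to each sequence, reporting progress periodically."""
    results = []
    for done, (seq_id, seq) in enumerate(records, start=1):
        results.append((seq_id, predict_fn(seq)))
        if done % print_per_number == 0:
            print(f"processed {done} sequences")
    return results
```

Because `read_fasta` is a generator, a large FASTA file can be streamed with `predict_many(read_fasta(open(path)), predict_fn)` without loading every sequence into memory first.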
3) Prediction from a file (embedding files exist in advance)
The test data (small and real) is in demo.csv, where the 7th column of each line is the filename of the structural embedding information prepared in advance.
The structural embedding files are stored in embs.
The test data includes 50 viral RdRPs and 50 non-viral RdRPs.
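Before running prediction from a file, it can help to confirm that every embedding file named in the 7th column actually exists. The sketch below is not part of the repository: `embedding_filenames` and `missing_embeddings` are hypothetical helpers, and whether demo.csv starts with a header row is an assumption exposed through the `has_header` parameter.

```python
import csv
import os
from typing import List, TextIO

EMB_FILENAME_COLUMN = 6  # the 7th column, counted from zero


def embedding_filenames(handle: TextIO, has_header: bool = True) -> List[str]:
    """Return the embedding filename stored in the 7th column of each row."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the assumed header row
    return [
        row[EMB_FILENAME_COLUMN]
        for row in reader
        if len(row) > EMB_FILENAME_COLUMN
    ]


def missing_embeddings(filenames: List[str], emb_dir: str) -> List[str]:
    """Return the filenames that are not present in emb_dir."""
    return [
        name
        for name in filenames
        if not os.path.isfile(os.path.join(emb_dir, name))
    ]
```

A typical call opens the CSV with `open(path, newline="")`, passes the handle to `embedding_filenames`, and then checks the result against the directory given to `--emb_dir`.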
cd LucaProt/src/prediction/
sh run_predict_from_file.sh
Or:
cd LucaProt/src/
# using GPU(cuda=0)
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python predict.py \
--data_path ../data/rdrp/demo/demo.csv \
--emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
--dataset_name rdrp_40_extend \
--dataset_type protein \
--task_type binary_class \
--model_type sefn \
--time_str 20230201140320 \
--step 100000 \
--evaluate \
--threshold 0.5 \
--batch_size 16 \
--print_per_batch 100 \
--gpu_id 0
# using CPU(gpu_id=-1)
python predict.py \
--data_path ../data/rdrp/demo/demo.csv \
--emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
--dataset_name rdrp_40_exten