LucaProt

LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structural information to predict protein function.


LucaProt (DeepProtFunc) is an open-source project developed by Alibaba and licensed under the Apache License (Version 2.0).

This product contains various third-party components under other open source licenses.
See the NOTICE file for more information.

Notice:
This repository provides the Python dependency file, the installation commands, and the commands for running the trained LucaProt model for inference or prediction. The trained models run on Linux, macOS, and Windows, on either CPU or GPU.

TimeLine

  • 2025-05-01:
    Add the deployable web application (based on Flask, in src/app/).
    Start the service on the server:

    cd LucaProt/src/app  
    # modify the service port in app.py    
    python app.py     
    

    Then the service can be accessed using the browser on the client: http://${server_ip}:8000

  • 2025-04-17:
    Add the post-processing workflow to classify the viral RdRPs predicted by LucaProt into our 180 supergroups or novel supergroups.
    (Guidance is given in PostProcessingWorkflow.md or PostProcessingWorkflow_zh.md of this project.)

  • 2024-09-01:
    Optimize the inference and prediction code to run on GPUs with limited memory, such as the A10.

Introduction

LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structural information to predict protein function.

1. Model

1) Model Introduction

We developed a new deep learning model, namely, Deep Sequential and Structural Information Fusion Network for Proteins Function Prediction (DeepProtFunc/LucaProt), which takes into account protein sequence and structural information to facilitate the accurate annotation of protein function.

Here, we applied LucaProt to identify viral RdRP.

2) Model Architecture

We treat protein function prediction as a classification problem. For example, viral RdRP identification is a binary-class classification task, and protein general function annotation is a multi-label classification task. The model includes five modules: Input, Tokenizer, Encoder, Pooling, and Output. Its architecture is shown in Figure 1.

<center> <img src="pics/LucaProt.png"/>

Figure 1 The Architecture of LucaProt

</center>

3) Model Input/Output

The model takes the amino acid letter sequence of a protein as input. It outputs the function label of that protein, which is either a single tag (binary-class or multi-class classification) or a set of tags (multi-label classification).
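
The difference between a single tag and a set of tags can be illustrated with a minimal sketch. It is not taken from the LucaProt code base: the function `decode_output`, the label names, the use of raw logits as input, and the task type strings `multi_class` and `multi_label` are assumptions made for illustration. Only `binary_class` and the default threshold of 0.5 appear in the prediction options of this README, and whether LucaProt compares with `>=` or `>` is not stated there.

```python
import math
from typing import List, Sequence, Union


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_output(
    logits: Sequence[float],
    labels: Sequence[str],
    task_type: str,
    threshold: float = 0.5,
) -> Union[str, List[str]]:
    """Turn raw scores into one tag or a set of tags (illustrative only)."""
    if task_type == "binary_class":
        # One logit; labels are ordered as (negative, positive).
        prob = sigmoid(logits[0])
        return labels[1] if prob >= threshold else labels[0]
    if task_type == "multi_class":
        # One logit per class; pick the highest-scoring class, no threshold.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return labels[best]
    if task_type == "multi_label":
        # One logit per label; keep every label whose probability passes the threshold.
        return [lab for lab, z in zip(labels, logits) if sigmoid(z) >= threshold]
    raise ValueError(f"unknown task_type: {task_type}")


print(decode_output([2.0], ["non-RdRP", "RdRP"], "binary_class"))
print(decode_output([0.1, 1.5, -0.3], ["A", "B", "C"], "multi_class"))
print(decode_output([1.2, -2.0, 0.4], ["A", "B", "C"], "multi_label"))
```

In the binary and multi-label cases the threshold trades precision against recall, which is why a threshold only applies to those two task types.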

2. Dependencies

System: Ubuntu 20.04.5 LTS
Python: 3.9.13
Download anaconda: <a href="https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh" title="anaconda"> anaconda </a>
CUDA: <a href="https://developer.nvidia.com/cuda-11-7-0-download-archive" title="cuda11.7 (torch==1.13.1)"> cuda11.7 (torch==1.13.1)</a>

# Select 'YES' during installation for initializing the conda environment  
sh Anaconda3-2022.10-Linux-x86_64.sh  
# Source the environment
source ~/.bashrc  
# Verification
conda  
# Install env and python 3.9.13   
conda create -n lucaprot python=3.9.13    
# activate env
conda activate lucaprot  
# Install git      
sudo apt-get update         
sudo apt install git-all

# Enter the project   
cd LucaProt     

# Install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple        
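
After installation, a quick check of the environment can save a failed run later. The following is a minimal sketch, not a script shipped with this project: the helper name `check_environment` is hypothetical, and it only reports the Python version and, if `torch` is importable, the torch version and whether CUDA is visible.

```python
import importlib.util
import platform


def check_environment() -> dict:
    """Collect the Python version and, if torch is installed, its CUDA status."""
    info = {"python": platform.python_version(), "torch": None, "cuda_available": None}
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the check also runs before installation

        info["torch"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
    return info


print(check_environment())
```

With the versions listed above, the report would be expected to show Python 3.9.13 and a torch 1.13.1 build. If `cuda_available` is `False`, the CPU variants of the commands below (`--gpu_id -1`) still apply.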

3. Inference

You can use this project directly to run inference or prediction on unknown sequences.

1) Prediction from one sequence

cd LucaProt/src/prediction/ 
sh run_predict_one_sample.sh

Note: the embedding matrix of the sample is computed on the fly at prediction time.

Or:

cd LucaProt/src/

# using GPU(cuda=0)    
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python predict_one_sample.py \
    --protein_id protein_1 \
    --sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ	\
    --emb_dir ./emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5 \
    --gpu_id 0
  
# using CPU(gpu_id=-1)    
python predict_one_sample.py \
    --protein_id protein_1 \
    --sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ	\
    --emb_dir ./emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5 \
    --gpu_id -1
  • --protein_id
    str, the protein id.

  • --sequence
    str, the protein sequence.

  • --truncation_seq_length
    int, truncate sequences longer than the given value. Recommended values: 4096, 2048, 1984, 1792, 1534, 1280, 1152, 1024, default: 4096.

  • --emb_dir (optional)
    path, the directory in which the embedding matrix or vector predicted for each protein is saved during prediction.

  • --dataset_name
    str, the dataset name used to build our trained model (rdrp_40_extend).

  • --dataset_type
    str, the dataset type used to build our trained model (protein).

  • --task_type
    str, the task type used to build our trained model (binary_class).

  • --model_type
    str, the model name used to build our trained model (sefn).

  • --time_str
    str, the run timestamp string (yyyymmddHimiss) of our trained model (20230201140320).

  • --step
    int, the global training step at which the model was finalized (100000).

  • --threshold
    float, sigmoid threshold for binary-class or multi-label classification, None for multi-class classification, default: 0.5.

  • --gpu_id
    int, the GPU id to use (-1 for CPU).

  • --torch_hub_dir (optional)
    str, the torch hub directory in which the pretrained model is saved (default: ~/.cache/torch/hub/).
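
When calling the script from Python rather than from a shell, the options above can be assembled programmatically. The sketch below is an illustration under stated assumptions: `build_predict_command` and `DEFAULT_ARGS` are hypothetical names, the sequence is shortened for readability, and only flags and values shown in the example command are used.

```python
import shlex
from typing import Dict, List, Union

# Values copied from the example command above; no other flags are assumed.
DEFAULT_ARGS: Dict[str, Union[str, int, float]] = {
    "emb_dir": "./emb/",
    "truncation_seq_length": 4096,
    "dataset_name": "rdrp_40_extend",
    "dataset_type": "protein",
    "task_type": "binary_class",
    "model_type": "sefn",
    "time_str": "20230201140320",
    "step": 100000,
    "threshold": 0.5,
}


def build_predict_command(protein_id: str, sequence: str, gpu_id: int = -1) -> List[str]:
    """Assemble the argv list for predict_one_sample.py (hypothetical helper)."""
    cmd = ["python", "predict_one_sample.py", "--protein_id", protein_id, "--sequence", sequence]
    for name, value in DEFAULT_ARGS.items():
        cmd += [f"--{name}", str(value)]
    cmd += ["--gpu_id", str(gpu_id)]  # -1 selects the CPU
    return cmd


# Shortened sequence, for illustration only.
cmd = build_predict_command("protein_1", "MTTSTAFTGKTLMITGG", gpu_id=-1)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="LucaProt/src", check=True)`. Passing a list instead of a single string avoids shell quoting problems with long sequences.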

2) Prediction from many sequences

The samples are stored in a *.fasta file and are predicted one sample at a time.

  • --fasta_file
    str, the samples fasta file.

  • --save_file
    str, file path, save the predicted results into the file.

  • --print_per_number
    int, print progress information for every number of samples completed, default: 100.

cd LucaProt/src/prediction/   
sh run_predict_many_samples.sh

Or:

cd LucaProt/src/

# using GPU(cuda=0)   
export CUDA_VISIBLE_DEVICES="0,1,2,3"  
python predict_many_samples.py \
	--fasta_file ../data/rdrp/test/test.fasta  \
	--save_file ../result/rdrp/test/test_result.csv  \
	--emb_dir ../emb/   \
	--truncation_seq_length 4096  \
	--dataset_name rdrp_40_extend  \
	--dataset_type protein     \
	--task_type binary_class     \
	--model_type sefn     \
	--time_str 20230201140320   \
	--step 100000  \
	--threshold 0.5 \
	--print_per_number 10 \
	--gpu_id 0
	

# using CPU(gpu_id=-1)               
python predict_many_samples.py \
	--fasta_file ../data/rdrp/test/test.fasta  \
	--save_file ../result/rdrp/test/test_result.csv  \
	--emb_dir ../emb/   \
	--truncation_seq_length 4096  \
	--dataset_name rdrp_40_extend  \
	--dataset_type protein     \
	--task_type binary_class     \
	--model_type sefn     \
	--time_str 20230201140320   \
	--step 100000  \
	--threshold 0.5 \
	--print_per_number 10 \
	--gpu_id -1
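
Before starting a batch prediction, it can help to inspect the FASTA input. The sketch below is not part of LucaProt: `read_fasta` and `count_over_length` are hypothetical helpers that parse a FASTA stream in pure Python and count how many sequences exceed a given `--truncation_seq_length`. It assumes that the record id is the first word after `>` and that ids are unique (a repeated id overwrites the earlier record).

```python
import io
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record id: sequence}."""
    records: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            records[current] += line  # sequences may span several lines
    return records


def count_over_length(records: Dict[str, str], max_len: int = 4096) -> int:
    """Number of sequences longer than max_len (these would be truncated)."""
    return sum(1 for seq in records.values() if len(seq) > max_len)


demo = io.StringIO(">protein_1 demo\nMTTSTAFTGK\nTLMITGG\n>protein_2\nMKV\n")
records = read_fasta(demo)
print(len(records), count_over_length(records, max_len=10))
```

For a real file, replace the `io.StringIO` object with `open("../data/rdrp/test/test.fasta")`.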

3) Prediction from a file (embedding files prepared in advance)

The test data (small and real) is in demo.csv, where the 7th column of each line is the filename of the structural embedding information prepared in advance.
The structural embedding files are stored in embs.

The test data includes 50 viral-RdRPs and 50 non-viral RdRPs.
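
Because this mode relies on embedding files prepared in advance, a missing file is a likely source of failure. The sketch below checks that every filename in the 7th column exists in the embedding directory. It is an illustration, not project code: `missing_embeddings` is a hypothetical helper, the presence of a header row is an assumption controlled by `has_header`, and the column names and `.pt` extension in the demo data are placeholders.

```python
import csv
import os
import tempfile
from typing import List


def missing_embeddings(csv_path: str, emb_dir: str, has_header: bool = True) -> List[str]:
    """Return embedding filenames from the 7th CSV column that are absent from emb_dir."""
    missing = []
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # assumption: the first row is a header
        for row in reader:
            if len(row) < 7:
                continue  # skip short or malformed rows
            filename = row[6]  # 7th column, zero-based index 6
            if not os.path.exists(os.path.join(emb_dir, filename)):
                missing.append(filename)
    return missing


# Self-contained demo on temporary placeholder data.
with tempfile.TemporaryDirectory() as tmp:
    emb_dir = os.path.join(tmp, "embs")
    os.makedirs(emb_dir)
    open(os.path.join(emb_dir, "a.pt"), "w").close()
    csv_path = os.path.join(tmp, "demo.csv")
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["c1", "c2", "c3", "c4", "c5", "c6", "emb_file"])
        writer.writerow(["x", "", "", "", "", "", "a.pt"])
        writer.writerow(["y", "", "", "", "", "", "b.pt"])
    print(missing_embeddings(csv_path, emb_dir))
```

For the demo data of this project, the call would take the paths used in the command below: `../data/rdrp/demo/demo.csv` and `../data/rdrp/demo/embs/esm2_t36_3B_UR50D`.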

cd LucaProt/src/prediction/   
sh run_predict_from_file.sh

Or:

cd LucaProt/src/

# using GPU(cuda=0)   
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python predict.py \
    --data_path ../data/rdrp/demo/demo.csv \
    --emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --evaluate \
    --threshold 0.5 \
    --batch_size 16 \
    --print_per_batch 100 \
    --gpu_id 0 
    
# using CPU(gpu_id=-1)          
python predict.py \
    --data_path ../data/rdrp/demo/demo.csv \
    --emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
    --dataset_name rdrp_40_exten
