LucaProt

LucaProt (DeepProtFunc) is an open-source project developed by Alibaba and licensed under the Apache License (Version 2.0).

This product contains various third-party components under other open source licenses.
See the NOTICE file for more information.

Notice:
This repository provides the Python dependency installation file, the installation commands, and the commands for running the trained LucaProt model for inference or prediction. The models run on Linux, macOS, and Windows, and support both CPU and GPU inference.

TimeLine

2025-05-01:
Add a deployable web application (based on Flask, in src/app/).
Start the service on the server:
```
cd LucaProt/src/app  
# modify the service port in app.py    
python app.py     
```
Then the service can be accessed using the browser on the client: http://${server_ip}:8000
2025-04-17:
Add the post-processing workflow that classifies the viral RdRPs predicted by LucaProt into our 180 supergroups or into novel supergroups.
(See PostProcessingWorkflow.md or PostProcessingWorkflow_zh.md in this project for guidance.)

2024-09-01:
Optimize the inference and prediction code to run on GPUs with limited memory, such as the A10.

Introduction

LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structural information to predict protein function.

1. Model

1) Model Introduction

We developed a new deep learning model, namely, Deep Sequential and Structural Information Fusion Network for Proteins Function Prediction (DeepProtFunc/LucaProt), which takes into account protein sequence and structural information to facilitate the accurate annotation of protein function.

Here, we applied LucaProt to identify viral RdRP.

2) Model Architecture

We treat protein function prediction as a classification problem. For example, viral RdRP identification is a binary-class classification task, and protein general function annotation is a multi-label classification task. The model includes five modules: Input, Tokenizer, Encoder, Pooling, and Output. Its architecture is shown in Figure 1.

Figure 1 The Architecture of LucaProt


3) Model Input/Output

The model takes the amino acid sequence of a protein (one-letter codes) as input. It outputs the function label of the input protein, which is either a single tag (binary-class or multi-class classification) or a set of tags (multi-label classification).
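
The difference between the three output types can be illustrated with a small, self-contained sketch. This is not LucaProt's own code: the function names and the example label names are hypothetical, and the sketch assumes the model yields one raw score (logit) per label and that a probability equal to the threshold counts as positive.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_binary(logit: float, threshold: float = 0.5) -> int:
    """Binary-class: a single tag, 1 if the probability reaches the threshold."""
    return int(sigmoid(logit) >= threshold)


def decode_multi_class(logits: list) -> int:
    """Multi-class: a single tag, the index of the highest score (no threshold)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def decode_multi_label(logits: list, labels: list, threshold: float = 0.5) -> set:
    """Multi-label: a set of tags, one for every probability reaching the threshold."""
    return {name for name, z in zip(labels, logits) if sigmoid(z) >= threshold}


if __name__ == "__main__":
    # Hypothetical scores and label names, for illustration only.
    print(decode_binary(2.0))
    print(decode_multi_class([0.1, 3.0, -1.0]))
    print(decode_multi_label([2.0, -2.0, 0.5], ["label_a", "label_b", "label_c"]))
```

Viral RdRP identification corresponds to the binary case: one score, one threshold, one tag.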

2. Dependencies

System: Ubuntu 20.04.5 LTS
Python: 3.9.13
Download anaconda: <a href="https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh" title="anaconda"> anaconda </a>
Cuda: <a href="https://developer.nvidia.com/cuda-11-7-0-download-archive" title="cuda11.7 (torch==1.13.1)"> cuda11.7 (torch==1.13.1)</a>

# Select 'YES' during installation for initializing the conda environment  
sh Anaconda3-2022.10-Linux-x86_64.sh  
# Source the environment
source ~/.bashrc  
# Verification
conda  
# Install env and python 3.9.13   
conda create -n lucaprot python=3.9.13    
# activate env
conda activate lucaprot  
# Install git      
sudo apt-get update         
sudo apt install git-all

# Enter the project   
cd LucaProt     

# Install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
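
After installation, a quick sanity check of the environment can save a failed run later. The following is a minimal sketch, not part of the LucaProt repository: the function name `check_environment` is hypothetical, and it only assumes that `torch` is among the installed requirements (as the CUDA entry above suggests).

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Collect basic facts about the active Python environment."""
    info = {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
    }
    if info["torch_installed"]:
        import torch  # imported lazily so the check also runs without torch

        info["torch_version"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as False, the CPU variants of the prediction commands (`--gpu_id -1`) are the ones to use.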

3. Inference

You can use this project to run inference or prediction on unknown sequences.

1) Prediction from one sequence

cd LucaProt/src/prediction/ 
sh run_predict_one_sample.sh

Note: the embedding matrix of the sample is computed in real time during prediction.

Or:

cd LucaProt/src/

# using GPU(cuda=0)    
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python predict_one_sample.py \
    --protein_id protein_1 \
    --sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ	\
    --emb_dir ./emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5 \
    --gpu_id 0
  
# using CPU(gpu_id=-1)    
python predict_one_sample.py \
    --protein_id protein_1 \
    --sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ	\
    --emb_dir ./emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5 \
    --gpu_id -1

--protein_id: str, the protein ID.
--sequence: str, the protein sequence.
--truncation_seq_length: int, truncate sequences longer than this value. Recommended values: 4096, 2048, 1984, 1792, 1534, 1280, 1152, 1024; default: 4096.
--emb_dir (optional): path, the directory in which the embedding matrix or vector computed for the protein during prediction is saved.
--dataset_name: str, the dataset name used to build our trained model (rdrp_40_extend).
--dataset_type: str, the dataset type used to build our trained model (protein).
--task_type: str, the task type used to build our trained model (binary_class).
--model_type: str, the model name used to build our trained model (sefn).
--time_str: str, the running time string (yyyymmddHimiss) of our trained model (20230201140320).
--step: int, the global training step at which the model was finalized (100000).
--threshold: float, the sigmoid threshold for binary-class or multi-label classification; None for multi-class classification; default: 0.5.
--gpu_id: int, the GPU ID to use (-1 for CPU).
--torch_hub_dir (optional): str, the torch hub directory in which the pretrained model is saved (default: ~/.cache/torch/hub/).
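
When predictions are launched from another Python program, the command line can be assembled from these options instead of being typed by hand. The sketch below is an illustration only: `build_predict_command` is a hypothetical helper, the flags and their values are the ones documented in this section, and the demo sequence is a shortened placeholder. It assumes the command is run from LucaProt/src/.

```python
import shlex
import sys


def build_predict_command(protein_id: str, sequence: str, gpu_id: int = -1,
                          truncation_seq_length: int = 4096,
                          threshold: float = 0.5) -> list:
    """Assemble the argument list for predict_one_sample.py."""
    return [
        sys.executable, "predict_one_sample.py",
        "--protein_id", protein_id,
        "--sequence", sequence,
        "--emb_dir", "./emb/",
        "--truncation_seq_length", str(truncation_seq_length),
        "--dataset_name", "rdrp_40_extend",
        "--dataset_type", "protein",
        "--task_type", "binary_class",
        "--model_type", "sefn",
        "--time_str", "20230201140320",
        "--step", "100000",
        "--threshold", str(threshold),
        "--gpu_id", str(gpu_id),  # -1 selects the CPU
    ]


if __name__ == "__main__":
    # Shortened placeholder sequence, for illustration only.
    cmd = build_predict_command("protein_1", "MTTSTAFTGKTLMITGG")
    print(shlex.join(cmd))
    # To actually run it from LucaProt/src/:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with long sequences.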

2) Prediction from many sequences

The samples are read from a *.fasta file and predicted one sample at a time.

--fasta_file: str, the FASTA file containing the samples.
--save_file: str, the file path to which the predicted results are saved.
--print_per_number: int, print progress information each time this number of samples has been completed; default: 100.

cd LucaProt/src/prediction/   
sh run_predict_many_samples.sh

Or:

cd LucaProt/src/

# using GPU(cuda=0)   
export CUDA_VISIBLE_DEVICES="0,1,2,3"  
python predict_many_samples.py \
	--fasta_file ../data/rdrp/test/test.fasta  \
	--save_file ../result/rdrp/test/test_result.csv  \
	--emb_dir ../emb/   \
	--truncation_seq_length 4096  \
	--dataset_name rdrp_40_extend  \
	--dataset_type protein     \
	--task_type binary_class     \
	--model_type sefn     \
	--time_str 20230201140320   \
	--step 100000  \
	--threshold 0.5 \
	--print_per_number 10 \
	--gpu_id 0
	

# using CPU(gpu_id=-1)               
python predict_many_samples.py \
	--fasta_file ../data/rdrp/test/test.fasta  \
	--save_file ../result/rdrp/test/test_result.csv  \
	--emb_dir ../emb/   \
	--truncation_seq_length 4096  \
	--dataset_name rdrp_40_extend  \
	--dataset_type protein     \
	--task_type binary_class     \
	--model_type sefn     \
	--time_str 20230201140320   \
	--step 100000  \
	--threshold 0.5 \
	--print_per_number 10 \
	--gpu_id -1
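
To make the sample-by-sample flow concrete, here is a minimal sketch of reading FASTA records and reporting progress at a fixed interval. It is not the code of predict_many_samples.py: `read_fasta` and `predict_all` are hypothetical names, `predict_fn` stands in for the real model call, and taking the first whitespace-delimited token of the header as the protein ID is an assumption.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (protein_id, sequence) pairs from FASTA-formatted lines."""
    protein_id, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if protein_id is not None:
                yield protein_id, "".join(chunks)
            header = line[1:].split()
            # Assumption: the first token of the header line is the ID.
            protein_id, chunks = (header[0] if header else ""), []
        elif protein_id is not None:
            chunks.append(line)  # sequences may span several lines
    if protein_id is not None:
        yield protein_id, "".join(chunks)


def predict_all(records: Iterable[Tuple[str, str]],
                predict_fn: Callable[[str], object],
                print_per_number: int = 100) -> List[Tuple[str, object]]:
    """Apply predict_fn to each sequence, printing progress periodically."""
    results = []
    for done, (protein_id, sequence) in enumerate(records, start=1):
        results.append((protein_id, predict_fn(sequence)))
        if done % print_per_number == 0:
            print(f"{done} sequences done")
    return results


if __name__ == "__main__":
    demo = [">p1 example", "MKV", "LLA", ">p2", "GG"]  # made-up records
    # len() is only a stand-in for a real prediction function.
    print(predict_all(read_fasta(demo), predict_fn=len, print_per_number=1))
```

Because `read_fasta` is a generator, only one record is held in memory at a time, which matches the one-at-a-time prediction described above.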

3) Prediction from a file (embedding files prepared in advance)

The test data (small and real) is in demo.csv, where the 7th column of each line is the filename of the structural embedding information prepared in advance.
The structural embedding files are stored in embs.

The test data includes 50 viral-RdRPs and 50 non-viral RdRPs.
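
Before running the prediction, it can be useful to confirm that every embedding file named in the 7th column actually exists. The sketch below shows only the column extraction. It is a hedged illustration: `embedding_filenames` is a hypothetical helper, whether demo.csv has a header row is an assumption (controlled by `has_header`), and the sample rows are made up.

```python
import csv
import io


def embedding_filenames(csv_text: str, has_header: bool = True,
                        column: int = 6) -> list:
    """Return the 7th column (index 6) of every data row."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row, if any
    # Rows that are too short to have a 7th column are ignored.
    return [row[column] for row in reader if len(row) > column]


if __name__ == "__main__":
    # Made-up content with placeholder column names, for illustration only.
    sample = (
        "c1,c2,c3,c4,c5,c6,c7\n"
        "a,b,c,d,e,f,emb_0.pt\n"
        "a,b,c,d,e,f,emb_1.pt\n"
    )
    print(embedding_filenames(sample))
```

With the real file, the text would come from `open("../data/rdrp/demo/demo.csv").read()`, and each returned name could be checked against the directory passed as `--emb_dir`.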

cd LucaProt/src/prediction/   
sh run_predict_from_file.sh

Or:

cd LucaProt/src/

# using GPU(cuda=0)   
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python predict.py \
    --data_path ../data/rdrp/demo/demo.csv \
    --emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --evaluate \
    --threshold 0.5 \
    --batch_size 16 \
    --print_per_batch 100 \
    --gpu_id 0 
    
# using CPU(gpu_id=-1)          
python predict.py \
    --data_path ../data/rdrp/demo/demo.csv \
    --emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
    --dataset_name rdrp_40_exten
