LucaOneTasks
The project of the downstream tasks based on LucaOne's Embedding.
Downstream Tasks of LucaOne
TimeLine
- 2025/12/31
  LucaOne now supports the Hugging Face interface for further training.
  It allows for various training modes, including using sequence-only inputs or injecting biological knowledge following the LucaOne framework. You can fine-tune the model for both sequence-level and token-level classification or regression tasks.
  Please refer to the Hugging Face address: https://huggingface.co/collections/LucaGroup/lucaone, or the huggingface branch of this repository.
  - Hugging Face Native: Full support for AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoConfig, and AutoTokenizer.
  - Unified Architecture: Single model architecture handling multiple biological modalities.
  - Task-Specific Heads:
    - LucaGPLMModel: For sequence embedding.
    - LucaGPLMForMaskedLM: For pre-training and sequence recovery.
    - LucaGPLMForSequenceClassification: For sequence-level tasks (e.g., protein family, solubility, or promoter prediction).
    - LucaGPLMForTokenClassification: For residue-level tasks (e.g., secondary structure, binding sites, or post-translational modifications).
  - Extensible: Easily adaptable to custom downstream tasks using the standard transformers API.
- 2025/12/26:
  LucaOne now supports BF16 for embedding inference.
  add parameter: --use_bf16
- 2025/08/15:
  Huggingface
  <a href='https://huggingface.co/LucaGroup'>https://huggingface.co/LucaGroup</a>
- 2025/04/08:
  - LucaOne
    add checkpoint=36000000 for LucaOne
    location: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/latest/models/lucaone/lucaone/checkpoint-step36000000/'>checkpoint-step36000000</a>
  - LucaOne-Gene
    add checkpoint=36800000 for LucaOne-Gene (only trained using DNA and RNA)
    location: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/latest/models/lucaone/lucaone-gene/checkpoint-step36800000/'>checkpoint-step36800000</a>
  - LucaOne-Prot
    add checkpoint=30000000 for LucaOne-Prot (only trained using Protein)
    location: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/latest/models/lucaone/lucaone-prot/checkpoint-step30000000/'>checkpoint-step30000000</a>
- 2024/10/01: optimized embedding inference code: src/llm/lucagplm/get_embedding.py
- 2024/08/01: add checkpoint=17600000, location: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/models/lucagplm/v2.0/token_level,span_level,seq_level,structure_level/lucaone_gplm/20231125113045/checkpoint-step17600000/'>checkpoint-step17600000</a>
This project downloads the checkpoint automatically from our FTP according to the values of the following parameters:
- --llm_type
- --llm_version
- --llm_step
Embedding Recommendation
| --llm_type | --llm_version | --llm_step | Usage (seq_type) |
|:----------:|:--------------:|:------------------------------------:|:------------------------------------------------:|
| lucaone | lucaone | 36000000, 17600000, or 5600000 | both gene (i.e. DNA, RNA) and prot sequences |
| lucaone | lucaone-gene | 36800000 | only for gene (i.e. DNA, RNA) sequences |
| lucaone | lucaone-prot | 30000000 | only for prot sequences |
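
The table above can be read as a compatibility check between a checkpoint and the sequence types in a dataset. The following is a minimal sketch of that reading, not code from this project: the helper name `is_supported` and the use of the labels `gene` and `prot` as sequence-type values are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set

# Transcribed from the Embedding Recommendation table.
# Keys are --llm_version values; "steps" are the listed --llm_step values.
SUPPORTED: Dict[str, Dict[str, Set[str]]] = {
    "lucaone": {
        "steps": {"36000000", "17600000", "5600000"},
        "seq_types": {"gene", "prot"},
    },
    "lucaone-gene": {"steps": {"36800000"}, "seq_types": {"gene"}},
    "lucaone-prot": {"steps": {"30000000"}, "seq_types": {"prot"}},
}


def is_supported(llm_version: str, llm_step, seq_types: Iterable[str]) -> bool:
    """Hypothetical helper: True if the table lists this version/step pair
    and the version covers every sequence type in `seq_types`."""
    entry = SUPPORTED.get(llm_version)
    if entry is None:
        return False
    return str(llm_step) in entry["steps"] and set(seq_types) <= entry["seq_types"]


if __name__ == "__main__":
    print(is_supported("lucaone", 36000000, ["gene", "prot"]))
    print(is_supported("lucaone-gene", 36800000, ["prot"]))
```

A check like this could run before embedding starts, so that a protein dataset is not paired with the gene-only checkpoint by mistake.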
1. Networks
Three distinct networks correspond to three different types of inputs:
- LucaBase(Single)
- LucaPPI(Homogeneous Pair)
- LucaPPI2(Heterogeneous Pair)
Fig. 1 Downstream task network with three input types and results comparison of 8 verification tasks.
- Central Dogma (Central Dogma of Molecular Biology)
  Input: DNA + Protein (heterogeneous double sequence)
  Network: LucaPPI2 (src/ppi/models/LucaPPI2)
- SupKTax (SuperKingdom Taxonomy Annotation)
  Input: DNA (single sequence)
  Network: LucaBase (src/common/luca_base)
- GenusTax (Genus Taxonomy Annotation)
  Input: DNA (single sequence)
  Network: LucaBase (src/common/luca_base)
- SpeciesTax (Species Taxonomy Annotation)
  Input: DNA (single sequence)
  Network: LucaBase (src/common/luca_base)
- ProtLoc (Prokaryotic Protein Subcellular Location)
  Input: Protein (single sequence)
  Network: LucaBase (src/common/luca_base)
- ProtStab (Protein Stability)
  Input: Protein (single sequence)
  Network: LucaBase (src/common/luca_base)
- ncRNAFam (Non-coding RNA Family)
  Input: RNA (single sequence)
  Network: LucaBase (src/common/luca_base)
- InfA (Influenza A Antigenic Relationship Prediction)
  Input: RNA + RNA (homogeneous double sequence)
  Network: LucaPPI (src/ppi/models/LucaPPI)
- PPI (Protein-Protein Interaction)
  Input: Protein + Protein (homogeneous double sequence)
  Network: LucaPPI (src/ppi/models/LucaPPI)
- ncRPI (ncRNA-Protein Interactions)
  Input: DNA + Protein (heterogeneous double sequence)
  Network: LucaPPI2 (src/ppi/models/LucaPPI2)
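
The rule that maps an input type to one of the three networks in this section can be sketched as a small function. This is an illustrative sketch only: `select_network` is a hypothetical name, and comparing the molecule labels literally (so that any two different labels count as a heterogeneous pair) is an assumption rather than the project's documented behavior.

```python
from typing import Sequence


def select_network(seq_types: Sequence[str]) -> str:
    """Hypothetical helper: pick a network name from the input sequence types.

    One sequence            -> LucaBase  (single)
    Two of the same type    -> LucaPPI   (homogeneous pair)
    Two of different types  -> LucaPPI2  (heterogeneous pair)
    """
    if len(seq_types) == 1:
        return "LucaBase"
    if len(seq_types) == 2:
        first, second = seq_types
        return "LucaPPI" if first == second else "LucaPPI2"
    raise ValueError("expected one or two input sequences")


if __name__ == "__main__":
    print(select_network(["DNA"]))                 # e.g. SpeciesTax
    print(select_network(["Protein", "Protein"]))  # e.g. PPI
    print(select_network(["DNA", "Protein"]))      # e.g. Central Dogma
```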
2. Environment Installation
step1: update git
1) centos
sudo yum update
sudo yum install git-all
2) ubuntu
sudo apt-get update
sudo apt install git-all
step2: install python 3.9
1) download anaconda3
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
2) install conda
sh Anaconda3-2022.05-Linux-x86_64.sh
Notice: Select Yes to update ~/.bashrc
source ~/.bashrc
3) create a virtual environment: python=3.9.13
conda create -n lucaone_tasks python=3.9.13
4) activate lucaone_tasks
conda activate lucaone_tasks
step3: install other requirements
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
for DNABert2 Embedding
Notice: DNABert2 needs a separate virtual environment; leave the current one first
conda deactivate
conda create -n lucaone_tasks_dnabert2 python=3.9.13
conda activate lucaone_tasks_dnabert2
pip install -r requirements_dnabert2.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
3. Datasets
Downstream Tasks Dataset FTP: <a href='http://47.93.21.181/lucaone/DownstreamTasksDataset/dataset/'>Dataset for LucaOneTasks</a>
Copy the 10 datasets from <a href='http://47.93.21.181/lucaone/DownstreamTasksDataset/dataset/'>http://47.93.21.181/lucaone/DownstreamTasksDataset/dataset/</a> into the directory ./dataset/
4. LucaOne Trained Checkpoint
Trained LucaOne Checkpoint FTP: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/latest'>TrainedCheckPoint for LucaOne</a>
Notice
The project downloads the LucaOne Trained-CheckPoint automatically from the FTP.
If the automatic download fails, you can download it manually:
Copy the TrainedCheckPoint files (models/ + logs/) from <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/'>http://47.93.21.181/lucaone/TrainedCheckPoint/</a> into the directory ./llm/
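
For a manual download it helps to know which remote folder a given parameter combination corresponds to. The sketch below rebuilds the folder URL from the three parameters, using the pattern visible in the 2025/04/08 checkpoint links of the timeline. It is an inference from those three links, not the project's own download logic: `checkpoint_url` is a hypothetical helper, treating the first path segment after `models/` as `--llm_type` is an assumption, and the older 17600000 checkpoint lives under a different path that this pattern does not cover.

```python
# Base folder taken from the "Trained LucaOne Checkpoint FTP" link above.
BASE_URL = "http://47.93.21.181/lucaone/TrainedCheckPoint/latest"


def checkpoint_url(llm_type: str, llm_version: str, llm_step) -> str:
    """Hypothetical helper: build the folder URL of a checkpoint.

    Assumed pattern (inferred from the timeline links):
    <BASE_URL>/models/<llm_type>/<llm_version>/checkpoint-step<llm_step>/
    """
    step = int(llm_step)  # accepts "36000000" or 36000000
    return f"{BASE_URL}/models/{llm_type}/{llm_version}/checkpoint-step{step}/"


if __name__ == "__main__":
    print(checkpoint_url("lucaone", "lucaone-gene", 36800000))
```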
5. Usage of LucaOne Embedding (the LucaOneApp project can also be used)
Methods of using embedding:
In this project, sequences are embedded on the fly while the downstream task is being trained (./src/encoder.py).
Alternatively, the dataset can be embedded in advance and stored in a predefined folder, after which the downstream network is built and trained.
The script for embedding a dataset is ./src/llm/lucagplm/get_embedding.py.
Suggestions and Instructions:
- Use a machine with large GPU memory for embedding inference, such as an A100, H100, or H200, so that long sequences can be processed in one pass.
  On an A100, LucaOne can process sequences of about 3400 in length at one time;
- For long sequences, LucaOne splits the sequence into overlapping fragments, embeds each fragment, and finally merges them into a complete embedding matrix.
  Please set --embedding_complete and --embedding_complete_seg_overlap;
- If the GPU memory is not enough to process a longer sequence, the CPU is used for embedding, which is slower.
  If your dataset is small, you can set: --gpu_id -1;
- If your dataset includes a lot of long sequences (more than 10,000 sequences), please set: --embedding_complete, --embedding_complete_seg_overlap, and --embedding_fixed_len_a_time (the maximum length embedded at one time).
  If the sequence length is greater than the value of --embedding_fixed_len_a_time, the sequence is embedded fragment by fragment based on this value and the fragments are then merged; otherwise, it is embedded at its actual length;
- If --embedding_complete is not set, the code truncates the sequence embedding according to the value of --truncation_seq_length;
- For proteins, most sequences are shorter than 1000 and ultra-long protein sequences are rare, so --embedding_fixed_len_a_time can be set to a large value or left unset;
- For DNA, the sequences in many tasks are very long; please set --embedding_fixed_len_a_time.
  The more ultra-long sequences the dataset contains, the smaller this value should be, such as 3400 on an A100.
  If the GPU fails to embed a longer sequence, the CPU is called instead.
  When the dataset is not large, this does not take long;
- For RNA, most sequences are not very long, so they can be handled like proteins: --embedding_fixed_len_a_time can be set to a large value or left unset;
- You can set --use_bf16 when embedding long sequences;
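
The overlapped-fragment strategy described in the suggestions above can be illustrated with a short, self-contained sketch. It is not the code in get_embedding.py. The function names, the toy per-token embedder, and in particular the merge rule (rows in an overlap are kept from the earlier fragment) are assumptions chosen for illustration; the project may merge overlaps differently. Here `fixed_len` plays the role of --embedding_fixed_len_a_time and `overlap` the role of --embedding_complete_seg_overlap.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]  # one row (vector) per sequence position


def fragment_spans(seq_len: int, fixed_len: int, overlap: int) -> List[Tuple[int, int]]:
    """Return [start, end) spans of at most `fixed_len` positions,
    where consecutive spans share `overlap` positions."""
    if fixed_len <= 0 or not 0 <= overlap < fixed_len:
        raise ValueError("need fixed_len > 0 and 0 <= overlap < fixed_len")
    if seq_len <= fixed_len:
        return [(0, seq_len)]  # short sequence: embed at its actual length
    spans = []
    start, step = 0, fixed_len - overlap
    while True:
        end = min(start + fixed_len, seq_len)
        spans.append((start, end))
        if end == seq_len:
            return spans
        start += step


def embed_complete(seq: str, embed_fn: Callable[[str], Matrix],
                   fixed_len: int, overlap: int) -> Matrix:
    """Embed `seq` fragment by fragment and merge into one matrix.

    Assumption: positions already covered by an earlier fragment are
    skipped, so each position keeps the row from the first fragment
    that contains it.
    """
    merged: Matrix = []
    for start, end in fragment_spans(len(seq), fixed_len, overlap):
        rows = embed_fn(seq[start:end])
        merged.extend(rows[len(merged) - start:])
    return merged


def toy_embed(fragment: str) -> Matrix:
    """Stand-in for the real model: one 1-dimensional vector per character."""
    return [[float(ord(ch))] for ch in fragment]


if __name__ == "__main__":
    print(fragment_spans(10, 4, 1))
    print(len(embed_complete("ACGTACGTAC", toy_embed, 4, 1)))
```

With a real model, rows near a fragment boundary see less context than rows in the middle, which is why an overlap is used at all; a larger overlap costs more compute but gives boundary positions more context.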
1) the csv file format of input
Notice:
a. need to specify the column index of the sequence id(*id_id
