Downstream Tasks of LucaOne

Timeline

2025/12/31
LucaOne now supports the Hugging Face interface for further training.
It allows for various training modes, including using sequence-only inputs or injecting biological knowledge following the LucaOne framework. You can fine-tune the model for both sequence-level and token-level classification or regression tasks.
For details, see the Hugging Face collection at https://huggingface.co/collections/LucaGroup/lucaone or the huggingface branch of this repository.
- Hugging Face Native: Full support for AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoConfig, and AutoTokenizer.
- Unified Architecture: Single model architecture handling multiple biological modalities.
- Task-Specific Heads:
  - LucaGPLMModel: For sequence embedding.
  - LucaGPLMForMaskedLM: For pre-training and sequence recovery.
  - LucaGPLMForSequenceClassification: For sequence-level tasks (e.g., protein family, solubility, or promoter prediction).
  - LucaGPLMForTokenClassification: For residue-level tasks (e.g., secondary structure, binding sites, or post-translational modifications).
- Extensible: Easily adaptable to custom downstream tasks using the standard transformers API.
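
As a minimal sketch of the Hugging Face workflow listed above, the helper below loads a tokenizer and a sequence-classification model through the standard transformers Auto classes. The function name is hypothetical, `model_id` must be replaced with a real model ID from the LucaGroup collection (none is assumed here), and `trust_remote_code=True` is an assumption that the model ships custom modeling code. Any LucaOne-specific tokenizer arguments are not covered.

```python
def load_for_sequence_classification(model_id: str, num_labels: int):
    """Load a tokenizer and classification model (hypothetical helper).

    model_id is a placeholder: pass a model ID from the LucaGroup
    collection on Hugging Face.
    """
    # Imported lazily so the module can be imported without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumption: the model relies on custom code hosted with the weights.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=num_labels,  # size of the sequence-level label set
        trust_remote_code=True,
    )
    return tokenizer, model
```

For token-level tasks, the same pattern should apply with AutoModelForTokenClassification in place of AutoModelForSequenceClassification.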
2025/12/26:
LucaOne now supports BF16 for embedding inference.
Enable it with the new parameter: --use_bf16
2025/08/15:
Hugging Face organization page: <a href='https://huggingface.co/LucaGroup'>https://huggingface.co/LucaGroup</a>
2025/04/08:
- LucaOne
  add checkpoint=36000000 for LucaOne
  location: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/latest/models/lucaone/lucaone/checkpoint-step36000000/'>checkpoint-step36000000</a>
- LucaOne-Gene
  add checkpoint=36800000 for LucaOne-Gene (only trained using DNA and RNA)
  location: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/latest/models/lucaone/lucaone-gene/checkpoint-step36800000/'>checkpoint-step36800000</a>
- LucaOne-Prot
  add checkpoint=30000000 for LucaOne-Prot (only trained using Protein)
  location: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/latest/models/lucaone/lucaone-prot/checkpoint-step30000000/'>checkpoint-step30000000</a>
2024/10/01: optimized embedding inference code: src/llm/lucagplm/get_embedding.py
2024/08/01: add checkpoint=17600000, location: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/models/lucagplm/v2.0/token_level,span_level,seq_level,structure_level/lucaone_gplm/20231125113045/checkpoint-step17600000/'>checkpoint-step17600000</a>

This project automatically downloads the checkpoint from our FTP server according to the values of the following parameters:

--llm_type
--llm_version
--llm_step
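
The sketch below shows one way the three parameters could map to a checkpoint location. It is only an illustration: the URL layout is inferred from the checkpoint links in the timeline, the helper name is hypothetical, and the assumption that --llm_type and --llm_version correspond to the two path components (for example lucaone and lucaone-gene) is not confirmed by the project.

```python
from urllib.parse import quote

# Assumed base URL, inferred from the checkpoint links in the timeline.
BASE_URL = "http://47.93.21.181/lucaone/TrainedCheckPoint/latest/models"


def checkpoint_url(llm_type: str, llm_version: str, llm_step: int) -> str:
    """Build a checkpoint directory URL (hypothetical helper)."""
    if llm_step <= 0:
        raise ValueError("llm_step must be a positive integer")
    parts = [quote(llm_type), quote(llm_version), f"checkpoint-step{llm_step}"]
    return "/".join([BASE_URL, *parts]) + "/"


print(checkpoint_url("lucaone", "lucaone", 36000000))
# http://47.93.21.181/lucaone/TrainedCheckPoint/latest/models/lucaone/lucaone/checkpoint-step36000000/
```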

Embedding Recommendation

1. Networks

Three distinct networks correspond to three different types of inputs:

LucaBase(Single)
LucaPPI(Homogeneous Pair)
LucaPPI2(Heterogeneous Pair)

Fig. 1 Downstream task network with three input types and results comparison of 8 verification tasks.


Central Dogma(Central Dogma of Molecular Biology)
Input: DNA + Protein(heterogeneous double sequence)
Network: LucaPPI2(src/ppi/models/LucaPPI2)
SupKTax(SuperKingdom Taxonomy Annotation)
Input: DNA(single sequence)
Network: LucaBase(src/common/luca_base)
GenusTax(Genus Taxonomy Annotation)
Input: DNA(single sequence)
Network: LucaBase(src/common/luca_base)
SpeciesTax(Species Taxonomy Annotation)
Input: DNA(single sequence)
Network: LucaBase(src/common/luca_base)
ProtLoc(Prokaryotic Protein Subcellular Location)
Input: Protein(single sequence)
Network: LucaBase(src/common/luca_base)
ProtStab(Protein Stability)
Input: Protein(single sequence)
Network: LucaBase(src/common/luca_base)
ncRNAFam(Non-coding RNA Family)
Input: RNA(single sequence)
Network: LucaBase(src/common/luca_base)
InfA(Influenza A Antigenic Relationship Prediction)
Input: RNA + RNA(homogeneous double sequence)
Network: LucaPPI(src/ppi/models/LucaPPI)
PPI(Protein-Protein Interaction)
Input: Protein + Protein(homogeneous double sequence)
Network: LucaPPI(src/ppi/models/LucaPPI)
ncRPI(ncRNA-Protein Interactions)
Input: RNA + Protein(heterogeneous double sequence)
Network: LucaPPI2(src/ppi/models/LucaPPI2)
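
The rule behind the task list above can be sketched as a small lookup: a single input uses LucaBase, a pair of the same molecule type uses LucaPPI, and a pair of different types uses LucaPPI2. The function name and the string labels for input types are hypothetical. The project itself selects the network through its own configuration.

```python
from typing import Sequence


def select_network(input_types: Sequence[str]) -> str:
    """Pick a network name from the input molecule types (hypothetical helper)."""
    if len(input_types) == 1:
        return "LucaBase"  # single sequence
    if len(input_types) == 2:
        first, second = (t.lower() for t in input_types)
        # Same molecule type: homogeneous pair; otherwise heterogeneous pair.
        return "LucaPPI" if first == second else "LucaPPI2"
    raise ValueError("expected one or two input types")


print(select_network(["Protein"]))         # LucaBase (e.g. ProtStab)
print(select_network(["RNA", "RNA"]))      # LucaPPI  (e.g. InfA)
print(select_network(["DNA", "Protein"]))  # LucaPPI2 (e.g. Central Dogma)
```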

2. Environment Installation

step1: update git

1) CentOS

sudo yum update
sudo yum install git-all

2) Ubuntu

sudo apt-get update
sudo apt install git-all

step2: install python 3.9

1) download anaconda3

wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh

2) install conda

sh Anaconda3-2022.05-Linux-x86_64.sh

Notice: Select "yes" when the installer asks whether to update ~/.bashrc, then reload it:

source ~/.bashrc

3) create a virtual environment: python=3.9.13

conda create -n lucaone_tasks python=3.9.13

4) activate lucaone_tasks

conda activate lucaone_tasks

step3: install other requirements

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

For DNABert2 embedding

Notice: DNABert2 needs its own virtual environment. Deactivate the current environment first:

conda deactivate

conda create -n lucaone_tasks_dnabert2 python=3.9.13

conda activate lucaone_tasks_dnabert2

pip install -r requirements_dnabert2.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

3. Datasets

Downstream Tasks Dataset FTP: <a href='http://47.93.21.181/lucaone/DownstreamTasksDataset/dataset/'>Dataset for LucaOneTasks</a>

Copy the 10 datasets from http://47.93.21.181/lucaone/DownstreamTasksDataset/dataset/* into the directory ./dataset/

4. LucaOne Trained Checkpoint

Trained LucaOne Checkpoint FTP: <a href='http://47.93.21.181/lucaone/TrainedCheckPoint/latest'>TrainedCheckPoint for LucaOne</a>

Notice
The project automatically downloads the LucaOne trained checkpoint from the FTP server.

If the automatic download fails, you can download it manually:

Copy the TrainedCheckPoint files (models/ + logs/) from http://47.93.21.181/lucaone/TrainedCheckPoint/* into the directory ./llm/

5. Usage of LucaOne Embedding (the LucaOneApp project can also be used)

Methods of using embedding:
In this project, sequences are embedded on the fly while the downstream task is being trained (./src/encoder.py).

Alternatively, you can embed the dataset in advance, store the embeddings in a predefined folder, and then build and train the downstream network.
The script for embedding a dataset is ./src/llm/lucagplm/get_embedding.py.

Suggestions and Instructions:

Use a machine with large GPU memory for embedding inference, such as an A100, H100, or H200, so that long sequences can be processed in a single pass.
On an A100, LucaOne can process sequences of about 3400 in length in one pass.
For longer sequences, LucaOne embeds overlapping fragments of the sequence and then merges them into a complete embedding matrix.
To enable this, set --embedding_complete and --embedding_complete_seg_overlap.
If GPU memory is not enough to process a long sequence, the code falls back to the CPU for embedding, which is slower.
If your dataset is small, you can set --gpu_id -1.
If your dataset includes many long sequences (more than 10,000 sequences), set --embedding_complete, --embedding_complete_seg_overlap, and --embedding_fixed_len_a_time (the maximum length embedded in one pass).
If a sequence is longer than the value of --embedding_fixed_len_a_time, it is embedded in fragments of that length and the fragments are then merged; otherwise, it is embedded at its actual length.
If --embedding_complete is not set, the code truncates the sequence according to the value of --truncation_seq_length.
For proteins, most sequences are shorter than 1000 and ultra-long sequences are rare, so --embedding_fixed_len_a_time can be set to a large value or left unset.
For DNA, the sequences in many tasks are very long, so set --embedding_fixed_len_a_time.
The more ultra-long sequences the dataset contains, the smaller this value should be, for example 3400 on an A100.
If the GPU fails to embed a long sequence, the CPU is used instead.
When the dataset is not large, this does not take long.
For RNA, most sequences are not very long, so they can be handled like proteins: --embedding_fixed_len_a_time can be set to a large value or left unset.
You can also set --use_bf16 when embedding long sequences.
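
The interaction between truncation, fixed-length fragments, and merging described above can be illustrated with a small sketch. This is not the code in get_embedding.py. The function names, the numeric `overlap` argument, and the merge rule (keep the rows of the earlier fragment wherever two fragments overlap) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def plan_segments(
    seq_len: int,
    embedding_complete: bool,
    fixed_len: Optional[int] = None,
    overlap: int = 0,
    truncation_seq_length: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Return the (start, end) windows to embed for one sequence."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if not embedding_complete:
        # Without complete embedding, the sequence is simply truncated.
        end = seq_len if truncation_seq_length is None else min(seq_len, truncation_seq_length)
        return [(0, end)]
    if fixed_len is None or seq_len <= fixed_len:
        return [(0, seq_len)]  # short enough to embed at its actual length
    if not 0 <= overlap < fixed_len:
        raise ValueError("overlap must satisfy 0 <= overlap < fixed_len")
    stride = fixed_len - overlap
    segments, start = [], 0
    while True:
        end = min(start + fixed_len, seq_len)
        segments.append((start, end))
        if end == seq_len:
            return segments
        start += stride


def merge_embeddings(segments: List[Tuple[int, int]], seg_embs: List[list]) -> list:
    """Concatenate per-position rows, skipping rows already covered."""
    merged: list = []
    for (start, _end), emb in zip(segments, seg_embs):
        skip = len(merged) - start  # rows this fragment shares with earlier ones
        merged.extend(emb[skip:])
    return merged


print(plan_segments(10000, True, fixed_len=3400, overlap=400))
# [(0, 3400), (3000, 6400), (6000, 9400), (9000, 10000)]
```

With `embedding_complete=False` and `truncation_seq_length=4096`, the same 10,000-long sequence yields the single window (0, 4096), which mirrors the truncation behavior described above.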

1) CSV format of the input file

Notice:
a. need to specify the column index of the sequence id(*id_id
