SELFormer
SELFormer: Molecular Representation Learning via SELFIES Language Models
Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and materials science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. The majority of the methods proposed so far utilize SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
<img width="650" alt="Figure1_selformer_architecture" src="https://user-images.githubusercontent.com/13165170/229302081-94951d41-6f35-4f0f-a6dc-8c5914984f25.png">
Figure. The schematic representation of the SELFormer architecture and the experiments conducted. Left: the self-supervised pre-training utilizes the transformer encoder module via masked language modeling for learning concise and informative representations of small molecules encoded by their SELFIES notation. Right: the pre-trained model has been fine-tuned independently on numerous molecular property-based classification and regression tasks.
The Architecture of SELFormer
SELFormer is built on the RoBERTa transformer architecture, which shares the architecture of BERT but adds modifications that have been found to improve model performance or provide other benefits. One such modification is the use of byte-level Byte-Pair Encoding (BPE) for tokenization instead of character-level BPE. Another is that RoBERTa is pre-trained exclusively on the masked language modeling (MLM) objective, dropping the next sentence prediction (NSP) task. SELFormer has (i) self-supervised pre-trained models that utilize the transformer encoder module for learning concise and informative representations of small molecules encoded by their SELFIES notation, and (ii) supervised classification/regression models that use the pre-trained model as a base and are fine-tuned on numerous classification- and regression-based molecular property prediction tasks.
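To make the MLM objective concrete, here is a minimal, dependency-free sketch of how one training example can be corrupted. It assumes the conventional BERT/RoBERTa recipe (select about 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged). Whether SELFormer uses exactly these rates is an assumption, and `mask_tokens`, `MASK_TOKEN`, and the toy vocabulary are illustrative names, not part of the repository.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token (illustrative constant)


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original token at selected positions and None elsewhere,
    so a loss would be computed only where the model must reconstruct the input.
    """
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, no prediction target.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep the original token
    return corrupted, labels


# Ethanol in SELFIES is [C][C][O]; the symbols are used as tokens here for
# illustration only (the real tokenizer is BPE-based and may merge symbols).
corrupted, labels = mask_tokens(["[C]", "[C]", "[O]"], vocab=["[C]", "[O]", "[N]"])
```

The model is then trained to predict the entries of `labels` that are not `None` from the corrupted sequence.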
Our pre-trained encoder models are implemented as "RobertaForMaskedLM" and our fine-tuning models as "RobertaForSequenceClassification". For fine-tuning, the SELFormer architecture uses the pre-trained RoBERTa model as its base, followed by the "RobertaClassificationHead" class (for both classification and regression). "RobertaClassificationHead" consists of a dropout layer, a dense layer, a tanh activation function, a second dropout layer, and a final linear layer. During fine-tuning, we forward the sequence output of the pre-trained RoBERTa base model to this classifier.
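The layer order of the classification head can be sketched without any deep learning framework. The snippet below is an illustration only, not the Hugging Face implementation: it shows an inference-time forward pass for a single feature vector, where dropout is the identity. The function names and the toy weights are hypothetical.

```python
import math


def linear(x, weight, bias):
    """Compute W x + b for one input vector; `weight` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def classification_head(features, dense_w, dense_b, out_w, out_b):
    """Inference-time pass: dropout -> dense -> tanh -> dropout -> linear."""
    x = features                      # first dropout: identity at inference time
    x = linear(x, dense_w, dense_b)   # dense layer
    x = [math.tanh(v) for v in x]     # tanh activation
    # second dropout: identity at inference time
    return linear(x, out_w, out_b)    # final linear layer (logits or regression value)


# Toy example: 2-dimensional features, identity dense layer, single output.
output = classification_head(
    features=[1.0, 2.0],
    dense_w=[[1.0, 0.0], [0.0, 1.0]],
    dense_b=[0.0, 0.0],
    out_w=[[1.0, 1.0]],
    out_b=[0.5],
)
```

With a single output unit the head produces a regression value; with one unit per class it produces classification logits, which is why the same head serves both task types.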
Getting Started
We highly recommend the Conda platform for installing dependencies. Following the installation of Conda, please create and activate an environment with dependencies as defined below:
conda create -n SELFormer_env
conda activate SELFormer_env
conda env update --file data/requirements.yml
Generating Molecule Embeddings Using Pre-trained Models
Pre-trained SELFormer models are available for download here. Embeddings of all molecules from CHEMBL30 and CHEMBL33, generated by our best-performing model, are available here.
You can also generate embeddings for your own dataset using the pre-trained models. To do so, you will need SELFIES notations of your molecules. You can use the command below to generate SELFIES notations for your SMILES dataset.
To reproduce our embeddings for the CHEMBL30 dataset, unzip the molecule_dataset_smiles.zip and/or molecule_dataset_selfies.zip files in the data directory and use them as the input SMILES and SELFIES datasets, respectively.
python3 generate_selfies.py --smiles_dataset=data/molecule_dataset_smiles.txt --selfies_dataset=data/molecule_dataset_selfies.csv
- --smiles_dataset: Path of the input SMILES dataset.
- --selfies_dataset: Path of the output SELFIES dataset.
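For readers who want to see what such a conversion involves, here is a minimal sketch. It is not the code of generate_selfies.py: the input layout (one SMILES string per line), the output layout (a one-column CSV with a "selfies" header), and the function name `write_selfies_csv` are all assumptions made for illustration. The encoder is passed in as a callable so the sketch has no third-party imports.

```python
import csv


def write_selfies_csv(smiles_path, selfies_path, encoder):
    """Convert a SMILES file (one molecule per line) into a one-column SELFIES CSV.

    `encoder` is any callable that maps a SMILES string to a SELFIES string,
    for example `selfies.encoder` from the `selfies` package.
    Returns the number of molecules written.
    """
    written = 0
    with open(smiles_path) as src, open(selfies_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["selfies"])  # assumed header name
        for line in src:
            smiles = line.strip()
            if not smiles:
                continue  # skip blank lines
            writer.writerow([encoder(smiles)])
            written += 1
    return written
```

With the `selfies` package installed, the call would look like `write_selfies_csv("in.txt", "out.csv", selfies.encoder)`, where both file names are placeholders.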
To generate embeddings for the SELFIES molecule dataset using a pre-trained model, please run the following command:
python3 produce_embeddings.py --selfies_dataset=data/molecule_dataset_selfies.csv --model_file=data/pretrained_models/SELFormer --embed_file=data/embeddings.csv
- --selfies_dataset: Path of the input SELFIES dataset.
- --model_file: Path of the pretrained model to be used.
- --embed_file: Path of the output embeddings file.
Generating Embeddings Using Pre-trained Models for MoleculeNet Dataset Molecules
The embeddings generated by our best performing pre-trained model for MoleculeNet data can be directly downloaded here.
You can also re-generate these embeddings using the command below.
python3 get_moleculenet_embeddings.py --dataset_path=data/finetuning_datasets --model_file=data/pretrained_models/SELFormer
- --dataset_path: Path of the directory containing the MoleculeNet datasets.
- --model_file: Path of the pretrained model to be used.
Training and Evaluating Models
Pre-Training
To pre-train a model, please run the command below. If you have a SELFIES dataset, you can use it directly by giving the path of the dataset to --selfies_dataset. If you have a SMILES dataset, you can give the path of the dataset to --smiles_dataset and the SELFIES representations will be created at the path given to --selfies_dataset.
python3 train_pretraining_model.py --smiles_dataset=data/molecule_dataset_smiles.txt --selfies_dataset=data/molecule_dataset_selfies.csv --prepared_data_path=data/selfies_data.txt --bpe_path=data/BPETokenizer --roberta_fast_tokenizer_path=data/RobertaFastTokenizer --hyperparameters_path=data/pretraining_hyperparameters.yml --subset_size=100000
- --smiles_dataset: Path of the SMILES dataset. It is required if --selfies_dataset does not exist (optional).
- --selfies_dataset: Path of the SELFIES dataset. If a SELFIES dataset does not exist, it will be created at the given path using the --smiles_dataset. If it exists, SELFIES dataset will be used directly (required).
- --prepared_data_path: Path of the intermediate file that will be created during pre-training. It will be used for tokenization. If it does not exist, it will be created at the given path (required).
- --bpe_path: Path of the BPE tokenizer. If it does not exist, it will be created at the given path (required).
- --roberta_fast_tokenizer_path: Path of the RobertaTokenizerFast tokenizer. If it does not exist, it will be created at the given path (required).
- --hyperparameters_path: Path of the YAML file that contains the hyperparameter sets to be tested. Note that these sets are tested one by one, not in parallel. An example file is available at data/pretraining_hyperparameters.yml (required).
- --subset_size: The size of the subset of the dataset that will be used for pre-training. By default, the whole dataset will be used (optional).
Fine-tuning on Molecular Property Prediction
You can use the commands below to fine-tune a pre-trained model for various molecular property prediction tasks. These commands expect datasets that contain SMILES representations of molecules, stored in a column with the header "smiles". Example datasets are available in the data/finetuning_datasets directory.
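Before launching a fine-tuning run, it can be useful to confirm that a dataset actually has the required "smiles" column. The helper below is a small sketch using only the standard library; `check_finetuning_csv` is a hypothetical name and is not part of the repository.

```python
import csv


def check_finetuning_csv(path, smiles_column="smiles"):
    """Return the number of data rows after confirming the SMILES column exists."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        if smiles_column not in header:
            raise ValueError(
                f"{path}: expected a '{smiles_column}' column, found {header}"
            )
        # Count the remaining rows (the header has already been consumed).
        return sum(1 for _ in reader)
```

The check is case-sensitive, so a header such as "SMILES" is reported as missing rather than silently accepted.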
Binary Classification Tasks
To fine-tune a pre-trained model on a binary classification dataset, please run the command below.
python3 train_classification_model.py --model=data/saved_models/SELFormer --tokenizer=data/RobertaFastTokenizer --dataset=data/finetuning_datasets/classification/bbbp/bbbp.csv --save_to=data/finetuned_models/SELFormer_bbbp_classification --target_column_id=1 --use_scaffold=1 --train_batch_size=16 --validation_batch_size=8 --num_epochs=25 --lr=5e-5 --wd=0
- --model: Directory of the pre-trained model (required)