Apetokenizer
Tokenizer for chemnical SMILES and SELFIES for use in transformers models.
Install / Use
/learn @mikemayuare/ApetokenizerREADME
APE Tokenizer
APE Tokenizer (Atom Pair Encoding Tokenizer) is a tokenizer designed to handle SMILES and SELFIES molecular representations. It works similarly to BPE (Byte Pair Encoding), while ensuring that tokens preserve chemical information, making it ideal for molecular data. This tokenizer is fully compatible with the Hugging Face transformers library and can be easily integrated into any model that uses tokenizers.
Citation
Leon, M., Perezhohin, Y., Peres, F. et al. Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. Sci Rep 14, 25016 (2024). https://doi.org/10.1038/s41598-024-76440-8
Features
- Hugging Face
transformerscompatible: Integrates seamlessly with models in thetransformerslibrary. - Tokenizes both SMILES and SELFIES representations.
- Adds and manages special tokens such as
<pad>,<s>,</s>,<unk>, and<mask>. - Supports vocabulary management, tokenization, padding, and encoding with PyTorch and TensorFlow.
- Trains a vocabulary from a corpus of molecular sequences.
- Provides saving/loading functionalities for vocabularies and tokenizer states.
Installation
To use the APE Tokenizer in your project, clone this repository and install the required dependencies:
git clone https://github.com/your-username/ape-tokenizer.git
cd ape-tokenizer
You’ll also need to install the Hugging Face transformers library:
pip install transformers
Usage
Using APE Tokenizer with transformers
You can easily integrate the APE Tokenizer with Hugging Face's transformers library. Here’s an example of tokenizing a SMILES sequence and using it with a model:
from ape_tokenizer import APETokenizer
# Initialize the tokenizer
tokenizer = APETokenizer()
tokenizer.load_vocabulary("path_to_vocabulary.json")
# Example SMILES string
smiles = "CCO"
# Tokenize the SMILES string
encoded = tokenizer(smiles, add_special_tokens=False)
# Hugging Face compatible input
from transformers import AutoModelforSequenceClassification
# Load a pre-trained model (for demonstration purposes)
model = AutoModelforSequenceClassification.from_pretrained("mikemayuare/SELFY-BPE-BBBP")
# Use the tokenized input in the model
outputs = model(input_ids=encoded['input_ids'], attention_mask=encoded['attention_mask'])
print(outputs)
Loading a Vocabulary
To load a saved vocabulary from a JSON file, use the load_vocabulary method:
from ape_tokenizer import APETokenizer
# Initialize the tokenizer
tokenizer = APETokenizer()
# Load the vocabulary from a JSON file
tokenizer.load_vocabulary("path_to_vocabulary.json")
# Now the tokenizer is ready to tokenize sequences
smiles = "CCO"
encoded = tokenizer(smiles, add_special_tokens=False)
print(encoded)
Tokenizing a DatasetDict
def preprocess(examples):
example = tokenizer(
examples["selfies"],
add_special_tokens=False,
)
return example
tokenized_dataset = dataset.map(preprocess)
Saving and Loading a Vocabulary
To save the current tokenizer vocabulary and state to a JSON file:
# Save the vocabulary and training state to JSON
tokenizer.save_vocabulary("path_to_save_vocabulary.json")
Training a New Tokenizer
You can train the tokenizer from a corpus of molecular sequences (SMILES or SELFIES) using the train method. The trained tokenizer can then be saved and reused with transformers.
# Example corpus of molecular sequences
corpus = ["CCO", "C=O", "CCC", "CCN"]
# Train the tokenizer with the corpus
tokenizer.train(corpus, max_vocab_size=5000, min_freq_for_merge=2000, save_checkpoint=True, checkpoint_path="./checkpoints")
# Save the trained vocabulary to a file
tokenizer.save_vocabulary("trained_vocabulary.json")
Padding and Attention Mask Generation
Compatible with DataCollatorWithPadding class.
Compatibility with Hugging Face transformers
APE Tokenizer is fully compatible with the transformers library. Here are the key methods to integrate APE Tokenizer with your transformer models:
Integration Methods
-
__call__(text, padding=False, max_length=None, add_special_tokens=False, return_tensors=None): Tokenizes the input text and returns the tokenized sequences (compatible with models in thetransformerslibrary). -
train(corpus, max_vocab_size=5000, min_freq_for_merge=2000, save_checkpoint=False, checkpoint_path="checkpoints", checkpoint_interval=500): Trains the tokenizer on a given corpus of molecular sequences. -
save_vocabulary(file_path): Saves the current vocabulary to a JSON file. -
load_vocabulary(file_path): Loads a vocabulary from a JSON file. -
encode(text, padding=False, max_length=None, add_special_tokens=False): Encodes a given text into a sequence of token IDs. -
pad(batch, padding=False, return_tensors=None): Pads a batch of sequences to a common length for model input. -
convert_tokens_to_ids(tokens): Converts a list of tokens to their respective IDs. -
convert_ids_to_tokens(token_ids): Converts a list of token IDs back to tokens.
Example
Here’s a complete example that shows how to train the tokenizer, tokenize a sequence, and use it with a Hugging Face transformer model:
from ape_tokenizer import APETokenizer
from transformers import BertModel
# Initialize the tokenizer and train it
tokenizer = APETokenizer()
corpus = ["CCO", "C=O", "CCC", "CCN"]
tokenizer.train(corpus, max_vocab_size=100)
# Tokenize a SMILES string
smiles = "CCO"
encoded = tokenizer(smiles, add_special_tokens=True, return_tensors="pt")
print("Encoded:", encoded)
# Load a pre-trained BERT model (for demonstration purposes)
model = BertModel.from_pretrained("bert-base-uncased")
# Use the tokenized input in the model
outputs = model(input_ids=encoded['input_ids'], attention_mask=encoded['attention_mask'])
# Model outputs
print(outputs)
License
This project is licensed under the MIT License. See the LICENSE file for more details.
