Tkseem
Arabic Tokenization Library. It provides many tokenization algorithms.
Install / Use
/learn @ARBML/TkseemREADME
<p align="center">
<img src = "https://raw.githubusercontent.com/ARBML/tkseem/master/logo.png" width = "200px"/>
</p>
tkseem (تقسيم) is a tokenization library that encapsulates different approaches for tokenization and preprocessing of Arabic text.
Documentation
Please visit readthedocs for the full documentation.
Installation
pip install tkseem
Usage
Tokenization
<p align="center"> <img src = "https://raw.githubusercontent.com/ARBML/tkseem/master/logo_tokenizers.png"/> </p>import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
tokenizer.tokenize("السلام عليكم")
tokenizer.encode("السلام عليكم")
tokenizer.decode([536, 829])
Caching
tokenizer.tokenize(open('data/raw/train.txt').read(), use_cache = True)
Save and Load
import tkseem as tk
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')
# save the model
tokenizer.save_model('vocab.pl')
# load the model
tokenizer = tk.WordTokenizer()
tokenizer.load_model('vocab.pl')
Model Agnostic
import tkseem as tk
import time
import seaborn as sns
import pandas as pd
def calc_time(fun):
start_time = time.time()
fun().train()
return time.time() - start_time
running_times = {}
running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Char'] = calc_time(tk.CharacterTokenizer)
Notebooks
We show how to use tkseem to train some nlp models.
Citation
@misc{tkseem2020,
author = {Zaid Alyafeai and Maged Saeed},
title = {tkseem: A Tokenization Library for Arabic.},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ARBML/tkseem}}
}
Contribution
This is an open source project where we encourage contributions from the community.
License
MIT license.
