Bling Fire

Introduction

Hi, we are a team at Microsoft called Bling (Beyond Language Understanding), and we help Bing be smarter. Here we want to share our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing, such as tokenization, multi-word expression matching, unknown word guessing, and stemming / lemmatization, to mention just a few.

Bling Fire Tokenizer Overview

Bling Fire Tokenizer provides state-of-the-art performance for natural language text tokenization. Bling Fire supports the following tokenization algorithms:

Pattern-based tokenization
WordPiece tokenization
SentencePiece Unigram LM
SentencePiece BPE
Induced/learned syllabification patterns (identifies possible hyphenation points within a token)

Bling Fire provides a uniform interface for working with all of these algorithms, so it makes no difference to the client whether the tokenizer is for XLNET, BERT, or your own custom model.
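To make one of the listed algorithms concrete, here is a minimal pure-Python sketch of the greedy longest-match-first splitting commonly used for WordPiece. It is an illustration only and not Bling Fire's implementation; the function name `wordpiece_tokenize`, the `##` continuation prefix, the `[UNK]` token, and the toy vocabulary are all assumptions made for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this illustration only.
TOY_VOCAB = {"token", "##ize", "##r", "##s", "fire", "bling"}

print(wordpiece_tokenize("tokenizers", TOY_VOCAB))  # ['token', '##ize', '##r', '##s']
print(wordpiece_tokenize("xyz", TOY_VOCAB))         # ['[UNK]']
```

A real model maps each piece to an integer id from its vocabulary file; the sketch stops at the string pieces to keep the matching logic visible.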

Model files describe the algorithms they are built for and are loaded on demand from an external file. There are also two default models, for NLTK-style tokenization and sentence breaking, which do not need to be loaded. The default tokenization model follows the logic of NLTK, except that hyphenated words are split and a few "errors" are fixed.

Normalization can be added to each model, but is optional.

Differences between the algorithms are summarized here.

The Bling Fire Tokenizer high-level API is designed to require minimal or no configuration, initialization, or additional files, and is friendly to use from languages like Python, Ruby, Rust, C#, JavaScript (via WASM), etc.

We have precompiled some popular models and listed them below with references to their source code:

| File Name | Models it should be used for | Algorithm | Source Code |
|------------|---------------------------------------|----|----|
| wbd.bin | Default Tokenization Model | Pattern-based | src |
| sbd.bin | Default model for Sentence breaking | Pattern-based | src |
| bert_base_tok.bin | BERT Base/Large | WordPiece | src |
| bert_base_cased_tok.bin | BERT Base/Large Cased | WordPiece | src |
| bert_chinese.bin | BERT Chinese | WordPiece | src |
| bert_multi_cased.bin | BERT Multi Lingual Cased | WordPiece | src |
| xlnet.bin | XLNET Tokenization Model | Unigram LM | src |
| xlnet_nonorm.bin | XLNET Tokenization Model w/o normalization | Unigram LM | src |
| bpe_example.bin | A model to test BPE tokenization | BPE | src |
| xlm_roberta_base.bin | XLM Roberta Tokenization | Unigram LM | src |
| laser(100k\|250k\|500k).bin | Trained on balanced by language WikiMatrix corpus of 80+ languages | Unigram LM | src |
| uri(100k\|250k\|500k).bin | URL tokenization model trained on a large set of random URLs from the web | Unigram LM | src |
| gpt2.bin | Byte-BPE tokenization model for GPT-2 | byte BPE | src |
| roberta.bin | Byte-BPE tokenization model for Roberta model | byte BPE | src |
| syllab.bin | Multi lingual model to identify allowed hyphenation points inside a word. | W2H | src |

Oh yes, it is also the fastest! We compared Bling Fire with the tokenizers from Hugging Face: Bling Fire runs 4-5 times faster than Hugging Face Tokenizers (see also the Bing Blog Post). We also compared the Bling Fire Unigram LM and BPE implementations to the same algorithms in the SentencePiece library, and our implementation is ~2x faster (see the XLNET benchmark and the BPE benchmark). Not to mention that our default models are 10x faster than the same functionality from SpaCy (see the benchmark wiki and this Bing Blog Post).

So if low-latency inference is what you need, then you have to try Bling Fire!

Python API Description

If you simply want to use it in Python, you can install the latest release using pip:

pip install -U blingfire

Examples

1. Python example, using default pattern-based tokenizer:

from blingfire import *

text = 'After reading this post, you will know: What "natural language" is and how it is different from other types of data. What makes working with natural language so challenging. [1]'

print(text_to_sentences(text))
print(text_to_words(text))

Expected output:

After reading this post, you will know: What "natural language" is and how it is different from other types of data.
What makes working with natural language so challenging. [1]
After reading this post , you will know : What " natural language " is and how it is different from other types of data . What makes working with natural language so challenging . [ 1 ]
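As the expected output suggests, `text_to_sentences` returns one string with sentences separated by newlines, and `text_to_words` returns one string with tokens separated by spaces. A minimal sketch of turning those strings into Python lists follows; to stay runnable without the package it uses strings copied from the output above (the word string is a shortened excerpt), and the variable names are chosen for this example only.

```python
# Strings copied from the expected output above; in real use they would come
# from text_to_sentences(text) and text_to_words(text).
sentences_str = (
    'After reading this post, you will know: What "natural language" is and '
    "how it is different from other types of data.\n"
    "What makes working with natural language so challenging. [1]"
)
words_str = "What makes working with natural language so challenging . [ 1 ]"

sentences = sentences_str.split("\n")  # one list item per sentence
words = words_str.split(" ")           # one list item per token

print(len(sentences))  # 2
print(words[-3:])      # ['[', '1', ']']
```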

2. Python example, load a custom model for a pattern-based tokenizer:

from blingfire import *

# load a custom model from file
h = load_model("./wbd_chuni.bin")

text = 'This is the Bling-Fire tokenizer. 2007年9月日历表_2007年9月农历阳历一览表-万年历'

# custom model output
print(text_to_words_with_model(h, text))

# default model output
print(text_to_words(text))

free_model(h)

Expected output:

This is the Bling - Fire tokenizer . 2007 年 9 月 日 历 表 _2007 年 9 月 农 历 阳 历 一 览 表 - 万 年 历
This is the Bling - Fire tokenizer . 2007年9月日历表_2007年9月农历阳历一览表 - 万年历

3. Python example, calling BERT BASE tokenizer

On one thread, it works 14x faster than the original BERT tokenizer written in Python. Because this code is written in C++, it can be called from multiple threads without blocking on the global interpreter lock, thus achieving higher speed-ups in batch mode.

import os
import blingfire

s = "Эpple pie. How do I renew my virtual smart card?: /Microsoft IT/ 'virtual' smart card certificates for DirectAccess are valid for one year. In order to get to microsoft.com we need to type pi@1.2.1.2."

# one time load the model (we are using the one that comes with the package)
h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "bert_base_tok.bin"))
print("Model Handle: %s" % h)

# use the model from one or more threads
print(s)
ids = blingfire.text_to_ids(h, s, 128, 100)  # sequence length: 128, oov id: 100
print(ids)                                   # returns a numpy array of length 128 (padded or trimmed)

# free the model at the end
blingfire.free_model(h)
print("Model Freed")

Expected output:

Model Handle: 2854016629088
Эpple pie. How do I renew my virtual smart card?: /Microsoft IT/ 'virtual' smart card certificates for DirectAccess are valid for one year. In order to get to microsoft.com we need to type pi@1.2.1.2.
[ 1208  9397  2571 11345  1012  2129  2079  1045 20687  2026  7484  6047
  4003  1029  1024  1013  7513  2009  1013  1005  7484  1005  6047  4003
 17987  2005  3622  6305  9623  2015  2024  9398  2005  2028  2095  1012
  1999  2344  2000  2131  2000  7513  1012  4012  2057  2342  2000  2828
 14255  1030  1015  1012  1016  1012  1015  1012  1016  1012     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0]
Model Freed
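The batch-mode use mentioned above can be sketched with a standard thread pool. This is a hedged illustration rather than part of the Bling Fire API: `encode_batch` and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in so the sketch runs without a model file.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_batch(encode_fn, texts, max_workers=4):
    """Apply encode_fn to every text on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(encode_fn, texts))


# Hypothetical stand-in encoder: returns the length of each whitespace token.
def fake_encode(text):
    return [len(tok) for tok in text.split()]


texts = ["Apple pie.", "How do I renew my virtual smart card?"]
batch = encode_batch(fake_encode, texts)
print(batch)  # [[5, 4], [3, 2, 1, 5, 2, 7, 5, 5]]
```

With a loaded model handle `h` as in the example above, `encode_fn` could be `lambda s: blingfire.text_to_ids(h, s, 128, 100)`. Any speed-up depends on the native call running outside the global interpreter lock, as the description above states.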

4. Python example to call an XLM-R tokenizer and prepare a PyTorch batch

import os
import torch
from torch.nn.utils.rnn import pad_sequence
from blingfire import load_model, text_to_ids, free_model

# Load the XLM-RoBERTa tokenizer model provided by BlingFire
model_path = os.path.join("./data", 'xlm_roberta_base.bin')
tokenizer_model = load_model(model_path)

if __name__ == "__main__":
    # Sample input text
    input_texts = [
        "+1 (678) 274-9543 US https
