<img src="./doc/logo.png">

TkTkT: the ToKeniser ToolKiT

A collection of Pythonic subword tokenisers and text preprocessing tools, with full backwards- and forwards-compatibility with HuggingFace tokenizers. One package to rule them all.

Quick navigation:

  • <a href="#installation">Installation</a>
  • <a href="#features">Features</a>
  • <a href="#examples">Examples</a>
  • <a href="#architecture">Architecture</a>

Features

Supported tokenisers

All subword tokenisers are defined under tktkt.models. Many of these can be instantiated without much background knowledge using the factory classes in tktkt.factories. Also, any HuggingFace tokeniser can be wrapped into a TkTkT tokeniser, and any TkTkT tokeniser can be wrapped into a HuggingFace tokeniser.
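The wrapping in both directions follows the adapter pattern. As a rough illustration (all class and method names below are hypothetical, NOT TkTkT's actual API), a wrapper only has to translate between the two tokenising interfaces:

```python
# Hypothetical sketch of the adapter idea behind tokeniser wrapping.
# Names are illustrative, not TkTkT's real classes.
from typing import List, Protocol

class Tokeniser(Protocol):
    """Minimal interface a TkTkT-style tokeniser might expose."""
    def prepareAndTokenise(self, text: str) -> List[str]: ...

class HuggingFaceWrapper:
    """Adapts any object with a HuggingFace-style .tokenize() method."""
    def __init__(self, hf_tokeniser):
        self.hf = hf_tokeniser

    def prepareAndTokenise(self, text: str) -> List[str]:
        return self.hf.tokenize(text)

# Toy stand-in for a HuggingFace tokeniser, to keep the sketch self-contained:
class WhitespaceTokenizer:
    def tokenize(self, text: str) -> List[str]:
        return text.split()

wrapped: Tokeniser = HuggingFaceWrapper(WhitespaceTokenizer())
print(wrapped.prepareAndTokenise("one package to rule them all"))
# → ['one', 'package', 'to', 'rule', 'them', 'all']
```

The reverse wrapper performs the opposite translation, which is what keeps the two ecosystems interoperable.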

Currently, the package implements:

  • Byte-pair encoding (BPE) tokenisers:
    • Classical BPE (Sennrich et al., 2016), with added support for any word boundary marker (Ġ, _, </w>, ...) and n-ary merges (byte-tuple encoding, BTE).
    • BPE-dropout (Provilkov et al., 2020)
    • BPE-knockout (Bauwens & Delobelle, 2024)
    • PickyBPE (Chizhov et al., 2024)
    • ScaffoldBPE (Lian et al., 2025)
    • TrimmedBPE (Cognetta et al., 2024)
    • Other experimental variants I implemented just for fun:
      • BPE-breakdown: BPE which starts randomly undoing merges after it finishes deterministically, similar to StochasTok.
      • Non-geometric BPE-dropout: BPE-dropout, but rather than picking merges geometrically, picks them uniformly.
      • EnsuredBPE: BPE where the last merges have been replaced by the merges necessary to ensure that a given list of strings is in the vocabulary.
      • ShuffledBPE: BPE with merge priorities shuffled, although a type is never shuffled to a priority earlier than the ancestors in its merge tree.
  • Unigram language model (ULM), dubbed KudoPiece in TkTkT (Kudo, 2018):
    • Wrapper around the SentencePiece package, or
    • Native implementation in TkTkT
  • Greedy tokenisers:
    • MaxMatch (Hiraoka, 2022), a.k.a. left-to-right greedy tokenisation, and also right-to-left (Bauwens, 2023 and later Uzan et al., 2024)
    • FLOTA (Hofmann et al., 2022), i.e. random-access longest-first tokenisation.
    • Other experimental variants:
      • Last-BPE-first: random-access youngest-first tokenisation (specifically for BPE vocabularies).
      • Left-to-right-to-left greedy: L2R2L_Greedy
  • Subword regularisers:
  • SaGe (Yehezkel & Pinter, 2023) vocabularisation.
  • Derivative leverager (DeL) (Hofmann et al., 2021), both training and segmentation.
  • Other, less interesting tokenisers:
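
To make the core of the BPE family above concrete, here is a minimal, self-contained sketch of classical merge learning in the spirit of Sennrich et al. (2016). It is an illustration of the algorithm only, not TkTkT's implementation:

```python
from collections import Counter
from typing import Dict, List, Tuple

def learn_bpe(word_counts: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn BPE merges from word frequencies."""
    # Each word starts as its characters plus an end-of-word marker.
    corpus = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere in the corpus.
        merged, new_corpus = best[0] + best[1], {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + count
        corpus = new_corpus
    return merges

print(learn_bpe({"low": 5, "lower": 2, "lowest": 2}, 3))
# → [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
```

Variants like BPE-dropout or BPE-breakdown differ in how merges are *applied* at inference time, not in this training loop.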

Currently work in progress:

Multiplexing

TkTkT is the only package that supports multiplexing multiple tokenisers into one big tokeniser that alternates between them. There are multiplexers that do this deterministically (e.g. choosing the tokeniser that compresses the input the most) or stochastically (e.g. choosing uniformly among a set of tokenisers).
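
The two flavours of multiplexing can be sketched generically as follows (plain functions for illustration, not TkTkT's multiplexer classes):

```python
import random
from typing import Callable, List, Sequence

Tokenise = Callable[[str], List[str]]

def deterministic_multiplexer(tokenisers: Sequence[Tokenise]) -> Tokenise:
    """Always pick the tokeniser that compresses the input most (fewest tokens)."""
    def tokenise(text: str) -> List[str]:
        return min((t(text) for t in tokenisers), key=len)
    return tokenise

def stochastic_multiplexer(tokenisers: Sequence[Tokenise], rng=random) -> Tokenise:
    """Pick one of the tokenisers uniformly at random on each call."""
    def tokenise(text: str) -> List[str]:
        return rng.choice(list(tokenisers))(text)
    return tokenise

characters = lambda s: list(s)       # character-level tokeniser
whitespace = lambda s: s.split()     # whitespace tokeniser
mux = deterministic_multiplexer([characters, whitespace])
print(mux("hello world"))  # → ['hello', 'world'] (fewest tokens wins)
```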

Evaluation metrics

TkTkT's evaluation framework aims to do as little work as possible. It can dispatch tokens produced by a tokeniser to as many metrics as you need at once, and caches everything it can so you don't have to compute any metric twice. See here for an example.

TkTkT currently supports the following intrinsic tokeniser evaluation metrics:

  • Fertility statistics: how many tokens the tokeniser produces per word, and how many segmentations its vocabulary could produce in theory.
  • Morphological boundary recognition: using the tokeniser as a binary classifier for whether two morphemes meet at each position in a word.
  • Information-theoretic measures, including Rényi entropy and Rényi efficiency.
  • Window-based metrics like MATTR.
  • Bigram metrics to quantify the richness of token contexts, like accessor variety.
  • Comparisons between two tokenisers: how much they tokenise words exactly the same, and how much their split points overlap.
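
Two of these metrics are simple enough to sketch generically. The functions below are illustrative, not TkTkT's implementations; the efficiency normalisation follows the usual definition of Rényi entropy divided by its maximum, log |V|, for α ≠ 1:

```python
import math
from collections import Counter
from typing import List

def fertility(tokenised_words: List[List[str]]) -> float:
    """Average number of tokens the tokeniser produces per word."""
    return sum(len(tokens) for tokens in tokenised_words) / len(tokenised_words)

def renyi_efficiency(tokens: List[str], alpha: float = 2.5) -> float:
    """Rényi entropy of the token unigram distribution (alpha != 1),
    normalised by its maximum, log |V|, giving a value in [0, 1]."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = math.log(sum(p ** alpha for p in probs)) / (1 - alpha)
    return entropy / math.log(len(counts))

print(fertility([["_un", "believ", "able"], ["_the"]]))  # → 2.0
```

A uniform token distribution gives an efficiency of exactly 1, since the Rényi entropy then attains its maximum.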

Security

It should be impossible for users to jailbreak a language model by forcing its tokeniser to produce a special token (e.g. <|endoftext|>, system prompt delimiters, ...). The reason other packages do not have this guarantee is that they represent special tokens as strings and then give the tokeniser access to these strings. In TkTkT, subword vocabularies are objects that hide special tokens from their tokeniser. In fact, special tokens are defined as integers, not as strings.

When loading a tokeniser trained in another package that probably inserted specials into the vocabulary, TkTkT explicitly requires the user to declare which strings are actually specials, and hides them from the tokeniser.
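
The design can be sketched as follows (illustrative names; the real Vocab class is richer). The key point is that no string in the subword vocabulary ever maps to a special ID, so no user text can produce one:

```python
from typing import Dict, List

class Vocab:
    """Sketch of a vocabulary that hides special tokens from the tokeniser.
    Specials are integers only; they have no string in the subword vocab."""
    def __init__(self, subwords: List[str], num_specials: int):
        # Specials occupy the first IDs; subwords come after them.
        self._specials = list(range(num_specials))
        self._subword_to_id: Dict[str, int] = {
            s: num_specials + i for i, s in enumerate(subwords)
        }

    def id_of(self, subword: str) -> int:
        # The tokeniser can only reach subword IDs: no string maps to a special.
        return self._subword_to_id[subword]

    @property
    def specials(self) -> List[int]:
        return self._specials

vocab = Vocab(["_un", "believ", "able"], num_specials=2)
print(vocab.id_of("believ"))  # → 3
```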

User-friendliness

Caching

When a tokeniser finishes training, you shouldn't be forced to mess with file paths to connect your training and testing scripts together. In TkTkT, training code caches its results. When you rerun it, it will load its results from disk and skip the waiting time.
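
The principle can be sketched as a disk cache keyed by a hash of the training configuration (an illustration of the idea, not TkTkT's actual caching machinery):

```python
import hashlib
import json
import pickle
from pathlib import Path

def cached_training(cache_dir: Path, config: dict, train):
    """Run train(config) once; later calls with an identical config load
    the pickled result from disk instead of retraining."""
    key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
    path = cache_dir / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = train(config)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(result))
    return result
```

Because the key is derived from the configuration itself, the training and testing scripts agree on where the results live without any file path ever appearing in user code.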

Type-checking

I really f*cking passionately hate it when my IDE cannot perform autocompletion because of poor type annotations or design. People have been suffering under the idiocy of AutoTokenizer for too long. In TkTkT, everything is as type-annotated as possible, which means:

  • There are no checkpoint strings in TkTkT. Instead, there are Artifacts objects, which declare not only how to get the results of training, but also which Preprocessor was used, so that complex preprocessing objects are simply known and never have to be stringified.
  • When loading a tokeniser using a TokeniserFactory and Artifacts (TkTkT's equivalent of AutoTokenizer and a checkpoint string), the exact type of the tokeniser is known.
  • The special tokens in a vocabulary appear in autocompletion. For example, for BERT's BPE tokeniser, tokeniser.vocab.specials. will show CLS, SEP, PAD, MASK in your IDE.
  • After training a tokeniser, the results on disk are already parsed into an object for you. When you train a BPE tokeniser, what comes out isn't a dumb file path; it's an Artifacts object with a .getVocabulary() and a .getMerges() method.
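
For illustration, such a result object could look roughly like the sketch below. The accessor names mirror the methods mentioned above, but the class itself and its fields are hypothetical, not TkTkT's real Artifacts type:

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple

@dataclass
class BpeArtifacts:
    """Hypothetical result object: typed accessors instead of a bare path."""
    vocab_file: Path
    merges_file: Path

    def getVocabulary(self) -> Dict[str, int]:
        return json.loads(self.vocab_file.read_text())

    def getMerges(self) -> List[Tuple[str, str]]:
        # Each line of the merges file is assumed to be "left right".
        return [tuple(line.split())
                for line in self.merges_file.read_text().splitlines()]
```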

Preprocessing

TkTkT has a rich set of text mappings and pretokenisers that preprocess text before it is tokenised, including support for stochastic perturbation. Unlike in other libraries, preprocessors are objects, not regular expressions. This allows much more powerful processing than regex, whilst being easier to read. See if you can understand this arguably complicated transformation:

from tktkt.preparation.splitters import *
from tktkt.preparation.mappers import PseudoByteMapping
from tktkt.factories.preprocessors import RobertaSpaceMarker


class ExamplePretokeniser(PretokeniserSequence):
    def __init__(self):
        super().__init__([
            IsolatePunctuation(HyphenMode.EXCLUDED, protect_apostrophes_without_spaces=True),
            OnWhitespace(destructive=True),
            IsolateEnglishContractions(do_nt=True),

            MapperAsPretokeniser(PseudoByteMapping()),
            AddWordBoundary(RobertaSpaceMarker),

            IsolateDigits(),
            IsolatePunctuation(HyphenMode.ONLY)
        ])

TkTkT also comes with language-specific pretokenisation like Japanese word segmentation and Thai word segmentation.

Visualisers

The following tokenisation procedures can be visualised:

  • BPE/BTE: the final merge tree (in regular LaTeX), as well as an animated progression of the merges (in LaTeX Beamer).

Opinionated

Apart from the type-checking and caching described above, TkTkT enforces several truths about tokenisation which are not present in other packages:

  • Tokenisers produce string segments (tokens), not integer identifiers (IDs). There is no "BPE way" of mapping ["_un", "believ", "able"] to integers. That should be done by a separate object, the Vocab.
  • Preprocessors should not (only) be regular expressions. They should be chains of Python code.
  • Spaces should not be treated as word boundaries. If a word does not have a "prefix space", e.g. because it is the start of the sentence or because it is preceded by a punctuation mark, it should still receive a boundary character, and the developer should be able to decide if that boundary comes at the start or the end of the word.
  • Special tokens (CLS, SEP, BOS, ...) should not have a string representation ("[CLS]", "[SEP]", "<s>", ...), or at least they should not be in the vocabulary, or at least they
