78 skills found
karpathy / Minbpe: Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
niieani / Gpt Tokenizer: The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT models (gpt-5, gpt-o*, gpt-4o, etc.). Port of OpenAI's tiktoken with additional features.
guillaume-be / Rust Tokenizers: Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models.
OpenNMT / Tokenizer: Fast and customizable text tokenization library with BPE and SentencePiece support.
microsoft / Tokenizer: TypeScript and .NET implementation of a BPE tokenizer for OpenAI LLMs.
gautierdag / Bpeasy: Fast bare-bones BPE for modern tokenizer training.
sefineh-ai / Amharic Tokenizer: Syllable-aware BPE tokenizer for the Amharic language (አማርኛ): fast, accurate, trainable.
alfianlosari / GPTEncoder: Swift BPE Encoder/Decoder for OpenAI GPT Models. A programmatic interface for tokenizing text for the OpenAI ChatGPT API.
tryAGI / Tiktoken: High-performance .NET BPE tokenizer, up to 618 MiB/s, competitive with Rust. Zero-allocation counting, multilingual cache, o200k/cl100k/r50k/p50k encodings + HuggingFace tokenizer.json support.
mast-group / OpenVocabCodeNLM: Contains the code for our ICSE 2020 paper, "Big Code != Big Vocabulary: Open-Vocabulary Language Models for Source Code", and for its earlier pre-print, "Maybe Deep Neural Networks are the Best Choice for Modeling Source Code" (https://arxiv.org/abs/1903.05734). This is the first open-vocabulary language model for code that uses the byte pair encoding (BPE) algorithm to learn a segmentation of code tokens into subword units.
samber / Go Gpt 3 Encoder: Go BPE tokenizer (Encoder+Decoder) for GPT-2 and GPT-3.
ml-rust / Splintr: A high-performance tokenizer (BPE + SentencePiece) built with Rust with Python bindings, focused on speed, safety, and resource optimization.
owenliang / Bpe Tokenizer: LLM tokenizer using the BPE algorithm.
Systemcluster / Kitoken: Fast tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.
aallam / Ktoken: Kotlin multiplatform BPE tokenizer library for OpenAI models.
kuprel / Minbpe Pytorch: Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, with PyTorch/CUDA.
gweidart / Rs Bpe: A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust.
ChaitanyaK77 / Building A Small Language Model SLM: This repository provides a Jupyter Notebook for building a small language model from scratch using the TinyStories dataset. Covers data preprocessing, BPE tokenization, binary storage, GPU memory management, and training a Transformer in PyTorch. Generate sample stories to test your model. Ideal for learning NLP and PyTorch.
transitive-bullshit / Compare Tokenizers: A test suite comparing Node.js BPE tokenizers for use with AI models.
youkaichao / Fast Bpe Tokenizer: Fast BPE tokenizer, simple to understand, easy to use.