Results for "bpe-tokenizer"

78 skills found · Page 1 of 3

karpathy / Minbpe

10.4k

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

universal

Updated 1h ago

niieani / Gpt Tokenizer

764

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT models (gpt-5, gpt-o*, gpt-4o, etc.). Port of OpenAI's tiktoken with additional features.

universal

bpe · decoder · encoder · +9

Updated 1h ago

guillaume-be / Rust Tokenizers

338

Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE), and Unigram (SentencePiece) models.

universal

deep-learning · rust-lang · tokenizer · +1

Updated 2d ago

OpenNMT / Tokenizer

331

Fast and customizable text tokenization library with BPE and SentencePiece support

universal

bpe · cpp · icu · +7

Updated 8d ago

microsoft / Tokenizer

211

TypeScript and .NET implementation of a BPE tokenizer for OpenAI LLMs.

universal

ai · gpt · llm · +2

Updated 1d ago

gautierdag / Bpeasy

176

Fast bare-bones BPE for modern tokenizer training

universal

bpe · tokenization · tokenizer

Updated 1mo ago

sefineh-ai / Amharic Tokenizer

Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.

universal

african-languages · amharic · amharic-tokenizer · +14

Updated 2mo ago

alfianlosari / GPTEncoder

Swift BPE Encoder/Decoder for OpenAI GPT Models. A programmatic interface for tokenizing text for OpenAI ChatGPT API.

universal

chatgpt · encoder-decoder · gpt · +5

Updated 5d ago

tryAGI / Tiktoken

High-performance .NET BPE tokenizer — up to 618 MiB/s, competitive with Rust. Zero-allocation counting, multilingual cache, o200k/cl100k/r50k/p50k encodings + HuggingFace tokenizer.json support.

universal

ai · bpe · cl100k-base · +11

Updated 1d ago

mast-group / OpenVocabCodeNLM

Contains the code for our ICSE 2020 paper: Big Code != Big Vocabulary: Open-Vocabulary Language Models for Source Code and for its earlier pre-print: Maybe Deep Neural Networks are the Best Choice for Modeling Source Code (https://arxiv.org/abs/1903.05734). This is the first open vocabulary language model for code that uses the byte pair encoding algorithm (BPE) to learn a segmentation of code tokens into subword units.

universal

Updated 9mo ago

samber / Go Gpt 3 Encoder

Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3

codex

bpe · byte-pair-encoding · codex · +9

Updated 5d ago

ml-rust / Splintr

A high-performance tokenizer (BPE + SentencePiece) built with Rust with Python bindings, focused on speed, safety, and resource optimization.

universal

huggingface · llm · machine-learning · +3

Updated 8d ago

owenliang / Bpe Tokenizer

LLM Tokenizer with BPE algorithm

universal

Updated 18d ago

Systemcluster / Kitoken

Fast tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.

universal

bpe · nlp · nodejs · +7

Updated 6d ago

aallam / Ktoken

Kotlin multiplatform BPE tokenizer library for OpenAI models

universal

binary-p · bpe · byte-pair-encoding · +5

Updated 1d ago

kuprel / Minbpe Pytorch

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, with PyTorch/CUDA

universal

Updated 2mo ago

gweidart / Rs Bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

universal

bpe · bpe-tokenizer · byte-pair-encoding · +9

Updated 9d ago

ChaitanyaK77 / Building A Small Language Model SLM

This repository provides a Jupyter Notebook for building a small language model from scratch using the 'TinyStories' dataset. It covers data preprocessing, BPE tokenization, binary storage, GPU memory management, and training a Transformer in PyTorch. Generate sample stories to test your model. Ideal for learning NLP and PyTorch.

universal

gpu-computing · llm · small-language-models · +2

Updated 7h ago

transitive-bullshit / Compare Tokenizers

A test suite comparing Node.js BPE tokenizers for use with AI models.

universal

Updated 10mo ago

youkaichao / Fast Bpe Tokenizer

Fast BPE tokenizer, simple to understand, easy to use

universal

Updated 2mo ago