semchunk 🧩
<a href="https://pypi.org/project/semchunk/"><img src="https://img.shields.io/pypi/v/semchunk" alt="PyPI version" /></a> <a href="https://github.com/isaacus-dev/semchunk/actions/workflows/ci.yml"><img src="https://img.shields.io/github/actions/workflow/status/isaacus-dev/semchunk/ci.yml?branch=main" alt="Build status"></a> <a href="https://app.codecov.io/gh/isaacus-dev/semchunk"><img src="https://img.shields.io/codecov/c/github/isaacus-dev/semchunk" alt="Code coverage"></a> <a href="https://pypistats.org/packages/semchunk"><img src="https://img.shields.io/pypi/dm/semchunk" alt="Monthly downloads"></a>
semchunk is a Python library for splitting text into smaller chunks while preserving as much local semantic context as possible.
semchunk supports AI-powered chunking, chunk overlapping, and chunk offsets, and works with any tokenizer or token counter, including those from Tiktoken and Transformers.
Powered by a novel hierarchical chunking algorithm, semchunk delivers 15% better RAG performance than its closest competitors (see Benchmarks 📊).
semchunk is production-ready. It is downloaded millions of times per month and is used in Docling, the Microsoft Intelligence Toolkit, and the Isaacus API.
Setup 📦
semchunk can be installed with `pip` (or `uv`):

```bash
pip install semchunk
```
If you're using AI-powered chunking, you'll also want to install the Isaacus SDK and obtain an Isaacus API key:

```bash
pip install isaacus
```
semchunk is also available on conda-forge:

```bash
conda install conda-forge::semchunk
# or
conda install -c conda-forge semchunk
```
@dominictarro maintains a Rust port of semchunk named semchunk-rs.
Quickstart 👩‍💻
The code snippet below demonstrates how to chunk text with semchunk:

```python
import semchunk
# Transformers and Tiktoken are not dependencies of semchunk;
# they're imported here purely for demonstration purposes.
import tiktoken
from transformers import AutoTokenizer

# A low chunk size is used here for demonstration purposes. Keep in mind that semchunk
# does not know how many special tokens, if any, your tokenizer adds to every input,
# so you may want to deduct the number of special tokens added from your chunk size.
chunk_size = 4

text = 'The quick brown fox jumps over the lazy dog.'

# You can construct a chunker with `semchunk.chunkerify()` by passing the name of an OpenAI model,
# OpenAI Tiktoken encoding or Hugging Face model, or a custom tokenizer that has an `encode()` method
# (like a Tiktoken or Transformers tokenizer) or a custom token counting function that takes a text
# and returns the number of tokens in it.
chunker = semchunk.chunkerify('isaacus/kanon-2-tokenizer', chunk_size) or \
          semchunk.chunkerify('gpt-4', chunk_size) or \
          semchunk.chunkerify('cl100k_base', chunk_size) or \
          semchunk.chunkerify(AutoTokenizer.from_pretrained('isaacus/kanon-2-tokenizer'), chunk_size) or \
          semchunk.chunkerify(tiktoken.encoding_for_model('gpt-4'), chunk_size) or \
          semchunk.chunkerify(lambda text: len(text.split()), chunk_size)

# If you give the resulting chunker a single text, it'll return a list of chunks. If you give it a
# list of texts, it'll return a list of lists of chunks.
assert chunker(text) == ['The quick brown fox', 'jumps over the', 'lazy dog.']
assert chunker([text], progress=True) == [['The quick brown fox', 'jumps over the', 'lazy dog.']]

# If you have a lot of texts and you want to speed things up, you can enable multiprocessing by
# setting `processes` to a number greater than 1.
assert chunker([text], processes=2) == [['The quick brown fox', 'jumps over the', 'lazy dog.']]

# You can also pass an `offsets` argument to return the offsets of chunks, as well as an `overlap`
# argument to overlap chunks by a ratio (if < 1) or an absolute number of tokens (if >= 1).
chunks, offsets = chunker(text, offsets=True, overlap=0.5)
```
To leverage AI-powered chunking, ensure the Isaacus SDK is installed and your `ISAACUS_API_KEY` environment variable is set, then pass the name of an Isaacus enrichment model, such as `kanon-2-enricher`, as the `chunking_model` argument of `chunkerify()` or `chunk()`, like so:
```python
import requests  # For demonstration purposes, we'll use `requests` to download a long document.
import semchunk
from os import environ

# Set your `ISAACUS_API_KEY` environment variable to your Isaacus API key.
environ["ISAACUS_API_KEY"] = "INSERT_YOUR_API_KEY_HERE"

# Download a very long document to chunk.
text = requests.get("https://examples.isaacus.com/dred-scott-v-sandford.txt").text

# Construct a chunker that uses `kanon-2-enricher` for AI-powered chunking.
# NOTE: Because we're using a Hugging Face Transformers tokenizer, the `transformers` library
# is required here; however, you can use any tokenizer or token counter you like.
chunker = semchunk.chunkerify("isaacus/kanon-2-tokenizer", 512, chunking_model="kanon-2-enricher")

# Chunk the document with AI-powered chunking.
chunks = chunker(text)
```
Usage 🕹️
chunkerify()
```python
def chunkerify(
    tokenizer_or_token_counter: str | tiktoken.Encoding | transformers.PreTrainedTokenizer |
                                tokenizers.Tokenizer | Callable[[str], int],
    chunk_size: int | None = None,
    *,
    chunking_model: str | None = None,
    isaacus_client: isaacus.Isaacus | None = None,
    tokenizer_kwargs: dict | None = None,
    max_token_chars: int | None = None,
    memoize: bool = True,
    cache_maxsize: int | None = None,
) -> Callable[
    [str | ILGSDocument | Sequence[str | ILGSDocument], int, bool, bool, int | float | None],
    list[str] | tuple[list[str], list[tuple[int, int]]] | list[list[str]] | tuple[list[list[str]], list[list[tuple[int, int]]]],
]
```
chunkerify() constructs a chunker that splits one or more texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.
tokenizer_or_token_counter is either: the name of a Tiktoken or Transformers tokenizer (with priority given to the former); a tokenizer with an encode() method (e.g., a Tiktoken or Transformers tokenizer); or a token counter that returns the number of tokens in an input.
chunk_size is the maximum number of tokens a chunk may contain. It defaults to None, in which case it will be set to the same value as the tokenizer's model_max_length attribute (minus the number of tokens returned by attempting to tokenize an empty string) if possible, otherwise a ValueError will be raised.
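The fallback described above can be sketched in a few lines. This is a minimal illustration of the documented behavior, not semchunk's actual implementation; `MockTokenizer` and `infer_chunk_size` are hypothetical names used for demonstration only:

```python
class MockTokenizer:
    """A stand-in for a Transformers-style tokenizer with a `model_max_length`
    attribute and an `encode()` method that adds two special tokens."""
    model_max_length = 512

    def encode(self, text: str) -> list[int]:
        # Pretend the tokenizer wraps every input (even the empty string)
        # in two special tokens, e.g. [CLS] and [SEP].
        return [101] + [len(word) for word in text.split()] + [102]

def infer_chunk_size(tokenizer) -> int:
    """Mirror the documented fallback: the tokenizer's maximum input length
    minus the number of special tokens it adds to an empty string."""
    max_length = getattr(tokenizer, 'model_max_length', None)
    if max_length is None:
        raise ValueError('chunk_size is required when it cannot be inferred from the tokenizer')
    special_tokens = len(tokenizer.encode(''))
    return max_length - special_tokens

print(infer_chunk_size(MockTokenizer()))  # 510 (512 minus 2 special tokens)
```

Deducting the special tokens keeps chunks from exceeding the model's true input budget once those tokens are added back at encoding time.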
chunking_model is the name of the Isaacus enrichment model to use for AI chunking. This argument defaults to None, in which case, unless you provide an Isaacus Legal Graph Schema (ILGS) Document as input, AI chunking will be disabled.
isaacus_client is an instance of the isaacus.Isaacus API client to use for AI chunking instead of a client constructed with default parameters. This argument defaults to None, in which case a client will be constructed with default parameters if AI chunking is enabled.
tokenizer_kwargs is an optional dictionary of keyword arguments to be passed to the tokenizer or token counter whenever it is called. This can be used to disable the current default behavior of treating any encountered special tokens as if they are normal text when using a Tiktoken or Transformers tokenizer. This argument defaults to None, in which case no additional keyword arguments will be passed to the tokenizer or token counter.
max_token_chars is the maximum number of characters a token may contain. It is used to significantly speed up the token counting of long inputs. It defaults to None in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary as determined by the token_byte_values or get_vocab methods.
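The vocabulary-based default can be illustrated with a toy vocabulary. `longest_token_chars` is a hypothetical helper sketching the idea, assuming a `get_vocab()`-style mapping of token strings to IDs:

```python
def longest_token_chars(vocab: dict[str, int]) -> int:
    """Length in characters of the longest token in the vocabulary, which is
    the kind of value max_token_chars would default to when it can be inferred."""
    return max(len(token) for token in vocab)

# A toy vocabulary standing in for a real tokenizer's get_vocab() output.
toy_vocab = {'the': 0, 'quick': 1, 'jumping': 2, '##ly': 3}
print(longest_token_chars(toy_vocab))  # 7, for 'jumping'
```

Knowing the longest possible token lets a token counter bound how many characters of a long input it actually needs to tokenize, which is where the speedup comes from.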
memoize flags whether to memoize the token counter. It defaults to True.
cache_maxsize is the maximum number of text-token count pairs that can be stored in the token counter's cache. It defaults to None, which makes the cache unbounded. This argument is only used if memoize is True.
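The effect of memoization can be sketched with `functools.lru_cache`, which is analogous to (though not necessarily identical to) what `memoize=True` with a `cache_maxsize` does for the token counter:

```python
from functools import lru_cache

calls = 0

def word_count(text: str) -> int:
    """A trivial token counter that records how often it actually runs."""
    global calls
    calls += 1
    return len(text.split())

# A bounded memoized wrapper; with cache_maxsize=None the cache would be unbounded.
memoized_count = lru_cache(maxsize=1024)(word_count)

memoized_count('the quick brown fox')
memoized_count('the quick brown fox')  # served from the cache; word_count does not run again
print(calls)  # 1
```

Because chunking repeatedly counts tokens for overlapping candidate splits, caching identical inputs avoids redundant tokenizer calls.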
This function returns a chunker that takes either a single input or a sequence of inputs. Depending on whether one or multiple inputs were provided, it returns a list, or a list of lists, of chunks up to chunk_size tokens long, with the whitespace used to split the input removed. If the chunker's optional offsets argument is True, it also returns a list (or list of lists) of tuples of the form (start, end), where start is the index of the first character of a chunk in its input and end is the index of the character immediately succeeding the last character of the chunk, such that chunks[i] == text[offsets[i][0]:offsets[i][1]].
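The offset invariant can be checked directly. The chunks and offsets below are hand-written illustrative values of the documented shape (real values come from calling a chunker with offsets=True):

```python
text = 'The quick brown fox jumps over the lazy dog.'

# Hypothetical chunks and (start, end) offsets of the documented shape.
chunks = ['The quick brown fox', 'jumps over the', 'lazy dog.']
offsets = [(0, 19), (20, 34), (35, 44)]

# The documented invariant: each chunk is exactly the slice of the input
# text delimited by its offsets.
for chunk, (start, end) in zip(chunks, offsets):
    assert chunk == text[start:end]
```

Note that the whitespace on which the text was split falls into the gaps between consecutive offset ranges, which is why the chunks themselves carry no splitting whitespace.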
The resulting chunker can be passed a processes argument that specifies the number of processes to be used when chunking multiple inputs.
It is also possible to pass a progress argument which, if set to True and multiple inputs are passed, will display a progress bar.
As described above, the offsets argument, if set to True, will cause the chunker to return the start and end offsets of each chunk.
The chunker accepts an overlap argument that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap. It defaults to None, in which case no overlapping occurs.
chunk()
def chun