
Wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

Install / Use

/learn @segment-any-text/Wtpsplit

README

<h1 align="center">wtpsplit🪓</h1> <h3 align="center">Segment any Text - Robustly, Efficiently, Adaptably⚡</h3>

This repository allows you to segment text into sentences or other semantic units. It implements the WtP and SaT models.

The namesake WtP is maintained for consistency. Our new follow-up, SaT, provides robust, efficient, and adaptable sentence segmentation across 85 languages at higher performance and lower compute cost. Check out the state-of-the-art results on 8 distinct corpora and 85 languages demonstrated in our Segment any Text paper.

System Figure

Installation

pip install wtpsplit

Or one of the following for ONNX support:

pip install wtpsplit[onnx-gpu]
pip install wtpsplit[onnx-cpu]

Usage

from wtpsplit import SaT

sat = SaT("sat-3l")
# optionally run on GPU for better performance
# also supports TPUs via e.g. sat.to("xla:0"), in that case pass `pad_last_batch=True` to sat.split
sat.half().to("cuda")

sat.split("This is a test This is another test.")
# returns ["This is a test ", "This is another test."]

# do this instead of calling sat.split on every text individually for much better performance
sat.split(["This is a test This is another test.", "And some more texts..."])
# returns an iterator yielding lists of sentences for every text

# use our '-sm' models for general sentence segmentation tasks
sat_sm = SaT("sat-3l-sm")
sat_sm.half().to("cuda") # optional, see above
sat_sm.split("this is a test this is another test")
# returns ["this is a test ", "this is another test"]

# use trained lora modules for strong adaptation to language & domain/style
sat_adapted = SaT("sat-3l", style_or_domain="ud", language="en")
sat_adapted.half().to("cuda") # optional, see above
sat_adapted.split("This is a test This is another test.")
# returns ['This is a test ', 'This is another test']

ONNX Support

🚀 You can now enable even faster ONNX inference for sat and sat-sm models! 🚀

sat = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

A quick benchmark:

>>> from wtpsplit import SaT
>>> texts = ["This is a sentence. This is another sentence."] * 1000

# PyTorch GPU
>>> model_pytorch = SaT("sat-3l-sm")
>>> model_pytorch.half().to("cuda");
>>> %timeit list(model_pytorch.split(texts))
# 144 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# quite fast already, but...

# onnxruntime GPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(texts))
# 94.9 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# ...this should be ~50% faster! (tested on RTX 3090)

If you wish to use LoRA in combination with an ONNX model:

  • Run scripts/export_to_onnx_sat.py with use_lora: True and an appropriate output_dir: <OUTPUT_DIR>.
    • If you have a local LoRA module, use lora_path.
    • If you wish to load a LoRA module from the HuggingFace hub, use style_or_domain and language.
  • Load the ONNX model with merged LoRA weights: sat = SaT(<OUTPUT_DIR>, ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

Available Models

If you need a general sentence segmentation model, use the -sm models (e.g., sat-3l-sm). For speed-sensitive applications, we recommend the 3-layer models (sat-3l and sat-3l-sm); they provide a great tradeoff between speed and performance. The best models are our 12-layer models: sat-12l and sat-12l-sm.

| Model | English Score | Multilingual Score |
| :----------- | ------------: | -----------------: |
| sat-1l | 88.5 | 84.3 |
| sat-1l-sm | 88.2 | 87.9 |
| sat-3l | 93.7 | 89.2 |
| sat-3l-lora | 96.7 | 94.8 |
| sat-3l-sm | 96.5 | 93.5 |
| sat-6l | 94.1 | 89.7 |
| sat-6l-sm | 96.9 | 95.1 |
| sat-9l | 94.3 | 90.3 |
| sat-12l | 94.0 | 90.4 |
| sat-12l-lora | 97.3 | 95.9 |
| sat-12l-sm | 97.4 | 96.0 |

The scores are the macro-average F1 score across all available datasets for "English", and the macro-average F1 score across all datasets and languages for "Multilingual". "-lora" denotes adaptation via LoRA; check out the paper for details.
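To make the aggregation concrete, here is a quick sketch of a macro-average (an unweighted mean over per-dataset F1 scores; the dataset names and numbers below are made up for illustration):

```python
# Hypothetical per-dataset F1 scores for one model (illustrative only).
scores = {"ud-en": 0.95, "opus-de": 0.90, "ersatz-fr": 0.88}

# Macro-average: unweighted mean across datasets, each dataset counted equally.
macro_f1 = sum(scores.values()) / len(scores)
print(round(macro_f1, 3))  # 0.91
```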

For comparison, here are the English scores of some other tools:

| Model | English Score |
| :------------------------------- | ------------: |
| PySBD | 69.6 |
| SpaCy (sentencizer; monolingual) | 92.9 |
| SpaCy (sentencizer; multilingual) | 91.5 |
| Ersatz | 91.4 |
| Punkt (nltk.sent_tokenize) | 92.2 |
| WtP (3l) | 93.9 |

Note that this library also supports previous WtP models. You can use them in essentially the same way as SaT models:

from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
# similar functionality as for SaT models
wtp.split("This is a test This is another test.")

For more details on WtP and reproduction details, see the WtP doc.

Paragraph Segmentation

Since SaT models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.

# returns a list of paragraphs, each containing a list of sentences
# adjust the paragraph threshold via the `paragraph_threshold` argument.
sat.split(text, do_paragraph_segmentation=True)
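A minimal sketch of consuming that nested return value (assuming it is a plain list of paragraphs, each a list of sentence strings, as the comment above describes):

```python
# Stand-in data shaped like the result of
# sat.split(text, do_paragraph_segmentation=True):
# a list of paragraphs, each a list of sentence strings.
paragraphs = [
    ["This is a test ", "This is another test. "],
    ["A new paragraph starts here."],
]

# Rejoin sentences within each paragraph; separate paragraphs with newlines.
restored = "\n".join("".join(sentences) for sentences in paragraphs)
print(restored)
```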

(NEW! v2.2+) Length-Constrained Segmentation

Control segment lengths with min_length and max_length parameters. This is useful when you need segments within specific size limits (e.g., for embedding models, storage, or downstream processing).

Basic Usage

from wtpsplit import SaT

sat = SaT("sat-3l-sm")

text = (
    "In the beginning God created the heaven and the earth. "
    "And the earth was without form, and void; and darkness was upon the face of the deep. "
    "And the Spirit of God moved upon the face of the waters. "
    "And God said, Let there be light: and there was light. "
    "And God saw the light, that it was good: and God divided the light from the darkness. "
    "And God called the light Day, and the darkness he called Night. "
    "And the evening and the morning were the first day."
)

# Split with a maximum segment length of 120 characters
segments = sat.split(text, max_length=120)
for i, s in enumerate(segments):
    print(f"[{len(s):3d} chars] {s}")
# [ 55 chars] In the beginning God created the heaven and the earth. 
# [ 86 chars] And the earth was without form, and void; and darkness was upon the face of the deep. 
# [112 chars] And the Spirit of God moved upon the face of the waters. And God said, Let there be light: and there was light. 
# [ 86 chars] And God saw the light, that it was good: and God divided the light from the darkness. 
# [115 chars] And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.

assert "".join(segments) == text  # text is perfectly preserved

# Enforce both min and max length
sat.split(text, min_length=80, max_length=200)

# Use the greedy algorithm for minimally faster (but less optimal) results
sat.split(text, max_length=120, algorithm="greedy")
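The `algorithm="greedy"` option can be pictured with a toy sketch (illustrative only, not wtpsplit's internal implementation): greedily pack already-segmented sentences into chunks of at most `max_length` characters.

```python
# Illustrative sketch only -- not wtpsplit's internal algorithm.
# Greedily pack sentence strings into segments of at most max_length chars.
def greedy_pack(sentences, max_length):
    segments, current = [], ""
    for s in sentences:
        # Start a new segment when adding s would exceed the limit.
        # (A single over-long sentence is kept whole here.)
        if current and len(current) + len(s) > max_length:
            segments.append(current)
            current = s
        else:
            current += s
    if current:
        segments.append(current)
    return segments

sentences = [
    "In the beginning God created the heaven and the earth. ",
    "And God said, Let there be light: and there was light. ",
]
print(greedy_pack(sentences, 120))  # one merged segment (110 chars fits)
print(greedy_pack(sentences, 60))   # two segments, one per sentence
```

Note that, like `sat.split`, this sketch preserves the input text exactly: concatenating the segments reproduces the original string.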

Priors for Length Preference

Use priors to influence segment length distribution. Available priors:

| Prior | Best For |
| :---- | :------- |
| "uniform" (default) | Just enforce max_length; let the model decide |
| "gaussian" | Prefer segments around a target length (intuitive) |
| "lognormal" | Right-skewed preference (more tolerant of longer segments) |
| "clipped_polynomial" | Must be very close to target length |

# Gaussian prior (recommended): prefer segments around target_length
sat.split(text, max_length=100, prior_type="gaussian",
          prior_kwargs={"target_length": 50, "spread": 10})

# Log-normal prior: right-skewed (more tolerant of longer segments)
sat.split(text, max_length=100, prior_type="lognormal",
          prior_kwargs={"target_length": 70, "spread": 10})
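To build intuition for the prior shapes in the table above, here is a toy sketch (illustrative only, not the library's internal formulas) comparing a Gaussian and a log-normal weighting of candidate segment lengths; `sigma = spread / target` is a purely illustrative shape parameter:

```python
import math

# Illustrative only -- not wtpsplit's internal prior formulas.
def gaussian_weight(length, target, spread):
    # Symmetric: penalizes lengths above and below target equally.
    return math.exp(-((length - target) ** 2) / (2 * spread ** 2))

def lognormal_weight(length, target, spread):
    # Right-skewed: penalizes lengths below target more than above it.
    sigma = spread / target  # rough shape parameter, purely illustrative
    z = (math.log(length) - math.log(target)) / sigma
    return math.exp(-(z ** 2) / 2)

print(gaussian_weight(50, 50, 10))   # 1.0 at the target
print(lognormal_weight(70, 70, 20))  # 1.0 at the target
# The log-normal tolerates overshooting more than undershooting:
print(lognormal_weight(90, 70, 20) > lognormal_weight(50, 70, 20))  # True
```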