🧠✂️ SemanticSlicer
Smart, recursive text slicing for LLM-ready documents.
SemanticSlicer is a lightweight C# library that recursively splits text into meaningful chunks, preserving semantic boundaries (sentences, headings, HTML tags), which makes it ideal for embedding generation (OpenAI, Azure OpenAI, LangChain, etc.). It runs on macOS, Linux, or Windows, and can be used from the command line, as a daemon, as a service, or as a REST API. You can also use the library directly by referencing the NuGet package in your code.
GitHub: https://github.com/drittich/SemanticSlicer
Overview
This library accepts text and breaks it into smaller chunks, which is typically useful when creating LLM embeddings.
NuGet Installation
The package name is drittich.SemanticSlicer. You can install it from NuGet via the command line:
dotnet add package drittich.SemanticSlicer
or from the Package Manager Console:
NuGet\Install-Package drittich.SemanticSlicer
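Alternatively, you can add the package reference directly to your project file. A minimal sketch (pin Version to the specific release you want rather than the wildcard shown here):

```xml
<ItemGroup>
  <!-- Floating version shown for illustration; pin a concrete version in practice. -->
  <PackageReference Include="drittich.SemanticSlicer" Version="*" />
</ItemGroup>
```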
Download & Run (no build)
Prebuilt binaries are published under GitHub Releases of this repository: https://github.com/drittich/SemanticSlicer/releases
Choose the asset that matches your platform:
- Windows x64: SemanticSlicer.Cli-win-x64.zip
- macOS Intel: SemanticSlicer.Cli-osx-x64.zip
- macOS Apple Silicon: SemanticSlicer.Cli-osx-arm64.zip
- Linux x64: SemanticSlicer.Cli-linux-x64.zip
After downloading:
Windows:
- Unzip the file.
- Open Command Prompt in the unzipped folder and run:
  SemanticSlicer.Cli.exe MyDocument.txt
- Or pipe input:
  type MyDocument.txt | SemanticSlicer.Cli.exe

macOS (Intel or Apple Silicon):
- Unzip the file.
- In Terminal, mark the binary executable if needed and run:
  chmod +x SemanticSlicer.Cli && ./SemanticSlicer.Cli MyDocument.txt
- Or pipe input:
  cat MyDocument.txt | ./SemanticSlicer.Cli

Linux:
- Unzip the file.
- Run:
  chmod +x SemanticSlicer.Cli && ./SemanticSlicer.Cli MyDocument.txt
- Or pipe input:
  cat MyDocument.txt | ./SemanticSlicer.Cli
Daemon mode (keeps the slicer in memory):
  ./SemanticSlicer.Cli daemon
Named pipe (Linux/macOS):
  ./SemanticSlicer.Cli daemon --pipe slicerpipe
Notes:
- These builds are self-contained; the .NET runtime is not required.
- If your OS flags the binary (macOS Gatekeeper), you may need to allow it in System Settings → Privacy & Security.
CLI Usage
Build the command-line tool:
dotnet publish SemanticSlicer.Cli/SemanticSlicer.Cli.csproj -c Release -o ./cli
Run once
Slice a file and output JSON chunk data:
dotnet ./cli/SemanticSlicer.Cli.dll --overlap 30 MyDocument.txt
You can also pipe text in (omit the overlap flag to use the default 0%):
cat MyDocument.txt | dotnet ./cli/SemanticSlicer.Cli.dll --overlap 20
Use the --overlap flag (0-100) to carry forward that percentage of the previous chunk's tokens, respecting your configured max chunk size.
Daemon mode
Keep a slicer in memory and read lines from stdin (or a named pipe):
dotnet ./cli/SemanticSlicer.Cli.dll daemon --overlap 25
Optionally listen on a named pipe:
dotnet ./cli/SemanticSlicer.Cli.dll daemon --pipe slicerpipe --overlap 25
Service Installation
The repository includes a small Web API (SemanticSlicer.Service) that can be
installed as a background service so the slicer stays in memory.
First publish the service:
dotnet publish SemanticSlicer.Service/SemanticSlicer.Service.csproj -c Release -o ./publish
Linux (systemd)
- Copy the ./publish folder to /opt/semanticslicer (or a location of your choice).
- Create /etc/systemd/system/semanticslicer.service with:
[Unit]
Description=Semantic Slicer Service
After=network.target
[Service]
Type=simple
WorkingDirectory=/opt/semanticslicer
ExecStart=/usr/bin/dotnet /opt/semanticslicer/SemanticSlicer.Service.dll
Restart=always
[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl enable semanticslicer
sudo systemctl start semanticslicer
Windows
- Publish the service to a folder, e.g. C:\SemanticSlicer:
dotnet publish SemanticSlicer.Service/SemanticSlicer.Service.csproj -c Release -o C:\SemanticSlicer
- From an elevated command prompt, install and start the service:
sc create SemanticSlicer binPath= "\"%ProgramFiles%\dotnet\dotnet.exe\" \"C:\\SemanticSlicer\\SemanticSlicer.Service.dll\""
sc start SemanticSlicer
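If you later need to stop or remove the service, the standard Windows sc commands apply (these are built-in sc verbs, nothing specific to SemanticSlicer):

```
sc stop SemanticSlicer
sc delete SemanticSlicer
```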
Once running you can POST text to the service:
curl -X POST http://localhost:5000/slice -H "Content-Type: application/json" \
-d '{"content":"Hello world","overlapPercentage":30}'
overlapPercentage is optional (defaults to 0) and clamped between 0 and 100. Header tokens also count toward the overlap budget.
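The same request can be made from C# with HttpClient. This is a sketch: the /slice endpoint and the content/overlapPercentage fields come from the curl example above, but the shape of the response body is not documented here, so it is simply printed as a string.

```csharp
using System;
using System.Net.Http;
using System.Text;

// Sketch: POST a document to the running service (same payload as the
// curl example above) and print the raw JSON response.
var client = new HttpClient();
var payload = "{\"content\":\"Hello world\",\"overlapPercentage\":30}";
var response = await client.PostAsync(
    "http://localhost:5000/slice",
    new StringContent(payload, Encoding.UTF8, "application/json"));
Console.WriteLine(await response.Content.ReadAsStringAsync());
```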
Sample Usage
Simple text document:
// The default options use text separators, a max chunk size of 1,000 tokens,
// and cl100k_base encoding to count tokens.
var slicer = new Slicer();
var text = File.ReadAllText("MyDocument.txt");
var documentChunks = slicer.GetDocumentChunks(text);
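The returned chunks can then be fed to your embedding pipeline. A minimal sketch, assuming each chunk exposes its text through a property such as Content (the property names here are assumptions; check the library's chunk type for the actual members):

```csharp
foreach (var chunk in documentChunks)
{
    // 'Content' is an assumed property name; consult the chunk type
    // returned by GetDocumentChunks for the actual member names.
    Console.WriteLine(chunk.Content);
}
```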
Markdown document:
// Let's use Markdown separators and reduce the chunk size
var options = new SlicerOptions { MaxChunkTokenCount = 600, Separators = Separators.Markdown };
var slicer = new Slicer(options);
var text = File.ReadAllText("MyDocument.md");
var documentChunks = slicer.GetDocumentChunks(text);
Overlapping chunks:
// Reuse the last 30% of the previous chunk (by tokens), while still respecting the max size
var options = new SlicerOptions { MaxChunkTokenCount = 800, OverlapPercentage = 30 };
var slicer = new Slicer(options);
var documentChunks = slicer.GetDocumentChunks(text);
HTML document:
var options = new SlicerOptions { Separators = Separators.Html };
var slicer = new Slicer(options);
var text = File.ReadAllText("MyDocument.html");
var documentChunks = slicer.GetDocumentChunks(text);
Removing HTML tags:
For any content, you can choose to remove HTML tags from the chunks to minimize the number of tokens. The inner text is preserved, and if there is a <title> tag, the title will be prepended to the result:
// Let's remove the HTML tags as they just consume a lot of tokens without adding much value
var options = new SlicerOptions { Separators = Separators.Html, StripHtml = true };
var slicer = new Slicer(options);
var text = File.ReadAllText("MyDocument.html");
var documentChunks = slicer.GetDocumentChunks(text);
Custom separators:
You can pass in your own list of separators if you wish, e.g., to add support for other document types.
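As a sketch of what that might look like, assuming the built-in separator sets (Separators.Text, Separators.Markdown, Separators.Html) are arrays of the library's separator type, you could combine them as a starting point for your own list (verify the actual element type of SlicerOptions.Separators before adapting this):

```csharp
using System.Linq;

// Sketch: build a custom separator list by combining built-in sets.
// (Assumes Separators.* share a common element type; check the library.)
var custom = Separators.Markdown.Concat(Separators.Html).ToArray();
var options = new SlicerOptions { Separators = custom };
var slicer = new Slicer(options);
```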
Advanced Usage
For advanced scenarios where you need full control over preprocessing, SemanticSlicer provides lower-level APIs:
Split Engine Without Preprocessing
Use SplitDocumentChunksRaw when you want to apply your own preprocessing but still benefit from token-aware splitting, overlap, and indexing:
var slicer = new Slicer();
// Apply your own custom preprocessing
var customProcessed = MyCustomPreprocessing(rawHtml);
// Split using the engine directly (no normalization, HTML stripping, or whitespace collapsing)
var chunks = slicer.SplitDocumentChunksRaw(customProcessed);
Important: SplitDocumentChunksRaw treats content exactly as provided:
- Does NOT normalize line endings
- Does NOT strip HTML (even if StripHtml is true)
- Does NOT collapse whitespace
- Does NOT trim content
- Offsets in returned chunks are relative to the exact content string you provide
Preprocessing Utilities
SemanticSlicer exposes the same preprocessing utilities used internally:
// Normalize line endings (CRLF and CR to LF)
var normalized = TextUtilities.NormalizeLineEndings(input);
// Collapse excessive whitespace (max 2 consecutive spaces or newlines)
var collapsed = TextUtilities.CollapseWhitespace(input);
// Extract text from HTML (already public on Slicer instance)
var slicer = new Slicer();
var plainText = slicer.RemoveNonBodyContent(htmlContent);
Combine these with SplitDocumentChunksRaw for custom pipelines:
var slicer = new Slicer();
var cleaned = TextUtilities.CollapseWhitespace(TextUtilities.NormalizeLineEndings(input));
var chunks = slicer.SplitDocumentChunksRaw(cleaned);