# Chunky
Convert and validate your Markdown, then choose the best chunking strategy for your RAG pipeline.
## Install / Use

```
/learn @GiovanniPasq/ChunkyREADME
```
## Why Chunky?
Most RAG pipelines fail silently — not because of bad chunking, but because of bad Markdown. When PDFs are converted, tables collapse, layouts scramble, and artifacts bleed into your text. You never see it. You just get hallucinations downstream. Chunky is a local, open-source tool that gives you full visibility at both stages — validate your Markdown, validate your chunks, fix what's wrong before it reaches your vector store.
As NVIDIA's research shows, no single chunking strategy wins universally. Chunking is not a set-and-forget parameter — yet most tools give you zero visibility into what your chunks actually look like. That's the gap Chunky fills.
<p align="center"> <img src="assets/pipeline.svg" width="700"> </p>

> 🚧 Chunky is in early development and actively evolving. Bugs may exist — if you find one, please open an issue.
New to RAG? Check out Agentic RAG for Dummies — a hands-on implementation of Agentic RAG.
## Features
| | |
|---|---|
| 📄 Side-by-side viewer | PDF and Markdown side-by-side with synchronized scrolling |
| ✨ Five PDF → Markdown converters | PyMuPDF, Docling, MarkItDown, LiteParse, VLM — switch on the fly without losing your settings |
| 🔄 Re-convert on the fly | Switch converter and regenerate without restarting the pipeline |
| 📦 Bulk PDF conversion | Convert multiple PDFs to Markdown in a single batch operation |
| ✂️ 12 chunking strategies | LangChain (4 strategies) and Chonkie (8 strategies) |
| 📚 Bulk chunking | Chunk multiple Markdown files at once with the same configuration |
| 🎨 Color-coded chunk visualization | See every chunk numbered and color-coded — edit any of them directly |
| 🧠 Markdown enrichment (beta) | Clean conversion artifacts before chunking |
| ✨ Chunk enrichment (beta) | LLM-generated titles, summaries, keywords, and questions per chunk |
| 🔌 Pluggable architecture | Add a converter or splitter in minutes — zero frontend changes |
| 💾 Export | Timestamped JSON chunks, ready for your vector store |
## Getting started
Two ways to run Chunky: locally or with Docker.
### Option 1 — Local

```shell
git clone https://github.com/GiovanniPasq/chunky.git
cd chunky
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
npm install -g @llamaindex/liteparse  # optional — only needed for the LiteParse converter
./start_all.sh
```
### Option 2 — Docker

```shell
git clone https://github.com/GiovanniPasq/chunky.git
cd chunky
docker compose up --build
```
| Service | URL |
|----------|----------------------------|
| Frontend | http://localhost:5173 |
| Backend | http://localhost:8000 |
| Swagger | http://localhost:8000/docs |
## PDF → Markdown Converters
No single converter wins on every document type. Chunky ships with five — switch between them in the UI and re-convert without losing your settings.
| Converter | Library | Best for |
|-----------|---------|----------|
| PyMuPDF | pymupdf4llm | Fast conversion of standard digital PDFs with selectable text |
| Docling | docling | Complex layouts: multi-column documents, tables, and figures |
| MarkItDown | markitdown[all] | Broad-format documents, simple and deterministic output |
| LiteParse | liteparse | Fast, lightweight parsing by LlamaIndex — good for standard documents |
| VLM | openai + any vision model | Scanned PDFs, handwriting, diagrams — anything a human can read |
Note: The LiteParse converter requires Node.js and its CLI, installed separately:

```shell
npm install -g @llamaindex/liteparse
```
### VLM converter
The VLM converter rasterises each page at 300 DPI and sends it to any OpenAI-compatible vision model. By default it targets a locally running Ollama instance — no API key, no internet access required.
```python
# Default — Ollama (local, no API key needed)
VLMConverter()

# Different local model
VLMConverter(model="minicpm-v")

# OpenAI
VLMConverter(model="gpt-4o", base_url="https://api.openai.com/v1", api_key="sk-...")

# Google Gemini
VLMConverter(
    model="gemini-2.5-flash",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="AIza...",
)
```
VLM conversions report per-page progress, which the frontend polls via the `GET /api/convert-progress/…` endpoint.
Note: Conversion speed with Docling or a locally running Ollama instance depends heavily on available hardware. On CPU-only machines, both can be significantly slower than on systems with a dedicated GPU.
Ollama configuration: when using a local Ollama instance, the most relevant environment variables are `OLLAMA_NUM_PARALLEL`, `OLLAMA_MAX_LOADED_MODELS`, `OLLAMA_KEEP_ALIVE`, and `OLLAMA_MAX_QUEUE`. See the Ollama FAQ for setup instructions.
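A minimal shell configuration for these variables might look like this — the values are illustrative starting points, not recommendations:

```shell
# Allow two requests to the same loaded model to run concurrently
export OLLAMA_NUM_PARALLEL=2
# Keep at most one model resident in memory at a time
export OLLAMA_MAX_LOADED_MODELS=1
# Keep the model loaded for 10 minutes after the last request
export OLLAMA_KEEP_ALIVE=10m
# Reject new requests once 128 are already queued
export OLLAMA_MAX_QUEUE=128

ollama serve
```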
## Chunking Strategies
Chunky supports two splitting libraries, each exposing multiple strategies. The library and strategy are selected independently in the UI.
### LangChain (`langchain-text-splitters`)
| Strategy | Description |
|----------|-------------|
| Token | Splits on token boundaries via tiktoken. Ideal for LLM context-window management. |
| Recursive | Tries paragraph → sentence → word boundaries in order. |
| Character | Splits on \n\n paragraphs, falls back to chunk_size characters. |
| Markdown | Two-phase split: H1/H2/H3 headers first, then optional size cap via RecursiveCharacterTextSplitter. |
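The Recursive strategy's fallback order (paragraph → sentence → word) can be illustrated with a stdlib-only sketch — this is a simplified illustration of the idea, not LangChain's actual implementation:

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ")):
    """Split by the coarsest separator first (paragraphs), falling back to
    sentences, then words, then a hard character cut."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator not present, try a finer one
        chunks, buf = [], ""
        for part in parts:
            if len(part) > chunk_size:
                if buf:
                    chunks.append(buf)
                    buf = ""
                chunks.extend(recursive_split(part, chunk_size, separators))
            elif not buf:
                buf = part
            elif len(buf) + len(sep) + len(part) <= chunk_size:
                buf += sep + part  # greedily pack pieces into one chunk
            else:
                chunks.append(buf)
                buf = part
        if buf:
            chunks.append(buf)
        return chunks
    # No separator helped: hard cut at chunk_size characters
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

For example, `recursive_split("aaa bbb. ccc ddd.\n\neee fff", 10)` first splits on the paragraph break, then falls back to sentence boundaries for the paragraph that is still too long.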
### Chonkie
| Strategy | Description |
|----------|-------------|
| Token | Splits on token boundaries. Fast, no external tokeniser needed. |
| Fast | SIMD-accelerated byte-based chunking at 100+ GB/s. Best for high-throughput pipelines. |
| Sentence | Splits at sentence boundaries. Preserves semantic completeness. |
| Recursive | Recursively splits using structural delimiters (paragraphs → sentences → words). Note: chunk_overlap is not supported. |
| Table | Splits large Markdown tables by row while preserving headers. Ideal for tabular data. |
| Code | Splits source code using AST-based structural analysis. Supports multiple languages. |
| Semantic | Groups content by embedding similarity. Best for preserving topical coherence. |
| Neural | Uses a fine-tuned BERT model to detect semantic shifts. Great for topic-coherent chunks. |
Note: The Semantic and Neural strategies download ML models on first use and may be slow to initialise.
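To make the Semantic idea concrete, here is a stdlib-only toy: consecutive sentences stay in the same chunk while the cosine similarity of their embeddings stays above a threshold. Real semantic chunkers compute embeddings with a model; the hand-written vectors below are purely illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embeddings, threshold=0.8):
    """Start a new chunk whenever similarity to the previous sentence drops."""
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return [" ".join(c) for c in chunks]
```

With toy vectors, the two cat sentences group together and the topic shift starts a new chunk:

```python
semantic_chunks(
    ["Cats purr.", "Cats meow.", "Stocks fell."],
    [[1, 0], [0.9, 0.1], [0, 1]],
)
```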
## Enrichment (beta)
⚠️ Enrichment features are currently in beta and may change in future releases.
Chunky includes an LLM-powered enrichment layer that operates at two levels of the pipeline.
### Markdown enrichment
Before chunking, you can run enrichment directly on the converted Markdown. This step cleans up residual conversion artifacts — noise, formatting inconsistencies, extraction errors — producing a polished document that leads to cleaner, more coherent chunks downstream.
### Chunk enrichment
After chunking, each chunk can be enriched independently via an LLM call. The pipeline populates the following fields:
| Field | Description |
|-------|-------------|
| cleaned_chunk | Cleaned and normalized version of the original text |
| title | Short descriptive title for the chunk |
| context | One sentence describing where the chunk fits within the broader document |
| summary | One sentence summary of the chunk content |
| keywords | Array of relevant keyword strings |
| questions | Array of questions this chunk could answer |
The context field is inspired by Anthropic's Contextual Retrieval technique, which shows that prepending a short chunk-specific context can reduce retrieval failure rates by up to 49%.
The questions field addresses a complementary problem: pre-generating the questions a chunk can answer produces embeddings much closer to real user queries at retrieval time, as highlighted in the Microsoft Azure RAG enrichment guide.
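Putting the fields together, an enriched chunk looks roughly like this — the values are illustrative and the exact export schema may differ:

```python
import json

# Illustrative enriched chunk with the fields from the table above
enriched_chunk = {
    "cleaned_chunk": "Chunky ships with five PDF-to-Markdown converters...",
    "title": "Converter overview",
    "context": "This chunk sits in the section comparing PDF converters.",
    "summary": "Lists the available converters and when to use each.",
    "keywords": ["converters", "PDF", "Markdown"],
    "questions": ["Which converter handles scanned PDFs?"],
}

print(json.dumps(enriched_chunk, indent=2))
```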
## Extending Chunky
The converter and splitter layers use a decorator-based registry: adding a new converter or splitter automatically exposes it through the /api/capabilities endpoint and the UI — no frontend changes needed.
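The registry pattern boils down to a class decorator that records each implementation in a dict the API can introspect. A stdlib-only sketch — names like `register_converter` are illustrative, not Chunky's actual internals:

```python
# Global registry mapping converter names to classes
CONVERTERS: dict[str, type] = {}

def register_converter(name: str):
    """Class decorator: record the converter under `name` at import time."""
    def wrap(cls):
        CONVERTERS[name] = cls
        return cls
    return wrap

@register_converter("dummy")
class DummyConverter:
    def convert(self, pdf_path) -> str:
        return "# Dummy\n"

def capabilities() -> list[str]:
    """What a capabilities endpoint would report: every registered name."""
    return sorted(CONVERTERS)
```

Because registration happens at import time, adding a new decorated class is enough for it to show up in `capabilities()` — no other layer needs to change.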
### Adding a new converter
Every converter inherits from PDFConverter (backend/converters/base.py):
from abc import ABC, abstractmethod
from pathlib import Path
class PDFConverter(ABC):
@abstractmethod
def convert(self, pdf_path: Path) -> str:
"""Convert a PDF to a Markdown string."""
def validate_path(self, pdf_path: Path) -> Non
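For illustration, a minimal subclass might look like this. It is a self-contained sketch (the abstract base is repeated and the conversion is stubbed), not a real Chunky plugin:

```python
from abc import ABC, abstractmethod
from pathlib import Path

class PDFConverter(ABC):
    @abstractmethod
    def convert(self, pdf_path: Path) -> str:
        """Convert a PDF to a Markdown string."""

class StubConverter(PDFConverter):
    """Toy converter: emits a Markdown heading derived from the filename."""
    def convert(self, pdf_path: Path) -> str:
        return f"# {pdf_path.stem}\n\n(page content would go here)\n"
```

A real implementation would call a parsing library inside `convert` and register the class so it appears in the UI and the capabilities endpoint.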
