SmartChunk
SmartChunk is a lightweight, structure-aware semantic chunking toolkit designed to supercharge RAG (Retrieval-Augmented Generation) and LLM pipelines. Unlike naive splitters that break text arbitrarily, SmartChunk respects document structure (headings, lists, tables, code blocks) and semantic flow, ensuring cleaner, more coherent chunks.
SmartChunk 🧩
Structure-aware semantic chunking for RAG/LLMs (test.pypi.org/project/smartchunk/)
SmartChunk is a Python package + CLI that creates higher-quality chunks for Retrieval-Augmented Generation (RAG) pipelines. Instead of breaking text blindly, SmartChunk respects structure and meaning — no more chopped sentences, broken code blocks, or messy lists.
The result? 👉 Better retrieval quality 👉 Lower token costs 👉 Chunks your LLM can actually understand
✨ Why SmartChunk?
Naive splitters cut text every N tokens. That causes:
- ❌ Broken headings, lists, or tables
- ❌ Incoherent fragments across paragraphs
- ❌ Duplicate/boilerplate content bloating your index
SmartChunk fixes this by combining structure awareness + semantic similarity.
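To make the difference concrete, here is a minimal sketch (not SmartChunk's actual implementation). It puts a naive splitter that cuts every N characters next to a toy splitter that only cuts at blank lines between blocks and treats a fenced code block as one unit. The names `naive_split`, `split_blocks`, and `block_split` are hypothetical and exist only for this illustration.

```python
FENCE = "`" * 3  # markdown code-fence marker, built this way so the example embeds cleanly


def naive_split(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_blocks(text: str) -> list[str]:
    """Split markdown into blocks at blank lines, keeping fenced code whole."""
    blocks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            # A blank line outside a fence closes the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def block_split(text: str, size: int) -> list[str]:
    """Greedily pack whole blocks into chunks of at most `size` characters.

    A single block longer than `size` is kept intact rather than cut.
    """
    chunks, current = [], ""
    for block in split_blocks(text):
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > size:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "\n".join([
    "# Setup",
    "",
    "Install the package first.",
    "",
    FENCE + "python",
    "import os",
    "",
    "print(os.getcwd())",
    FENCE,
    "",
    "- step one",
    "- step two",
])

naive = naive_split(doc, 40)  # cuts through the code fence
smart = block_split(doc, 40)  # heading + paragraph, code block, list
```

On this input the naive splitter separates the opening fence from its code, while the block-based splitter returns three chunks with the code block intact. It deliberately lets an oversized block exceed the limit instead of cutting it, which is one possible design choice among several.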
🧠 Key Features
- Structure-Aware Splitting: Never slices through a heading, list, table, or fenced code block.
- Semantic Boundary Detection: Uses embeddings to find natural breakpoints between topics.
- Noise & Duplication Guard: Strips headers/footers, removes near-duplicates, normalizes whitespace.
- Flexible & Tunable: Control chunk size, overlap, and semantic sensitivity to fit your pipeline.
- End-to-End Ready: From URL → parsed → cleaned → JSONL chunks in one command.
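The idea behind semantic boundary detection can be sketched in a few lines: embed each sentence, compare neighbors, and start a new chunk where similarity drops. This is an illustration under stated assumptions, not SmartChunk's internal code. The `embed` function is a toy bag-of-words stand-in for a real embedding model such as all-MiniLM-L6-v2, and `semantic_breaks` and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_breaks(sentences: list[str], threshold: float = 0.2) -> list[int]:
    """Return indices i where a new chunk should start before sentences[i]."""
    vectors = [embed(s) for s in sentences]
    return [
        i for i in range(1, len(vectors))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]


sentences = [
    "The cache stores recent query results in memory.",
    "Each cache entry expires after the query results go stale.",
    "Billing runs once a month for every active account.",
    "Each account receives a billing invoice by email.",
]
breaks = semantic_breaks(sentences)  # [2]: the topic shifts from caching to billing
```

A lower threshold produces fewer, larger chunks; a higher one splits more eagerly. With a real embedding model the similarity scores would reflect meaning rather than shared words, so the threshold would need retuning.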
⚡ Quickstart
1. Install
For hackathon/demo (TestPyPI):
pip install -i https://test.pypi.org/simple/ smartchunk
Once it is published to PyPI:
pip install smartchunk
2. Chunk a Document
smartchunk chunk README.md \
--mode markdown \
--max-tokens 500 \
--overlap 100 \
--semantic \
--semantic-model all-MiniLM-L6-v2 \
--format jsonl \
--output chunks.jsonl
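This README does not spell out how `--max-tokens` and `--overlap` interact. A common interpretation is a sliding window in which each chunk repeats the last `overlap` tokens of the previous one. The sketch below shows that interpretation only; `window_with_overlap` is a hypothetical helper, and the string tokens stand in for a real tokenizer's output.

```python
def window_with_overlap(tokens: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Slide a window of `max_tokens` over `tokens`, repeating `overlap` tokens between windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap  # how far each new window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


# Assumed stand-in for tokenizer output: 1200 dummy tokens.
tokens = [f"t{i}" for i in range(1200)]
windows = window_with_overlap(tokens, max_tokens=500, overlap=100)
```

With 500-token windows and an overlap of 100, each window advances by 400 tokens, so 1200 tokens yield windows of 500, 500, and 400 tokens. A larger overlap preserves more context across boundaries at the cost of indexing more duplicated text.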
3. Fetch & Chunk a URL
smartchunk fetch "https://en.wikipedia.org/wiki/Crayon_Shin-chan" \
--semantic \
--semantic-model all-MiniLM-L6-v2 \
--format table
4. Compare with a Naive Splitter
smartchunk compare README.md --mode markdown --max-chars 800
Prints a terminal table comparing naive vs SmartChunk side-by-side.
📦 Example Output
Each line in the .jsonl output is one coherent chunk with its metadata (pretty-printed here for readability):
{
  "id": "c0033",
  "text": "###### Opening\n\n \n [\n\n \n edit\n\n \n ]\n\n* Footage from Japanese opening 8 (\"PLEASURE\") but with completely different lyrics, to the melody of a techno remix of Japanese opening 3 (\"Ora wa Ninkimono\").Musical Director, Producer and English Director: World Worm Studios composerGary Gibbons",
  "header_path": "Media / Anime / Music / LUK Internacional dub / Opening",
  "start_line": 709,
  "end_line": 727
}
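Downstream code can consume the file with the standard library alone. The sketch below assumes one JSON object per line with the field names shown above; the `load_chunks` helper and the sample record are hypothetical and only serve to make the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def load_chunks(path: Path) -> list[dict]:
    """Read a JSONL file: one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical sample record, using only the field names shown above.
sample = {
    "id": "c0001",
    "text": "Example chunk text.",
    "header_path": "Guide / Setup",
    "start_line": 10,
    "end_line": 14,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chunks.jsonl"
    path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    chunks = load_chunks(path)

# Group chunk ids by their top-level heading for a quick overview.
by_section: dict[str, list[str]] = {}
for chunk in chunks:
    top = chunk["header_path"].split(" / ")[0]
    by_section.setdefault(top, []).append(chunk["id"])
```

In a real pipeline, `path` would point at the file written by `--output chunks.jsonl`, and `header_path` together with `start_line` and `end_line` can be stored alongside each embedding so retrieved chunks can be traced back to their source.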
💻 CLI Overview
- fetch → Fetch, parse & chunk a URL in one go
- chunk → Chunk a local file
- compare → Compare SmartChunk vs naive splitter (HTML report)
- stream → Stream chunks from STDIN in real-time
Run smartchunk --help for full options.
🤝 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines. By participating, you agree to follow our Code of Conduct.
🔑 License
MIT License. Free to use, modify, and share.
(In Simple Words) 📝
SmartChunk = “Don’t let your RAG cut sentences in half.” It’s the first step for any production-grade RAG pipeline: clean, coherent, AI-ready chunks.
