
SmartChunk

SmartChunk is a lightweight, structure-aware semantic chunking toolkit designed to supercharge RAG (Retrieval-Augmented Generation) and LLM pipelines. Unlike naive splitters that break text arbitrarily, SmartChunk respects document structure (headings, lists, tables, code blocks) and semantic flow, ensuring cleaner, more coherent chunks.

Install / Use

/learn @ayush585/SmartChunk

SmartChunk 🧩

Structure-aware semantic chunking for RAG/LLMs (test.pypi.org/project/smartchunk/)

SmartChunk is a Python package + CLI that creates higher-quality chunks for Retrieval-Augmented Generation (RAG) pipelines. Instead of breaking text blindly, SmartChunk respects structure and meaning — no more chopped sentences, broken code blocks, or messy lists.

The result? 👉 Better retrieval quality 👉 Lower token costs 👉 Chunks your LLM can actually understand


✨ Why SmartChunk?

Naive splitters cut text every N tokens. That causes:

  • ❌ Broken headings, lists, or tables
  • ❌ Incoherent fragments across paragraphs
  • ❌ Duplicate/boilerplate content bloating your index

SmartChunk fixes this by combining structure awareness + semantic similarity.
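To make the "structure awareness" half concrete, here is a minimal sketch of splitting Markdown into blocks without cutting through a fenced code block. It is an illustration only, not SmartChunk's actual implementation: the function name `split_blocks` and the two regular expressions are assumptions made for this example.

```python
import re

# Assumed patterns for this sketch: a fence opener and an ATX heading.
FENCE = re.compile(r"^(`{3,}|~{3,})")
HEADING = re.compile(r"^#{1,6}\s")


def split_blocks(markdown: str) -> list[str]:
    """Split Markdown into blocks, keeping fenced code blocks whole."""
    blocks: list[str] = []
    current: list[str] = []
    fence = None  # opening fence marker while inside a code block

    def flush() -> None:
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if fence is not None:
            # Inside a fence: keep every line, including blank ones.
            current.append(line)
            if stripped.startswith(fence) and set(stripped) == {fence[0]}:
                fence = None
                flush()
            continue
        opening = FENCE.match(line)
        if opening:
            flush()
            fence = opening.group(1)
            current.append(line)
        elif HEADING.match(line):
            flush()
            blocks.append(line)  # a heading is its own block
        elif not stripped:
            flush()  # a blank line ends a paragraph or list
        else:
            current.append(line)
    flush()
    return blocks


TICKS = "`" * 3  # built at runtime so this example nests cleanly in a fence
sample = "\n".join([
    "# Title",
    "",
    "Intro paragraph.",
    "",
    TICKS + "python",
    "x = 1",
    "",
    "y = 2",
    TICKS,
    "",
    "- item one",
    "- item two",
])
blocks = split_blocks(sample)
```

The blank line inside the code block does not end the block, while blank lines elsewhere do. A fuller version would also need to recognize tables and nested lists.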


🧠 Key Features

  • Structure-Aware Splitting: Never slices through a heading, list, table, or fenced code block.
  • Semantic Boundary Detection: Uses embeddings to find natural breakpoints between topics.
  • Noise & Duplication Guard: Strips headers/footers, removes near-duplicates, normalizes whitespace.
  • Flexible & Tunable: Control chunk size, overlap, and semantic sensitivity to fit your pipeline.
  • End-to-End Ready: From URL → parsed → cleaned → JSONL chunks in one command.
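As a rough illustration of semantic boundary detection, the sketch below starts a new group whenever two adjacent sentences are not similar enough. It is not SmartChunk's code: `embed` is a toy word-count stand-in for a real embedding model, and `semantic_groups` and its `threshold` default are hypothetical names and values chosen for this example.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_groups(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new group where adjacent sentences fall below the threshold."""
    groups: list[list[str]] = []
    for sentence in sentences:
        if groups and cosine(embed(groups[-1][-1]), embed(sentence)) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
    return groups


sentences = [
    "Chunk size controls how much text each chunk holds.",
    "A larger chunk size means fewer chunks per document.",
    "Install the package with pip.",
    "Then run pip to upgrade the package later.",
]
groups = semantic_groups(sentences)
```

Here the first two sentences stay together and the topic shift to installation opens a second group. With real embeddings the comparison logic stays the same; only `embed` changes, and the threshold plays the role of a sensitivity setting.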

⚡ Quickstart

1. Install

For hackathon/demo (TestPyPI):

pip install -i https://test.pypi.org/simple/ smartchunk

Once it is published to PyPI:

pip install smartchunk

2. Chunk a Document

smartchunk chunk README.md \
  --mode markdown \
  --max-tokens 500 \
  --overlap 100 \
  --semantic \
  --semantic-model all-MiniLM-L6-v2 \
  --format jsonl \
  --output chunks.jsonl
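One plausible reading of a token budget combined with an overlap is sketched below: whole units (sentences or blocks) are packed greedily up to the budget, and each new chunk begins with the tail of the previous one. This is an assumption about the general technique, not a description of what `--max-tokens` and `--overlap` do internally; `count_tokens` and `pack` are hypothetical helpers, and whitespace word counts stand in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Whitespace word count, a rough proxy for a real tokenizer."""
    return len(text.split())


def pack(units: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Greedily pack whole units into chunks, repeating a tail as overlap."""
    chunks: list[list[str]] = []
    current: list[str] = []
    for unit in units:
        used = count_tokens(" ".join(current))
        if current and used + count_tokens(unit) > max_tokens:
            chunks.append(current)
            # Carry over trailing units that fit within the overlap budget.
            tail: list[str] = []
            for prev in reversed(current):
                if count_tokens(" ".join(tail)) + count_tokens(prev) > overlap:
                    break
                tail.insert(0, prev)
            current = tail
        current.append(unit)
    if current:
        chunks.append(current)
    return chunks


units = [
    "one two three",
    "four five six",
    "seven eight nine",
    "ten eleven twelve",
]
chunks = pack(units, max_tokens=6, overlap=3)
```

Because units are never cut, a single unit larger than the budget still ends up whole in its own chunk. The sketch assumes the overlap is smaller than the budget.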

3. Fetch & Chunk a URL

smartchunk fetch "https://en.wikipedia.org/wiki/Crayon_Shin-chan" \
  --semantic \
  --semantic-model all-MiniLM-L6-v2 \
  --format table

4. Compare with a Naive Splitter

smartchunk compare README.md --mode markdown --max-chars 800

Prints a terminal table comparing naive vs SmartChunk side-by-side.
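For reference, a naive character-based baseline of the kind being compared against can be written in a few lines. This is a generic sketch, not the splitter the `compare` command uses; `naive_split` is a hypothetical name.

```python
def naive_split(text: str, max_chars: int) -> list[str]:
    """Cut every max_chars characters, ignoring sentences and structure."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


text = "SmartChunk keeps sentences whole. Naive splitters do not."
pieces = naive_split(text, max_chars=20)
# The first piece stops in the middle of the word "sentences".
```

Nothing is lost, since the pieces join back into the original text, but the boundaries fall wherever the character count happens to land.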


📦 Example Output

Each line in the .jsonl output is a coherent chunk with rich metadata:

{
  "id": "c0033",
  "text": "###### Opening\n\n \n        [\n\n \n         edit\n\n \n        ]\n\n* Footage from Japanese opening 8 (\"PLEASURE\") but with completely different lyrics, to the melody of a techno remix of Japanese opening 3 (\"Ora wa Ninkimono\").Musical Director, Producer and English Director: World Worm Studios composerGary Gibbons",
  "header_path": "Media / Anime / Music / LUK Internacional dub / Opening",
  "start_line": 709,
  "end_line": 727
}
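A minimal sketch of consuming this output in Python follows. It assumes only the fields shown in the example (`id`, `text`, `header_path`, `start_line`, `end_line`) and the " / " separator seen in `header_path`. The helper names and the two sample records are hypothetical.

```python
import json
from collections import defaultdict


def load_chunks(lines):
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_section(chunks):
    """Map the top-level part of header_path to the ids of its chunks."""
    sections = defaultdict(list)
    for chunk in chunks:
        top = chunk["header_path"].split(" / ")[0]
        sections[top].append(chunk["id"])
    return dict(sections)


# Hypothetical two-record sample with the same fields as the example above.
sample = [
    '{"id": "c0001", "text": "Intro", "header_path": "Media / Anime", '
    '"start_line": 1, "end_line": 4}',
    '{"id": "c0002", "text": "Cast", "header_path": "Characters", '
    '"start_line": 5, "end_line": 9}',
]
chunks = load_chunks(sample)
sections = group_by_section(chunks)
```

To read a real file, pass an open file handle instead of the sample list, for example `load_chunks(open("chunks.jsonl", encoding="utf-8"))`. Grouping by `header_path` is one way to filter retrieval results by section.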

💻 CLI Overview

  • fetch → Fetch, parse & chunk a URL in one go
  • chunk → Chunk a local file
  • compare → Compare SmartChunk vs naive splitter (HTML report)
  • stream → Stream chunks from STDIN in real time

Run smartchunk --help for full options.


🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines. By participating, you agree to follow our Code of Conduct.


🔑 License

MIT License. Free to use, modify, and share.


(In Simple Words) 📝

SmartChunk = “Don’t let your RAG cut sentences in half.” It’s the first step for any production-grade RAG pipeline: clean, coherent, AI-ready chunks.
