LumberChunker 🪓
This is the official repository for the paper LumberChunker: Long-Form Narrative Document Segmentation by André V. Duarte, João D.S. Marques, Miguel Graça, Miguel Freire, Lei Li and Arlindo L. Oliveira<br>
Quick links:
Paper | Blog Post | GutenQA Dataset
LumberChunker is a method leveraging an LLM to dynamically segment documents into semantically independent chunks. It iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift.
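The iterative procedure can be sketched as follows. This is an illustrative outline, not the paper's exact prompt or token budget, and `ask_llm` is a hypothetical stand-in for a real Gemini/ChatGPT call (here it simply returns a fixed split point so the sketch is runnable):

```python
def ask_llm(prompt: str) -> int:
    # Hypothetical stand-in: a real implementation would query an LLM with the
    # numbered passages and parse the ID it returns. Here we just pretend the
    # model always flags the middle passage as the point where content shifts.
    n = prompt.count("ID ")
    return max(1, n // 2)

def lumberchunker(paragraphs, max_words=550):
    """Iteratively group sequential paragraphs and cut where content shifts."""
    chunks, start = [], 0
    while start < len(paragraphs):
        # Grow a candidate group of sequential paragraphs up to a word budget.
        group, words, i = [], 0, start
        while i < len(paragraphs) and words < max_words:
            group.append(paragraphs[i])
            words += len(paragraphs[i].split())
            i += 1
        if len(group) == 1:
            chunks.append(group[0])
            start = i
            continue
        # Ask the LLM where, within this group, the content begins to shift.
        prompt = "\n".join(f"ID {j}: {p}" for j, p in enumerate(group))
        split = ask_llm(prompt)
        split = min(max(split, 1), len(group) - 1)  # keep the split in range
        chunks.append(" ".join(group[:split]))
        start += split  # the next group begins at the detected shift point
    return chunks
```

The loop never discards text: paragraphs after the detected shift point are carried into the next candidate group.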

LumberChunker Example - Segmenting a Book
⚠ Important: Whether using Gemini or ChatGPT, don't forget to set your API key (or Project ID and Location, for Gemini) in LumberChunker-Segmentation.py<br>
python LumberChunker-Segmentation.py --out_path <output directory path> --model_type <Gemini | ChatGPT> --book_name <target book name>
📚 GutenQA
GutenQA consists of book passages manually extracted from Project Gutenberg and subsequently segmented using LumberChunker.<br> It features 100 public-domain narrative books and 30 question-answer pairs per book.<br>
The dataset is organized into the following columns:
- Book Name: The title of the book from which the passage is extracted.
- Book ID: A unique integer identifier assigned to each book.
- Chunk ID: An integer identifier for each chunk of the book. Chunks are listed in the sequence they appear in the book.
- Chapter: The name(s) of the chapter(s) from which the chunk is derived. If LumberChunker merged paragraphs from multiple chapters, the names of all relevant chapters are included.
- Question: A question pertaining to the specific chunk of text. Note that not every chunk has an associated question, as only 30 questions are generated per book.
- Answer: The answer corresponding to the question related to that chunk.
- Chunk Must Contain: A specific substring from the chunk indicating where the answer can be found. This ensures that, despite the chunking methodology, the correct chunk includes this particular string.
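As an illustration of how the Chunk Must Contain field can be used, here is a tiny synthetic example (the row contents and the `Chunk` text column are invented for this sketch, not actual GutenQA data):

```python
import pandas as pd

# A synthetic row following the GutenQA column layout (invented data) showing
# how "Chunk Must Contain" verifies that a retrieved chunk is the correct one.
df = pd.DataFrame([{
    "Book Name": "Example Book",
    "Book ID": 1,
    "Chunk ID": 42,
    "Chapter": "Chapter I",
    "Question": "Who visits the narrator at midnight?",
    "Answer": "A raven.",
    "Chunk Must Contain": "a raven tapped at the window",
    "Chunk": "At midnight, a raven tapped at the window of the study.",
}])

def chunk_is_correct(row) -> bool:
    # A retrieved chunk counts as correct if it contains the required substring.
    return row["Chunk Must Contain"] in row["Chunk"]

print(df.apply(chunk_is_correct, axis=1).all())
```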
📖 GutenQA Alternative Chunking Formats (Used for Baseline Methods)
We also release the same corpus present in GutenQA at different chunk granularities:
- Paragraph: Books are extracted manually from Project Gutenberg. This is the format of the extraction prior to segmentation with LumberChunker.
- Recursive Chunks: Documents are segmented based on a hierarchy of separators such as paragraph breaks, new lines, spaces, and individual characters, using LangChain's RecursiveCharacterTextSplitter.
- Semantic Chunks: Paragraph chunks are embedded with OpenAI's text-embedding-ada-002. Text is segmented by identifying break points where the embedding distance between adjacent chunks changes significantly.
- Propositions: Text is segmented as introduced in the paper Dense X Retrieval. Generated questions are provided along with the correct Proposition Answer.
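The idea behind the semantic-chunking baseline can be sketched with mock embeddings (a real run would embed paragraphs with a model such as text-embedding-ada-002; here two synthetic topic clusters stand in, and the percentile threshold is an illustrative choice, not the baseline's exact setting):

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_breakpoints(embeddings, percentile=90):
    # Distance between each pair of adjacent paragraph embeddings.
    dists = np.array([cosine_distance(embeddings[i], embeddings[i + 1])
                      for i in range(len(embeddings) - 1)])
    # Declare a break point wherever the distance jump is unusually large.
    threshold = np.percentile(dists, percentile)
    return [i + 1 for i, d in enumerate(dists) if d > threshold]

# Mock embeddings: two clearly separated topic clusters around orthogonal
# base vectors, so the only large adjacent distance is at the topic boundary.
rng = np.random.default_rng(0)
base_a = np.zeros(8); base_a[0] = 10.0
base_b = np.zeros(8); base_b[1] = 10.0
topic_a = base_a + rng.normal(scale=0.1, size=(4, 8))
topic_b = base_b + rng.normal(scale=0.1, size=(4, 8))
embeddings = np.vstack([topic_a, topic_b])
print(semantic_breakpoints(embeddings))  # → [4]
```

The returned index marks where a new chunk should begin; everything before index 4 forms one chunk, everything after it another.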
🤝 Compatibility
LumberChunker is compatible with any LLM with strong reasoning capabilities.<br>
- In our code, we provide implementations for Gemini and ChatGPT, but models like LLaMA-3, Mixtral 8x7B, or Command R+ can also be used.<br>
💬 Citation
If you find this work useful, please consider citing our paper:
@inproceedings{duarte-etal-2024-lumberchunker,
title = "{L}umber{C}hunker: Long-Form Narrative Document Segmentation",
author = "Duarte, Andr{\'e} V. and Marques, Jo{\~a}o DS and Gra{\c{c}}a, Miguel and Freire, Miguel and Li, Lei and Oliveira, Arlindo L.",
editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.377/",
doi = "10.18653/v1/2024.findings-emnlp.377",
pages = "6473--6486"
}
