
LumberChunker 🪓

This is the official repository for the paper LumberChunker: Long-Form Narrative Document Segmentation by André V. Duarte, João D.S. Marques, Miguel Graça, Miguel Freire, Lei Li and Arlindo L. Oliveira (accepted at EMNLP 2024 Findings).

Quick links:
Paper | Blog Post | GutenQA Dataset

LumberChunker is a method leveraging an LLM to dynamically segment documents into semantically independent chunks. It iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift.
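The iterative loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `ask_llm` stands in for the actual Gemini/ChatGPT prompt, and the real method sizes each candidate group by token count rather than a fixed number of passages.

```python
# Minimal sketch of the LumberChunker loop. Assumptions: `passages` is a list of
# sequential paragraph strings, and `ask_llm` is any callable that, given a group
# of passages, returns the index within the group where the content starts to shift.
def lumber_chunk(passages, ask_llm, group_size=5):
    """Greedily merge sequential passages, cutting where the LLM reports a
    content shift. Simplified: the paper builds groups by token budget,
    not by a fixed passage count."""
    chunks = []
    start = 0
    while start < len(passages):
        group = passages[start:start + group_size]
        if len(group) == 1:                      # nothing left to compare against
            chunks.append(group[0])
            break
        shift = ask_llm(group)                   # index (within group) of the shift point
        shift = max(1, min(shift, len(group)))   # guard against degenerate answers
        chunks.append(" ".join(group[:shift]))
        start += shift                           # resume from the shift point
    return chunks

# Toy stand-in "LLM" that flags a shift whenever the leading word changes.
def toy_llm(group):
    first = group[0].split()[0]
    for i, p in enumerate(group[1:], start=1):
        if p.split()[0] != first:
            return i
    return len(group)

passages = ["cat a", "cat b", "dog c", "dog d", "bird e"]
print(lumber_chunk(passages, toy_llm))  # ['cat a cat b', 'dog c dog d', 'bird e']
```

Note how each chunk boundary is decided jointly over a window of passages rather than pairwise, which is what lets the method capture gradual topic drift.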



LumberChunker Example - Segmenting a Book

⚠ Important: Whether using Gemini or ChatGPT, don't forget to add the API key / (Project ID, Location) in LumberChunker-Segmentation.py

python LumberChunker-Segmentation.py --out_path <output directory path> --model_type <Gemini | ChatGPT> --book_name <target book name>

📚 GutenQA

GutenQA consists of book passages manually extracted from Project Gutenberg and subsequently segmented using LumberChunker. It features 100 public-domain narrative books and 30 question–answer pairs per book.

The dataset is organized into the following columns:

  • Book Name: The title of the book from which the passage is extracted.
  • Book ID: A unique integer identifier assigned to each book.
  • Chunk ID: An integer identifier for each chunk of the book. Chunks are listed in the sequence they appear in the book.
  • Chapter: The name(s) of the chapter(s) from which the chunk is derived. If LumberChunker merged paragraphs from multiple chapters, the names of all relevant chapters are included.
  • Question: A question pertaining to the specific chunk of text. Note that not every chunk has an associated question, as only 30 questions are generated per book.
  • Answer: The answer corresponding to the question related to that chunk.
  • Chunk Must Contain: A specific substring from the chunk indicating where the answer can be found. This ensures that, despite the chunking methodology, the correct chunk includes this particular string.
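The columns above can be illustrated with a couple of toy rows. The values below are made up for illustration (the real dataset ships 100 books with 30 QA pairs each); the helper shows how "Chunk Must Contain" decouples retrieval evaluation from any particular chunking: a retrieved chunk counts as correct as long as it contains the gold substring.

```python
# Illustrative rows mirroring the GutenQA schema (values are invented examples).
rows = [
    {"Book Name": "A Christmas Carol", "Book ID": 1, "Chunk ID": 0,
     "Chapter": "Stave One", "Question": "Who was Scrooge's late partner?",
     "Answer": "Jacob Marley", "Chunk Must Contain": "Marley was dead"},
    {"Book Name": "A Christmas Carol", "Book ID": 1, "Chunk ID": 1,
     "Chapter": "Stave One", "Question": None, "Answer": None,
     "Chunk Must Contain": None},  # most chunks carry no question
]

def is_correct_retrieval(row, retrieved_chunk):
    """True when the retrieved chunk contains the gold substring.
    Only rows that carry a question (non-None substring) are evaluated."""
    return row["Chunk Must Contain"] in retrieved_chunk

print(is_correct_retrieval(rows[0], "Marley was dead: to begin with."))  # True
```

Because the check is substring containment, the same QA pairs remain usable even when the corpus is re-segmented with a different chunking method.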

📖 GutenQA Alternative Chunking Formats (Used for Baseline Methods)

We also release the same corpus as GutenQA with different chunk granularities.

  • Paragraph: Books are extracted manually from Project Gutenberg. This is the format of the extraction prior to segmentation with LumberChunker.
  • Recursive Chunks: Documents are segmented based on a hierarchy of separators such as paragraph breaks, new lines, spaces, and individual characters, using Langchain's RecursiveCharacterTextSplitter function.
  • Semantic Chunks: Paragraph chunks are embedded with OpenAI's text-embedding-ada-002. Text is segmented by identifying break points where the embedding distance between adjacent chunks changes significantly.
  • Propositions: Text is segmented as introduced in the paper Dense X Retrieval. Generated questions are provided along with the correct Proposition Answer.
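The "Recursive Chunks" baseline can be sketched in a few lines: split on the coarsest separator in a hierarchy, then recurse into any piece that still exceeds the size limit. LangChain's RecursiveCharacterTextSplitter implements this idea with more care (including merging small pieces back up to the limit); the separator list and limit here are illustrative only.

```python
# Toy recursive splitter: try separators from coarse to fine, recursing into
# oversized pieces. Simplified relative to LangChain's version, which also
# re-merges adjacent small pieces up to the chunk size.
def recursive_split(text, max_len=40, separators=("\n\n", "\n", " ")):
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    if len(pieces) == 1:                      # separator absent: try the next one
        return recursive_split(text, max_len, rest)
    out = []
    for piece in pieces:
        if len(piece) > max_len:
            out.extend(recursive_split(piece, max_len, rest))
        else:
            out.append(piece)
    return out
```

The coarse-to-fine hierarchy is what keeps chunks aligned with natural boundaries (paragraphs before lines before words) whenever the size budget allows it.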

🤝 Compatibility

LumberChunker is compatible with any LLM with strong reasoning capabilities.

  • Our code provides implementations for Gemini and ChatGPT, but models such as LLaMA-3, Mixtral 8x7B or Command R+ can also be used.

💬 Citation

If you find this work useful, please consider citing our paper:

@inproceedings{duarte-etal-2024-lumberchunker,
    title = "{L}umber{C}hunker: Long-Form Narrative Document Segmentation",
    author = "Duarte, Andr{\'e} V.  and Marques, Jo{\~a}o DS  and Gra{\c{c}}a, Miguel  and Freire, Miguel  and Li, Lei  and Oliveira, Arlindo L.",
    editor = "Al-Onaizan, Yaser  and Bansal, Mohit  and Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.377/",
    doi = "10.18653/v1/2024.findings-emnlp.377",
    pages = "6473--6486"
}