# PageIndex: Vectorless, Reasoning-based RAG
<p align="center"><b>Reasoning-based RAG • No Vector DB • No Chunking • Human-like Retrieval</b></p>

<h4 align="center">
<a href="https://vectify.ai">Homepage</a> •
<a href="https://chat.pageindex.ai">Chat Platform</a> •
<a href="https://pageindex.ai/mcp">MCP</a> •
<a href="https://docs.pageindex.ai">Docs</a> •
<a href="https://discord.com/invite/VuXuf29EUj">Discord</a> •
<a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">Contact</a>
</h4>

## Updates

- Agentic Vectorless RAG Example: a complete agentic, vectorless RAG example with self-hosted PageIndex, using the OpenAI Agents SDK.
- PageIndex Chat: a human-like document-analysis agent platform for professional long documents. Also available via MCP or API.
- PageIndex Framework: an agentic, in-context tree index that enables LLMs to perform reasoning-based, human-like retrieval over long documents.
## Introduction to PageIndex
Are you frustrated with vector-database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic similarity rather than true relevance. But similarity ≠ relevance: what we actually need in retrieval is relevance, and relevance requires reasoning. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.
Inspired by AlphaGo, we propose PageIndex, a vectorless, reasoning-based RAG system that builds a hierarchical tree index from long documents and uses LLMs to reason over that index for agentic, context-aware retrieval. It simulates how human experts navigate and extract knowledge from complex documents through tree search, enabling LLMs to think and reason their way to the most relevant document sections. PageIndex performs retrieval in two steps:
- Generate a "Table of Contents" tree structure index of documents
- Perform reasoning-based retrieval through tree search
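The two retrieval steps can be sketched in a few lines of Python. This is an illustrative sketch, not the repo's actual implementation: the node fields (`title`, `node_id`, `summary`, `nodes`) follow the example tree shown later in this README, while `build_prompt` and `parse_selection` are hypothetical helpers around an LLM call, which is omitted here.

```python
import json

# A tiny tree node in the PageIndex-style schema (illustrative).
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "summary": "The Federal Reserve ...",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "summary": "The Federal Reserve's monitoring ...", "nodes": []},
    ],
}

def build_prompt(query: str, node: dict) -> str:
    """Step 2a: show the LLM the current node and its children so it can
    reason about which branches are relevant to the query."""
    children = "\n".join(
        f'- [{c["node_id"]}] {c["title"]}: {c["summary"]}' for c in node["nodes"]
    )
    return (
        f"Question: {query}\n"
        f"Current section: {node['title']}\n"
        f"Child sections:\n{children}\n"
        'Reply with JSON: {"node_ids": [...]} listing the relevant children.'
    )

def parse_selection(llm_reply: str) -> list[str]:
    """Step 2b: extract the node_ids the LLM selected, then recurse
    into those children until leaf sections are reached."""
    return json.loads(llm_reply)["node_ids"]
```

In a full loop, `build_prompt` would be sent to an LLM at each tree level and `parse_selection` applied to its reply, descending until the relevant leaf sections are found.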
## Core Features
Compared to traditional vector-based RAG, PageIndex features:
- No Vector DB: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.
- No Chunking: Documents are organized into natural sections, not artificial chunks.
- Human-like Retrieval: Simulates how human experts navigate and extract knowledge from complex documents.
- Better Explainability and Traceability: Retrieval is based on reasoning, so it is traceable and interpretable, with page and section references. No more opaque, approximate vector search ("vibe retrieval").
PageIndex powers a reasoning-based RAG system that achieved state-of-the-art 98.7% accuracy on FinanceBench, demonstrating superior performance over vector-based RAG solutions in professional document analysis (see our blog post for details).
## Explore PageIndex
To learn more, please see a detailed introduction of the PageIndex framework. Check out this GitHub repo for open-source code, and the cookbooks, tutorials, and blog for additional usage guides and examples.
The PageIndex service is available as a ChatGPT-style chat platform, or can be integrated via MCP or API.
## Deployment Options
- Self-host: run locally with this open-source repo.
- Cloud Service: try it instantly on our Chat Platform, or integrate via MCP or API.
- Enterprise: private or on-prem deployment. Contact us or book a demo for details.
## Quick Hands-on
- Agentic Vectorless RAG (latest): a complete agentic, vectorless RAG example with self-hosted PageIndex, using the OpenAI Agents SDK.
- Vectorless RAG notebook: a minimal, hands-on example of reasoning-based RAG using PageIndex.
- Vision-based Vectorless RAG: a minimal, vision-based, reasoning-native RAG pipeline that works directly over page images, with no OCR.
## PageIndex Tree Structure
PageIndex can transform lengthy PDF documents into a semantic tree structure, similar to a "table of contents" but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.
Below is an example PageIndex tree structure. Also see more example documents and generated tree structures.
```json
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
```
You can generate the PageIndex tree structure with this open-source repo, or use our API.
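Because the generated tree is plain JSON, retrieval utilities need no special infrastructure. Below is a minimal, hypothetical traversal sketch using the node fields from the example above (the toy page ranges are adjusted so the parent spans its children, and summaries are omitted for brevity):

```python
def iter_nodes(node):
    """Depth-first walk over a PageIndex-style tree."""
    yield node
    for child in node.get("nodes", []):
        yield from iter_nodes(child)

def nodes_covering_page(root, page):
    """Find every node whose page range [start_index, end_index] contains `page`."""
    return [n for n in iter_nodes(root)
            if n["start_index"] <= page <= n["end_index"]]

# Toy tree shaped like the example above.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 31,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28, "nodes": []},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31, "nodes": []},
    ],
}

print([n["node_id"] for n in nodes_covering_page(tree, 25)])  # ['0006', '0007']
```

The same walk can be used to collect node summaries for an LLM prompt, or to map a retrieved node back to its page range for citation.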
## Package Usage
You can follow these steps to generate a PageIndex tree from a PDF document.
1. Install dependencies

   ```bash
   pip3 install --upgrade -r requirements.txt
   ```
2. Set your LLM API key

   Create a `.env` file in the root directory with your LLM API key (multiple LLM providers are supported via LiteLLM):

   ```bash
   OPENAI_API_KEY=your_openai_key_here
   ```
3. Generate the PageIndex structure for your PDF

   ```bash
   python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
   ```
<details>
<summary>Optional parameters</summary>
<br>
You can customize the processing with additional optional arguments:

```
--model                   LLM model to use (default: gpt-4o-2024-11-20)
--toc-check-pages         Pages to check for a table of contents (default: 20)
--max-pages-per-node      Max pages per node (default: 10)
--max-tokens-per-node     Max tokens per node (default: 20000)
--if-add-node-id          Add node IDs (yes/no, default: yes)
--if-add-node-summary     Add node summaries (yes/no, default: yes)
--if-add-doc-description  Add a document description (yes/no, default: yes)
```
</details>
<details>
<summary>Markdown support</summary>
<br>
We also provide Markdown support for PageIndex.