HashIndex ⚡️

Ultra-fast, LLM-optimized document indexing in Python.

Built by the team at Pardus AI – The fastest AI Data Analysis Platform.

HashIndex is the core indexing engine we use at Pardus AI to process 50MB+ CSVs and PDFs in seconds. We are open-sourcing our Python implementation so you can build better RAG pipelines without the bloat of LangChain.

Want to analyze documents without coding? Try our no-code platform: Pardus AI Dashboard (Free for huge files).

Installation

# Clone the repository
git clone https://github.com/JasonHonKL/HashIndex.git
cd HashIndex

# Install with uv (recommended - faster and more reliable)
uv venv                    # Create virtual environment
uv sync                    # Install dependencies and package in editable mode
source .venv/bin/activate  # Activate the virtual environment (Linux/Mac)
# or
.venv\Scripts\activate     # Activate the virtual environment (Windows)

# Alternatively, install with pip
pip install -e .

Usage

As a Python Library

from hashindex import index_pdf, query_index, HashIndex

# Index a PDF document
index = index_pdf("document.pdf")

# Save the index
index.save("document.index.json")

# Load an existing index
index = HashIndex.load("document.index.json")

# Query the index
answer = query_index(index, "What is the main conclusion?")
print(answer)

Using the CLI

cp .env.example .env

Then edit .env to set your configuration. Almost all APIs are supported.

Advanced Usage

from hashindex import HashIndex, Model, ListKeys, GetSummary, GetContent

# Create a custom model
model = Model(model="anthropic/claude-3.5-sonnet")

# Work with index objects directly
index = HashIndex()
# ... customize indexing logic ...

# Use verbose=False for silent operation
from hashindex import index_pdf, query_index
index = index_pdf("document.pdf", verbose=False)
answer = query_index(index, "Your question", verbose=False)

# Access pages directly
for key, obj in index.PageTable.items():
    print(f"{key}: {obj.summary}")

Comparative Analysis

HashIndex outperforms standard paradigms in specific Long-Context Narrative tasks where causality matters more than keyword matching.

| Method | Topology | Context Management | Robustness (Unstructured Data) | Latency |
| ---------------- | ------------------- | --------------------------- | ------------------------------ | --------------- |
| Vector RAG | Disconnected Chunks | Additive (FIFO overflow) | High | Low (O(1)) |
| PageIndex | Hierarchical Tree | Path-Dependent | Low (Requires Clean Headers) | High (O(log n)) |
| RAPTOR | Recursive Tree | Cluster-Based | Medium | Medium |
| HashIndex (Ours) | Hash Table | Dynamic Pruning (Agent-led) | High (Mechanical Split) | Medium-Low |

By treating document chunks as hash table entries rather than vector embeddings, HashIndex is designed to avoid the 'Lost in the Middle' phenomenon, in which relevant passages buried in the middle of a long context are overlooked.
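
To make the hash table idea concrete, here is a minimal sketch of what a mechanically split, key-addressed page table could look like. It is an illustration only, not HashIndex's actual implementation: the names Page, mechanical_split, build_page_table, list_keys and get_content are hypothetical, and the "summary" is a trivial stand-in for what would presumably be an LLM-written summary.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One fixed-size chunk of a document (hypothetical structure)."""
    summary: str
    content: str


def mechanical_split(text: str, chunk_size: int) -> list[str]:
    """Cut text into fixed-size chunks, ignoring headers and layout."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_page_table(text: str, chunk_size: int = 40) -> dict[str, Page]:
    """Store every chunk in a dict under a stable key such as 'page_0'."""
    table: dict[str, Page] = {}
    for n, chunk in enumerate(mechanical_split(text, chunk_size)):
        # Assumption: a real system would ask an LLM for this summary.
        # Here the first four words of the chunk stand in for it.
        summary = " ".join(chunk.split()[:4])
        table[f"page_{n}"] = Page(summary=summary, content=chunk)
    return table


def list_keys(table: dict[str, Page]) -> dict[str, str]:
    """Return key -> summary, the compact view an agent would read first."""
    return {key: page.summary for key, page in table.items()}


def get_content(table: dict[str, Page], key: str) -> str:
    """Fetch the full text of one page by key (a plain dict lookup)."""
    return table[key].content


if __name__ == "__main__":
    text = "The quick brown fox jumps over the lazy dog. " * 3
    table = build_page_table(text, chunk_size=45)
    for key, summary in list_keys(table).items():
        print(key, "->", summary)
    print(get_content(table, "page_1"))
```

In this sketch an agent would read only the short summaries, pick the keys that look relevant, and pull full content for just those pages, so pages it never requests never enter the context.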

Citation

If you use HashIndex in your research or project, please cite it as follows:

@software{HashIndex2026,
  author = {Hon, Jason and Pardus AI Team},
  title = {HashIndex: LLM-optimized Document Indexing without vector search},
  year = {2026},
  publisher = {Pardus AI}
}
