PdfGptIndexer
RAG based tool for indexing and searching PDF text data using OpenAI API and FAISS (Facebook AI Similarity Search) index, designed for rapid information retrieval and superior search accuracy.
PdfGptIndexer was featured at the top of Hacker News! <img width="1139" alt="Screenshot 2024-05-18 at 9 38 18 AM" src="https://github.com/raghavan/raghavan/assets/131585/24215a9a-d423-45a8-8c4d-d9ee8b1ec752">
Description
PdfGptIndexer is a tool for indexing and querying PDF documents using OpenAI embeddings and FAISS (Facebook AI Similarity Search). It implements a Retrieval-Augmented Generation (RAG) system that lets you ask questions about your PDFs in natural language and get answers grounded in their text. Embeddings are computed once and stored locally, so documents are not re-embedded on every query.
How It Works
PdfGptIndexer consists of two main components:
1. Indexer (indexer.py) - One-time PDF Processing
The indexer processes your PDF documents and creates a searchable vector database:
- Extract Text: Uses PyMuPDF to extract text from all PDF files in a folder
- Chunk Text: Splits documents into manageable chunks (1000 characters with 200-character overlap) using LangChain's RecursiveCharacterTextSplitter
- Generate Embeddings: Creates vector embeddings for each chunk using OpenAI's text-embedding-ada-002 model
- Store Locally: Saves the embeddings in a FAISS index on disk for fast retrieval
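To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap, using the 1000/200 character sizes mentioned above. It is a simplified stand-in, not the project's code: LangChain's RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which this sliding window does not do. The function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows that share `overlap` characters.

    Simplified illustration only; it cuts at exact character offsets.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # each chunk starts this far after the previous one
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.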
2. Chatbot (chatbot.py) - Interactive Q&A Interface
The chatbot provides an intelligent interface to query your indexed documents:
- Load Index: Loads the pre-computed FAISS vector index from disk
- Semantic Search: Converts your question into an embedding and finds the top 3 most similar document chunks
- Display Matches: Shows you the similarity scores and text snippets from matched documents
- Generate Answer: Uses GPT-4 to synthesize a coherent answer based on the retrieved context
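The retrieval and prompt-building steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the chatbot's actual code: it uses brute-force cosine similarity over toy vectors in place of a FAISS index, and the helper names (`cosine_similarity`, `top_k`, `build_prompt`) and the prompt wording are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          chunks: Sequence[str],
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k (score, chunk) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk)
              for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, matches: Sequence[Tuple[float, str]]) -> str:
    """Join the retrieved chunks into a context block followed by the question."""
    context = "\n\n".join(chunk for _, chunk in matches)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Toy data: in the real tool the vectors would come from the embedding model.
chunks = ["refund policy", "shipping times", "warranty terms", "office hours"]
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
matches = top_k([1.0, 0.0, 0.0], vectors, chunks, k=3)
prompt = build_prompt("What is the refund policy?", matches)
```

The resulting prompt is what would be sent to the chat model; restricting the context to the top matches keeps the request small and keeps the answer tied to the retrieved text.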
Advantages of Storing Embeddings Locally
Storing embeddings locally provides several key benefits:
- Speed: Retrieval is significantly faster as embeddings are pre-computed—no need to regenerate them for each query
- Offline Access: After initial creation, the document index lives entirely on disk and needs no network access; at query time only embedding the question and generating the answer call the OpenAI API
- Cost Savings: Compute embeddings once and reuse them, saving on API costs
- Scalability: Makes it feasible to work with large document collections that would be expensive to process in real-time
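The "compute once, reuse" idea behind these benefits can be illustrated with a small on-disk cache. This is a sketch of the principle only: PdfGptIndexer stores vectors in a FAISS index, not in a JSON file, and the `EmbeddingCache` class and its `embed_fn` parameter are hypothetical names. In practice `embed_fn` would wrap a call to the embedding API.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List


class EmbeddingCache:
    """Persist embeddings keyed by a hash of the chunk text."""

    def __init__(self, path: Path, embed_fn: Callable[[str], List[float]]):
        self.path = Path(path)
        self.embed_fn = embed_fn
        self.calls = 0  # how many times embed_fn was actually invoked
        # Reuse vectors saved by an earlier run, if any.
        self.store: Dict[str, List[float]] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.store[key] = self.embed_fn(text)  # only cache misses cost a call
            self.calls += 1
        return self.store[key]

    def save(self) -> None:
        self.path.write_text(json.dumps(self.store))
```

After the first run, `calls` stays at zero for any chunk already on disk, which is the source of both the speed and the cost savings described above.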
Getting Started
Prerequisites
- Python 3.8 or higher
- OpenAI API key
1. Installation
Clone the repository:
git clone https://github.com/raghavan/PdfGptIndexer.git
cd PdfGptIndexer
Install dependencies:
pip install -r requirements.txt
Or install manually:
pip install langchain langchain-openai langchain-community langchain-text-splitters openai pymupdf faiss-cpu python-dotenv tiktoken
2. Configuration
Create a .env file in the project root and add your OpenAI API key:
OPENAI_API_KEY=your_openai_api_key_here
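A quick way to confirm the key is picked up before running the indexer is a check like the one below. It is a sketch, not part of the project: `require_api_key` is a hypothetical helper, and it assumes the scripts read the key from the `OPENAI_API_KEY` environment variable, with python-dotenv loading the `.env` file when installed.

```python
import os
from typing import Mapping, Optional

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
    load_dotenv()  # copies entries from a .env file into os.environ, if one is found
except ImportError:
    pass  # fall back to whatever is already exported in the shell


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key, or raise if it is missing or still the placeholder."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == "your_openai_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is missing; add it to .env or export it")
    return key
```

Calling `require_api_key()` at the top of a script fails fast with a clear message instead of a later authentication error from the API.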
3. Prepare Your PDFs
Place your PDF files in the pdf/ folder (or any folder of your choice).
Usage
Step 1: Index Your PDFs
Run the indexer to process your PDFs and create the vector database:
python indexer.py
Or specify a custom PDF folder:
python indexer.py /path/to/your/pdfs
Or specify both custom PDF folder and index location:
python indexer.py /path/to/your/pdfs /path/to/save/index
What happens:
- Extracts text from all PDFs in the folder
- Creates text chunks with metadata
- Generates embeddings using OpenAI
- Saves the FAISS index to faiss_index/ (or your specified location)
Note: You only need to run this once, or when you add new PDFs to your collection.
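The three invocation forms shown above suggest two optional positional arguments with defaults. A minimal sketch of that resolution follows; it assumes the defaults are `pdf` and `faiss_index` as the text implies, and the function names `resolve_paths` and `list_pdfs` are hypothetical rather than taken from indexer.py.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def resolve_paths(argv: Sequence[str]) -> Tuple[Path, Path]:
    """Map optional positional arguments to (pdf_dir, index_dir).

    `argv` is the argument list without the script name, e.g. sys.argv[1:].
    """
    pdf_dir = Path(argv[0]) if len(argv) >= 1 else Path("pdf")
    index_dir = Path(argv[1]) if len(argv) >= 2 else Path("faiss_index")
    return pdf_dir, index_dir


def list_pdfs(pdf_dir: Path) -> List[Path]:
    """Return the PDF files directly inside pdf_dir (empty if it does not exist)."""
    if not pdf_dir.is_dir():
        return []
    return sorted(p for p in pdf_dir.iterdir() if p.suffix.lower() == ".pdf")
```

For example, `resolve_paths(["/path/to/your/pdfs"])` keeps the default index location, matching the second command form above.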
Step 2: Query Your Documents
Start the interactive chatbot:
python chatbot.py
Or specify a custom index location:
python chatbot.py /path/to/your/index
