PdfGptIndexer

PdfGptIndexer was featured at the top of Hacker News! <img width="1139" alt="Screenshot 2024-05-18 at 9 38 18 AM" src="https://github.com/raghavan/raghavan/assets/131585/24215a9a-d423-45a8-8c4d-d9ee8b1ec752">

Description

PdfGptIndexer is a tool for indexing and querying PDF documents using OpenAI embeddings and FAISS (Facebook AI Similarity Search). It implements a RAG (Retrieval-Augmented Generation) system that lets you ask natural-language questions about your PDF documents. Because document embeddings are computed once and stored locally, each query only needs a similarity search over the saved index rather than reprocessing the PDFs.

How It Works

PdfGptIndexer consists of two main components:

1. Indexer (`indexer.py`) - One-time PDF Processing

The indexer processes your PDF documents and creates a searchable vector database:

Extract Text: Uses PyMuPDF to extract text from all PDF files in a folder
Chunk Text: Splits documents into manageable chunks (1000 characters with 200-character overlap) using LangChain's RecursiveCharacterTextSplitter
Generate Embeddings: Creates vector embeddings for each chunk using OpenAI's text-embedding-ada-002 model
Store Locally: Saves the embeddings in a FAISS index on disk for fast retrieval
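
The chunking step can be sketched in plain Python. This is a simplified fixed-window stand-in, not LangChain's RecursiveCharacterTextSplitter (which also tries to split on natural separators such as paragraph breaks). The function name `chunk_text` is hypothetical; only the 1000/200 sizes come from the description above.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows; neighbours share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With these sizes, any passage shorter than 200 characters that straddles a chunk boundary still appears whole in the following chunk, which is the usual reason for using an overlap.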

2. Chatbot (`chatbot.py`) - Interactive Q&A Interface

The chatbot provides an intelligent interface to query your indexed documents:

Load Index: Loads the pre-computed FAISS vector index from disk
Semantic Search: Converts your question into an embedding and finds the top 3 most similar document chunks
Display Matches: Shows you the similarity scores and text snippets from matched documents
Generate Answer: Uses GPT-4 to synthesize a coherent answer based on the retrieved context
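
The retrieval and prompt-building steps above can be illustrated without any external libraries. This is a rough sketch under stated assumptions: it uses brute-force cosine similarity over toy vectors, whereas the real chatbot delegates the search to FAISS (whose index may use a different distance metric), and the names `top_k` and `build_prompt` and the prompt wording are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    indexed: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k (score, chunk_text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, matches: Sequence[Tuple[float, str]]) -> str:
    """Join the retrieved chunks into a context block for the chat model."""
    context = "\n\n".join(text for _, text in matches)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```

In the real pipeline the query vector would come from the same embedding model that produced the index, since vectors from different models are not comparable.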

Advantages of Storing Embeddings Locally

Storing embeddings locally provides several key benefits:

Speed: Retrieval is significantly faster because document embeddings are pre-computed, so they do not need to be regenerated for each query
Offline Access: After initial creation, the similarity search itself runs locally against the saved index; only embedding your question and generating the answer require OpenAI API calls
Cost Savings: Compute embeddings once and reuse them, saving on API costs
Scalability: Makes it feasible to work with large document collections that would be expensive to process in real-time
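
The "compute once, reuse" idea behind these benefits can be shown with a small disk cache. This is only an illustration: the project persists vectors in a FAISS index, not in a JSON file, and `cached_embed`, its parameters, and the cache layout are hypothetical. `embed_fn` stands in for whatever function calls the embedding API.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


def cached_embed(
    text: str,
    embed_fn: Callable[[str], List[float]],
    cache_path: Path,
) -> List[float]:
    """Return the embedding for `text`, calling `embed_fn` only on a cache miss."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()  # stable key per chunk
    if key not in cache:
        cache[key] = embed_fn(text)  # the only place an API call would happen
        cache_path.write_text(json.dumps(cache))
    return cache[key]
```

Embedding the same chunk a second time then costs a file read instead of an API request.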

Getting Started

Prerequisites

Python 3.8 or higher
OpenAI API key

1. Installation

Clone the repository:

git clone https://github.com/raghavan/PdfGptIndexer.git
cd PdfGptIndexer

Install dependencies:

pip install -r requirements.txt

Or install manually:

pip install langchain langchain-openai langchain-community langchain-text-splitters openai pymupdf faiss-cpu python-dotenv tiktoken

2. Configuration

Create a .env file in the project root and add your OpenAI API key:

OPENAI_API_KEY=your_openai_api_key_here
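
For reference, here is roughly what reading that file involves. The dependency list includes python-dotenv, whose `load_dotenv()` is the usual way to do this; the stdlib-only sketch below is a simplified stand-in with hypothetical function names, shown only to make the file format and the missing-key failure mode concrete.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")  # drop optional quotes
    return values


def require_api_key(env: Dict[str, str]) -> str:
    """Fail early with a clear message if the key is absent or empty."""
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is missing from .env")
    return key
```

Checking for the key at startup gives a clearer error than an authentication failure partway through indexing.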

3. Prepare Your PDFs

Place your PDF files in the pdf/ folder (or any folder of your choice).

Usage

Step 1: Index Your PDFs

Run the indexer to process your PDFs and create the vector database:

python indexer.py

Or specify a custom PDF folder:

python indexer.py /path/to/your/pdfs

Or specify both custom PDF folder and index location:

python indexer.py /path/to/your/pdfs /path/to/save/index

What happens:

Extracts text from all PDFs in the folder
Creates text chunks with metadata
Generates embeddings using OpenAI
Saves the FAISS index to faiss_index/ (or your specified location)

Note: You only need to run this once, or when you add new PDFs to your collection.
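
The three invocation forms above amount to two optional positional arguments. A minimal sketch of that mapping follows. The function name is hypothetical, the `pdf` default is inferred from the "Prepare Your PDFs" step, and the handling of extra arguments is an assumption rather than the script's confirmed behaviour.

```python
from typing import List, Tuple

DEFAULT_PDF_FOLDER = "pdf"          # inferred from "Prepare Your PDFs"
DEFAULT_INDEX_PATH = "faiss_index"  # the default save location named above


def resolve_paths(argv: List[str]) -> Tuple[str, str]:
    """Map `indexer.py [pdf_folder] [index_path]` onto concrete paths."""
    args = argv[1:]  # drop the script name
    if len(args) > 2:
        raise SystemExit("usage: indexer.py [pdf_folder] [index_path]")
    pdf_folder = args[0] if len(args) >= 1 else DEFAULT_PDF_FOLDER
    index_path = args[1] if len(args) >= 2 else DEFAULT_INDEX_PATH
    return pdf_folder, index_path
```

In a script this would typically be called as `resolve_paths(sys.argv)`.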

Step 2: Query Your Documents

Start the interactive chatbot:

python chatbot.py

Or specify a custom index location:

python chatbot.py /path/to/your/index
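
An interactive session of this kind is essentially a read-answer loop. The sketch below is a hypothetical outline, not the actual `chatbot.py`: the exit words, the prompt text, and the `answer_fn` callback (which would wrap retrieval plus answer generation) are all assumptions.

```python
from typing import Callable

EXIT_WORDS = {"quit", "exit"}  # assumed; the real chatbot may use other commands


def run_chat(
    answer_fn: Callable[[str], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> None:
    """Keep answering questions until the user exits or input ends."""
    while True:
        try:
            question = read("Question: ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if question.lower() in EXIT_WORDS:
            break
        if not question:
            continue  # ignore empty lines instead of querying the model
        write(answer_fn(question))
```

Passing `read` and `write` as parameters keeps the loop testable without a terminal.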
