# 📚 RAGIndex: Retrieval Augmented Generation (RAG) with LlamaIndex & Streamlit
Transform your documents into an intelligent Q&A system using LlamaIndex RAG capabilities and Streamlit's interactive interface. Upload PDFs, DOCX, or TXT files and get instant, contextual answers powered by advanced AI embeddings.
https://github.com/user-attachments/assets/a59324bf-1e8a-4dc5-a3b6-7cc32c9ad31f
## 🎯 What is RAGIndex?
RAGIndex is a Retrieval-Augmented Generation (RAG) application that leverages LlamaIndex for document processing and Streamlit for the user interface. It transforms static documents into an interactive knowledge base where you can ask questions and receive accurate, context-aware answers.
## 🔥 Key Features
- 🚀 LlamaIndex-Powered RAG: Advanced document indexing and retrieval using LlamaIndex's state-of-the-art RAG pipeline
- 💻 Streamlit Web Interface: Beautiful, responsive UI built with Streamlit for seamless user experience
- 📄 Advanced Document Processing: Multi-format support (PDF, DOCX, TXT) with intelligent text extraction and metadata preservation
- 🔍 Intelligent PDF Ingestion: Sophisticated PDF processing with page-level tracking, automatic fallback mechanisms, and detailed metadata retention
- 🧠 Smart OCR Pipeline: Automatic OCR processing for image-based PDFs using Tesseract with custom configuration and error handling
- 📊 Document Tracking & Deduplication: Advanced document store with Redis-backed tracking, duplicate detection, and ingestion state management
- ⚡ High-Performance Vector Storage: Redis vector store with semantic search, metadata fields, and optimized retrieval
- 🔄 Real-time Processing: Document processing with progress tracking and detailed ingestion statistics
- 🎨 Modern UI: Clean, intuitive interface with chat-style interactions and comprehensive error feedback
- 🐳 Containerized Deployment: Fully containerized with Docker Compose for easy setup and deployment
## 🏗️ Architecture

### LlamaIndex Integration
- Embedding Model: `bge-base-en-v1.5` for high-quality text embeddings with semantic splitting
- Vector Store: Redis-backed vector storage with metadata fields for source tracking and page numbering
- Document Processing: Semantic-aware text chunking with intelligent overlap and page boundary preservation
- Document Store: Redis document store with duplicate detection and ingestion state tracking
- Query Engine: LlamaIndex conversation engine for contextual responses with source attribution
- Ingestion Pipeline: Advanced pipeline with caching, error handling, and automatic retry mechanisms
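To make the flow through these components concrete, here is a minimal, self-contained sketch of the embed–store–retrieve loop. A bag-of-words stub stands in for `bge-base-en-v1.5` embeddings and a Python list stands in for Redis; the real pipeline uses LlamaIndex abstractions throughout, so treat this as an illustration, not the repo's code:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stub embedding: bag-of-words token counts stand in for real
    # bge-base-en-v1.5 dense vectors.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class InMemoryVectorStore:
    """List-backed stand-in for the Redis vector store."""
    def __init__(self):
        self.nodes = []  # (embedding, text, metadata)

    def add(self, text: str, metadata: dict) -> None:
        self.nodes.append((embed(text), text, metadata))

    def query(self, question: str, top_k: int = 1):
        q = embed(question)
        ranked = sorted(self.nodes, key=lambda n: cosine(q, n[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:top_k]]

store = InMemoryVectorStore()
store.add("Redis is an in-memory data store.", {"source": "redis.pdf", "page": 1})
store.add("Streamlit builds web apps in Python.", {"source": "st.pdf", "page": 3})

text, meta = store.query("What is Redis?")[0]
print(meta["source"], meta["page"])  # redis.pdf 1
```

The metadata dict carried alongside each node is what makes the page-level source attribution described below possible at query time.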
### Streamlit Frontend
- Interactive File Upload: Multi-file upload with progress tracking
- Real-time Chat: Chat-style interface for natural Q&A interactions
- Session Management: Persistent conversation state across interactions
- Responsive Design: Modern, mobile-friendly interface
## 🚀 Quick Start

### Prerequisites
- Docker and Docker Compose
- 4GB+ RAM recommended
- Internet connection for model downloads
### 1. Clone the Repository

```bash
git clone https://github.com/rigvedrs/RAGIndex.git
cd RAGIndex
```
### 2. Environment Setup

Create a `.env` file with your OpenAI API key:

```bash
echo "OPENAI_API_KEY=your_openai_api_key_here" > .env
```
### 3. Launch with Docker

```bash
docker compose up --build
```
### 4. Access the Application

Open your browser and navigate to http://localhost:8501
## 📖 How to Use
1. Upload Documents: Use the sidebar to upload PDF, DOCX, or TXT files
2. Process Documents: Click "Analyze" to process and index your documents
3. Ask Questions: Type your questions in the chat interface
4. Get Answers: Receive contextual answers based on your documents
## 🔍 Advanced PDF Ingestion Features
RAGIndex implements a sophisticated PDF processing pipeline that goes far beyond basic text extraction:
### 📄 Intelligent Document Processing
- Page-Level Tracking: Each page is individually processed with embedded page numbers (`PAGE_NUM=1`, `PAGE_NUM=2`, etc.) for precise source attribution
- Metadata Preservation: Complete document metadata, including source filename, page numbers, and processing timestamps
- Semantic Chunking: Uses LlamaIndex's SemanticSplitterNodeParser for intelligent content-aware splitting rather than naive character limits
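For contrast with semantic splitting, the naive character-limit baseline it replaces looks like the fixed-stride chunker below (`SemanticSplitterNodeParser` instead chooses breakpoints from embedding similarity between adjacent sentences; this sketch only illustrates what `chunk_size` and `chunk_overlap` mean):

```python
def chunk(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Naive fixed-size chunker: slide a chunk_size window over the text,
    stepping by (chunk_size - chunk_overlap) so neighbours share context."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

parts = chunk("abcdefghij" * 30, chunk_size=100, chunk_overlap=10)
print(len(parts), len(parts[0]))  # 4 100
```

Each chunk repeats the last `chunk_overlap` characters of its predecessor, which is the crude analogue of the boundary preservation the semantic splitter does content-aware.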
### 🔄 Multi-Stage Processing Pipeline
- Primary Extraction: PyPDF2-based text extraction for standard PDFs
- OCR Fallback: Automatic detection of image-based PDFs with Tesseract OCR processing
- Format Conversion: DOCX and TXT files are automatically converted to PDF for consistent downstream processing
- Quality Validation: Empty document detection with automatic fallback to OCR processing
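The extract-then-fall-back decision can be sketched independently of PyPDF2 and Tesseract by injecting the extractors as callables; `extract_text` and `run_ocr` below are hypothetical stand-ins, not the repo's actual function names:

```python
from typing import Callable

def extract_with_fallback(path: str,
                          extract_text: Callable[[str], str],
                          run_ocr: Callable[[str], str]) -> tuple[str, str]:
    """Try direct text extraction first; if the result is empty (an
    image-based PDF), fall back to OCR. Returns (text, method used)."""
    text = extract_text(path).strip()
    if text:
        return text, "text-layer"
    return run_ocr(path).strip(), "ocr"

# Stubs standing in for PyPDF2 extraction and Tesseract OCR:
scanned = lambda path: ""                      # image-only PDF: no text layer
ocr = lambda path: "Invoice #42, total $10"    # what OCR would recover
print(extract_with_fallback("scan.pdf", scanned, ocr))
```

In the real pipeline, the text-layer branch would wrap PyPDF2 and the OCR branch pdf2image plus Tesseract, but the empty-result check is the whole quality-validation step.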
### 🛡️ Robust Error Handling & Recovery
- Automatic Retry Logic: Failed document ingestion automatically triggers cleanup and retry mechanisms
- Memory Management: OutOfMemoryError handling with graceful degradation
- Document Store Cleanup: Automatic removal of partially processed documents to maintain data integrity
- Progress Tracking: Real-time feedback with detailed statistics on node generation and ingestion success
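A minimal sketch of the cleanup-and-retry pattern described above (the `ingest` and `cleanup` callables are illustrative, not the repo's actual API):

```python
import time

def ingest_with_retry(ingest, cleanup, attempts: int = 3, delay: float = 0.0):
    """Run ingest(); on failure, call cleanup() to drop partially written
    documents, then retry up to `attempts` times before re-raising."""
    last_err = None
    for _ in range(attempts):
        try:
            return ingest()
        except (MemoryError, RuntimeError) as err:
            last_err = err
            cleanup()          # keep the document store consistent
            time.sleep(delay)  # optional backoff between attempts
    raise last_err

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("ingestion failed")
    return "42 nodes ingested"

result = ingest_with_retry(flaky, cleanup=lambda: None)
print(result)  # 42 nodes ingested
```

Running cleanup before every retry is what prevents half-ingested documents from polluting the store when a later attempt succeeds.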
### 📊 Advanced Document Store Management
- Duplicate Detection: `DocstoreStrategy.DUPLICATES_ONLY` prevents re-processing of identical documents
- Redis-Backed Storage: High-performance document storage with persistence and scalability
- Ingestion Caching: Intelligent caching system to speed up repeated operations
- Metadata Indexing: Searchable metadata fields including source attribution and page references
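Duplicate detection of this kind rests on content fingerprinting, which can be sketched in a few lines (SHA-256 and the `DocTracker` class are illustrative assumptions; the actual strategy is LlamaIndex's `DUPLICATES_ONLY` docstore behaviour):

```python
import hashlib

class DocTracker:
    """Content-hash deduplication: a document whose bytes have already
    been ingested is skipped on subsequent uploads."""
    def __init__(self):
        self.seen: set[str] = set()

    def fingerprint(self, data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def should_ingest(self, data: bytes) -> bool:
        fp = self.fingerprint(data)
        if fp in self.seen:
            return False       # exact duplicate: skip re-processing
        self.seen.add(fp)
        return True

tracker = DocTracker()
print(tracker.should_ingest(b"report v1"))  # True  (first time)
print(tracker.should_ingest(b"report v1"))  # False (duplicate)
print(tracker.should_ingest(b"report v2"))  # True  (content changed)
```

Because the fingerprint is derived from content rather than filename, re-uploading a renamed copy of the same file is still detected as a duplicate.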
### 🎯 Precision Source Attribution
When you ask questions, RAGIndex doesn't just provide answers—it tells you exactly which document and page the information came from, enabling:
- Citation Accuracy: Precise page-level source references
- Content Verification: Easy verification of AI responses against source documents
- Context Preservation: Maintains document structure and page relationships
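Recovering a page-level citation from a retrieved chunk's embedded `PAGE_NUM` marker might look like the following (the `cite` helper is illustrative, not code from the repo):

```python
import re

def cite(chunk_text: str, source: str) -> str:
    """Pull the embedded PAGE_NUM marker out of a retrieved chunk and
    format a page-level citation; fall back to the source name alone."""
    m = re.search(r"PAGE_NUM=(\d+)", chunk_text)
    return f"{source}, page {m.group(1)}" if m else source

chunk = "PAGE_NUM=7 Revenue grew 12% year over year."
print(cite(chunk, "annual_report.pdf"))  # annual_report.pdf, page 7
```

The same page number can then drive the direct page display in the UI, letting a reader jump straight to the cited passage.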
## 🛠️ Technology Stack

### Core Technologies
- LlamaIndex: Advanced RAG framework for document indexing and retrieval
- Streamlit: Modern web app framework for data science and AI
- Redis: In-memory vector database for high-performance search
- HuggingFace Transformers: Pre-trained embedding models
### Document Processing & Ingestion
- PyPDF2: Primary PDF text extraction with page-level tracking
- Tesseract OCR: Intelligent OCR fallback for image-based PDFs with custom configuration
- pdf2image: High-quality PDF to image conversion for OCR processing
- python-docx: DOCX file processing with automatic PDF conversion
- PyMuPDF: Advanced PDF processing and metadata extraction
- Document Tracking: Page number injection, source attribution, and metadata preservation
- Error Handling: Robust fallback mechanisms and automatic retry logic
- Deduplication: Document fingerprinting and duplicate prevention system
### AI & ML
- OpenAI API: Large language model integration
- sentence-transformers: Text embedding generation
- NLTK: Natural language processing utilities
## ⚙️ Configuration

### Embedding Model Settings

```toml
[embed_model]
model_name = "BAAI/bge-base-en-v1.5"
cache_folder = "/RAGIndex/store/models"
embed_batch_size = 1
```
### Document Chunking

```toml
[transformations]
chunk_size = 1000
chunk_overlap = 100
```
### Redis Configuration

```toml
[redis]
host_name = "redis"
port_no = 6379
doc_store_name = "DocStore_v1"
vector_index_name = "VecStore_v1"
```
## 🔧 Advanced Usage

### Custom Embedding Models

Replace the embedding model in `config.toml`:

```toml
model_name = "your-custom-huggingface-model"
```
### Scaling with Docker

For production deployment with multiple instances:

```bash
docker compose up -d --scale docqna=3
```
Note: a single Streamlit instance offers limited concurrency. Run multiple instances behind a load balancer (e.g., nginx) to handle more simultaneous users.
### API Integration
The application can be extended with REST API endpoints for programmatic access.
## 🧪 Development

### Local Development Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Start Redis
docker run -d -p 6379:6379 redis/redis-stack-server:latest

# Run Streamlit app
streamlit run src/app.py
```
### Project Structure

```
RAGIndex/
├── src/
│   ├── app.py              # Main Streamlit application
│   └── RAGIndex/
│       ├── chat/           # LlamaIndex conversation engine
│       ├── pipeline/       # Document processing pipeline
│       ├── pdf_ingest/     # PDF processing utilities
│       └── stcomp/         # Streamlit components
├── config.toml             # Application configuration
├── requirements.txt        # Python dependencies
├── docker-compose.yml      # Docker deployment
└── Dockerfile              # Container definition
```
## 🤝 Contributing
We welcome contributions!
### Development Workflow
- Fork the repository
- C
