# 📚 RAGIndex: Retrieval Augmented Generation (RAG) with LlamaIndex & Streamlit
Transform your documents into an intelligent Q&A system using LlamaIndex RAG capabilities and Streamlit's interactive interface. Upload PDFs, DOCX, or TXT files and get instant, contextual answers powered by advanced AI embeddings.
https://github.com/user-attachments/assets/a59324bf-1e8a-4dc5-a3b6-7cc32c9ad31f
## 🎯 What is RAGIndex?
RAGIndex is a Retrieval-Augmented Generation (RAG) application that leverages LlamaIndex for document processing and Streamlit for the user interface. It transforms static documents into an interactive knowledge base where you can ask questions and receive accurate, context-aware answers.
## 🔥 Key Features
- 🚀 LlamaIndex-Powered RAG: Advanced document indexing and retrieval using LlamaIndex's state-of-the-art RAG pipeline
- 💻 Streamlit Web Interface: Beautiful, responsive UI built with Streamlit for seamless user experience
- 📄 Advanced Document Processing: Multi-format support (PDF, DOCX, TXT) with intelligent text extraction and metadata preservation
- 🔍 Intelligent PDF Ingestion: Sophisticated PDF processing with page-level tracking, automatic fallback mechanisms, and detailed metadata retention
- 🧠 Smart OCR Pipeline: Automatic OCR processing for image-based PDFs using Tesseract with custom configuration and error handling
- 📊 Document Tracking & Deduplication: Advanced document store with Redis-backed tracking, duplicate detection, and ingestion state management
- ⚡ High-Performance Vector Storage: Redis vector store with semantic search, metadata fields, and optimized retrieval
- 🔄 Real-time Processing: Document processing with progress tracking and detailed ingestion statistics
- 🎨 Modern UI: Clean, intuitive interface with chat-style interactions and comprehensive error feedback
- 🐳 Containerized Deployment: Fully containerized with Docker Compose for easy setup and deployment
## 🏗️ Architecture

### LlamaIndex Integration
- Embedding Model: `bge-base-en-v1.5` for high-quality text embeddings with semantic splitting
- Vector Store: Redis-backed vector storage with metadata fields for source tracking and page numbering
- Document Processing: Semantic-aware text chunking with intelligent overlap and page boundary preservation
- Document Store: Redis document store with duplicate detection and ingestion state tracking
- Query Engine: LlamaIndex conversation engine for contextual responses with source attribution
- Ingestion Pipeline: Advanced pipeline with caching, error handling, and automatic retry mechanisms
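To make the flow through these components concrete, here is a minimal, self-contained sketch of the embed–store–retrieve loop. A bag-of-words stub stands in for `bge-base-en-v1.5` embeddings and a Python list stands in for Redis; the real pipeline uses LlamaIndex abstractions throughout, so treat this as an illustration, not the repo's code:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stub embedding: bag-of-words token counts stand in for real
    # bge-base-en-v1.5 dense vectors.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class InMemoryVectorStore:
    """List-backed stand-in for the Redis vector store."""
    def __init__(self):
        self.nodes = []  # (embedding, text, metadata)

    def add(self, text: str, metadata: dict) -> None:
        self.nodes.append((embed(text), text, metadata))

    def query(self, question: str, top_k: int = 1):
        q = embed(question)
        ranked = sorted(self.nodes, key=lambda n: cosine(q, n[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:top_k]]

store = InMemoryVectorStore()
store.add("Redis is an in-memory data store.", {"source": "redis.pdf", "page": 1})
store.add("Streamlit builds web apps in Python.", {"source": "st.pdf", "page": 3})

text, meta = store.query("What is Redis?")[0]
print(meta["source"], meta["page"])  # redis.pdf 1
```

The metadata dict carried alongside each node is what makes the page-level source attribution described below possible at query time.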
### Streamlit Frontend
- Interactive File Upload: Multi-file upload with progress tracking
- Real-time Chat: Chat-style interface for natural Q&A interactions
- Session Management: Persistent conversation state across interactions
- Responsive Design: Modern, mobile-friendly interface
## 🚀 Quick Start

### Prerequisites
- Docker and Docker Compose
- 4GB+ RAM recommended
- Internet connection for model downloads
### 1. Clone the Repository

```bash
git clone https://github.com/rigvedrs/RAGIndex.git
cd RAGIndex
```
### 2. Environment Setup

Create a `.env` file with your OpenAI API key:

```bash
echo "OPENAI_API_KEY=your_openai_api_key_here" > .env
```
### 3. Launch with Docker

```bash
docker compose up --build
```
### 4. Access the Application

Open your browser and navigate to http://localhost:8501
## 📖 How to Use
1. Upload Documents: Use the sidebar to upload PDF, DOCX, or TXT files
2. Process Documents: Click "Analyze" to process and index your documents
3. Ask Questions: Type your questions in the chat interface
4. Get Answers: Receive contextual answers based on your documents
## 🔍 Advanced PDF Ingestion Features
RAGIndex implements a sophisticated PDF processing pipeline that goes far beyond basic text extraction:
### 📄 Intelligent Document Processing
- Page-Level Tracking: Each page is individually processed with embedded page numbers (`PAGE_NUM=1`, `PAGE_NUM=2`, etc.) for precise source attribution
- Metadata Preservation: Complete document metadata, including source filename, page numbers, and processing timestamps
- Semantic Chunking: Uses LlamaIndex's SemanticSplitterNodeParser for intelligent content-aware splitting rather than naive character limits
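For contrast with semantic splitting, the naive character-limit baseline it replaces looks like the fixed-stride chunker below (`SemanticSplitterNodeParser` instead chooses breakpoints from embedding similarity between adjacent sentences; this sketch only illustrates what `chunk_size` and `chunk_overlap` mean):

```python
def chunk(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Naive fixed-size chunker: slide a chunk_size window over the text,
    stepping by (chunk_size - chunk_overlap) so neighbours share context."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

parts = chunk("abcdefghij" * 30, chunk_size=100, chunk_overlap=10)
print(len(parts), len(parts[0]))  # 4 100
```

Each chunk repeats the last `chunk_overlap` characters of its predecessor, which is the crude analogue of the boundary preservation the semantic splitter does content-aware.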
### 🔄 Multi-Stage Processing Pipeline
- Primary Extraction: PyPDF2-based text extraction for standard PDFs
- OCR Fallback: Automatic detection of image-based PDFs with Tesseract OCR processing
- Format Conversion: DOCX and TXT files are automatically converted to PDF for consistent downstream processing
- Quality Validation: Empty document detection with automatic fallback to OCR processing
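The extract-then-fall-back decision can be sketched independently of PyPDF2 and Tesseract by injecting the extractors as callables; `extract_text` and `run_ocr` below are hypothetical stand-ins, not the repo's actual function names:

```python
from typing import Callable

def extract_with_fallback(path: str,
                          extract_text: Callable[[str], str],
                          run_ocr: Callable[[str], str]) -> tuple[str, str]:
    """Try direct text extraction first; if the result is empty (an
    image-based PDF), fall back to OCR. Returns (text, method used)."""
    text = extract_text(path).strip()
    if text:
        return text, "text-layer"
    return run_ocr(path).strip(), "ocr"

# Stubs standing in for PyPDF2 extraction and Tesseract OCR:
scanned = lambda path: ""                      # image-only PDF: no text layer
ocr = lambda path: "Invoice #42, total $10"    # what OCR would recover
print(extract_with_fallback("scan.pdf", scanned, ocr))
```

In the real pipeline, the text-layer branch would wrap PyPDF2 and the OCR branch pdf2image plus Tesseract, but the empty-result check is the whole quality-validation step.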
### 🛡️ Robust Error Handling & Recovery
- Automatic Retry Logic: Failed document ingestion automatically triggers cleanup and retry mechanisms
- Memory Management: OutOfMemoryError handling with graceful degradation
- Document Store Cleanup: Automatic removal of partially processed documents to maintain data integrity
- Progress Tracking: Real-time feedback with detailed statistics on node generation and ingestion success
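A minimal sketch of the cleanup-and-retry pattern described above (the `ingest` and `cleanup` callables are illustrative, not the repo's actual API):

```python
import time

def ingest_with_retry(ingest, cleanup, attempts: int = 3, delay: float = 0.0):
    """Run ingest(); on failure, call cleanup() to drop partially written
    documents, then retry up to `attempts` times before re-raising."""
    last_err = None
    for _ in range(attempts):
        try:
            return ingest()
        except (MemoryError, RuntimeError) as err:
            last_err = err
            cleanup()          # keep the document store consistent
            time.sleep(delay)  # optional backoff between attempts
    raise last_err

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("ingestion failed")
    return "42 nodes ingested"

result = ingest_with_retry(flaky, cleanup=lambda: None)
print(result)  # 42 nodes ingested
```

Running cleanup before every retry is what prevents half-ingested documents from polluting the store when a later attempt succeeds.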
### 📊 Advanced Document Store Management
- Duplicate Detection: `DocstoreStrategy.DUPLICATES_ONLY` prevents re-processing of identical documents
- Redis-Backed Storage: High-performance document storage with persistence and scalability
- Ingestion Caching: Intelligent caching system to speed up repeated operations
- Metadata Indexing: Searchable metadata fields including source attribution and page references
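Duplicate detection of this kind rests on content fingerprinting, which can be sketched in a few lines (SHA-256 and the `DocTracker` class are illustrative assumptions; the actual strategy is LlamaIndex's `DUPLICATES_ONLY` docstore behaviour):

```python
import hashlib

class DocTracker:
    """Content-hash deduplication: a document whose bytes have already
    been ingested is skipped on subsequent uploads."""
    def __init__(self):
        self.seen: set[str] = set()

    def fingerprint(self, data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def should_ingest(self, data: bytes) -> bool:
        fp = self.fingerprint(data)
        if fp in self.seen:
            return False       # exact duplicate: skip re-processing
        self.seen.add(fp)
        return True

tracker = DocTracker()
print(tracker.should_ingest(b"report v1"))  # True  (first time)
print(tracker.should_ingest(b"report v1"))  # False (duplicate)
print(tracker.should_ingest(b"report v2"))  # True  (content changed)
```

Because the fingerprint is derived from content rather than filename, re-uploading a renamed copy of the same file is still detected as a duplicate.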
### 🎯 Precision Source Attribution
When you ask questions, RAGIndex doesn't just provide answers—it tells you exactly which document and page the information came from, enabling:
- Citation Accuracy: Precise page-level source references
- Content Verification: Easy verification of AI responses against source documents
- Context Preservation: Maintains document structure and page relationships
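Recovering a page-level citation from a retrieved chunk's embedded `PAGE_NUM` marker might look like the following (the `cite` helper is illustrative, not code from the repo):

```python
import re

def cite(chunk_text: str, source: str) -> str:
    """Pull the embedded PAGE_NUM marker out of a retrieved chunk and
    format a page-level citation; fall back to the source name alone."""
    m = re.search(r"PAGE_NUM=(\d+)", chunk_text)
    return f"{source}, page {m.group(1)}" if m else source

chunk = "PAGE_NUM=7 Revenue grew 12% year over year."
print(cite(chunk, "annual_report.pdf"))  # annual_report.pdf, page 7
```

The same page number can then drive the direct page display in the UI, letting a reader jump straight to the cited passage.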
## 🛠️ Technology Stack

### Core Technologies
- LlamaIndex: Advanced RAG framework for document indexing and retrieval
- Streamlit: Modern web app framework for data science and AI
- Redis: In-memory vector database for high-performance search
- HuggingFace Transformers: Pre-trained embedding models
### Document Processing & Ingestion
- PyPDF2: Primary PDF text extraction with page-level tracking
- Tesseract OCR: Intelligent OCR fallback for image-based PDFs with custom configuration
- pdf2image: High-quality PDF to image conversion for OCR processing
- python-docx: DOCX file processing with automatic PDF conversion
- PyMuPDF: Advanced PDF processing and metadata extraction
- Document Tracking: Page number injection, source attribution, and metadata preservation
- Error Handling: Robust fallback mechanisms and automatic retry logic
- Deduplication: Document fingerprinting and duplicate prevention system
### AI & ML
- OpenAI API: Large language model integration
- sentence-transformers: Text embedding generation
- NLTK: Natural language processing utilities
## ⚙️ Configuration

### Embedding Model Settings

```toml
[embed_model]
model_name = "BAAI/bge-base-en-v1.5"
cache_folder = "/RAGIndex/store/models"
embed_batch_size = 1
```
### Document Chunking

```toml
[transformations]
chunk_size = 1000
chunk_overlap = 100
```
### Redis Configuration

```toml
[redis]
host_name = "redis"
port_no = 6379
doc_store_name = "DocStore_v1"
vector_index_name = "VecStore_v1"
```
## 🔧 Advanced Usage

### Custom Embedding Models

Replace the embedding model in `config.toml`:

```toml
model_name = "your-custom-huggingface-model"
```
### Scaling with Docker

For production deployment with multiple instances:

```bash
docker compose up -d --scale docqna=3
```
Note: a single Streamlit instance offers limited concurrency. Run multiple instances behind a load balancer (e.g., nginx) to handle more simultaneous users.
### API Integration
The application can be extended with REST API endpoints for programmatic access.
## 🧪 Development

### Local Development Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Start Redis
docker run -d -p 6379:6379 redis/redis-stack-server:latest

# Run Streamlit app
streamlit run src/app.py
```
### Project Structure

```
RAGIndex/
├── src/
│   ├── app.py              # Main Streamlit application
│   └── RAGIndex/
│       ├── chat/           # LlamaIndex conversation engine
│       ├── pipeline/       # Document processing pipeline
│       ├── pdf_ingest/     # PDF processing utilities
│       └── stcomp/         # Streamlit components
├── config.toml             # Application configuration
├── requirements.txt        # Python dependencies
├── docker-compose.yml      # Docker deployment
└── Dockerfile              # Container definition
```
## 🤝 Contributing
We welcome contributions!
### Development Workflow
- Fork the repository
- C
