📚 RAGIndex: Retrieval Augmented Generation (RAG) with LlamaIndex & Streamlit

Streamlit LlamaIndex Python Redis Docker

Transform your documents into an intelligent Q&A system using LlamaIndex RAG capabilities and Streamlit's interactive interface. Upload PDFs, DOCX, or TXT files and get instant, contextual answers powered by advanced AI embeddings.

https://github.com/user-attachments/assets/a59324bf-1e8a-4dc5-a3b6-7cc32c9ad31f

🎯 What is RAGIndex?

RAGIndex is a Retrieval-Augmented Generation (RAG) application that leverages LlamaIndex for document processing and Streamlit for the user interface. It transforms static documents into an interactive knowledge base where you can ask questions and receive accurate, context-aware answers.

🔥 Key Features

  • 🚀 LlamaIndex-Powered RAG: Advanced document indexing and retrieval using LlamaIndex's state-of-the-art RAG pipeline
  • 💻 Streamlit Web Interface: Beautiful, responsive UI built with Streamlit for seamless user experience
  • 📄 Advanced Document Processing: Multi-format support (PDF, DOCX, TXT) with intelligent text extraction and metadata preservation
  • 🔍 Intelligent PDF Ingestion: Sophisticated PDF processing with page-level tracking, automatic fallback mechanisms, and detailed metadata retention
  • 🧠 Smart OCR Pipeline: Automatic OCR processing for image-based PDFs using Tesseract with custom configuration and error handling
  • 📊 Document Tracking & Deduplication: Advanced document store with Redis-backed tracking, duplicate detection, and ingestion state management
  • ⚡ High-Performance Vector Storage: Redis vector store with semantic search, metadata fields, and optimized retrieval
  • 🔄 Real-time Processing: Document processing with progress tracking and detailed ingestion statistics
  • 🎨 Modern UI: Clean, intuitive interface with chat-style interactions and comprehensive error feedback
  • 🐳 Containerized Deployment: Fully containerized with Docker Compose for easy setup and deployment

🏗️ Architecture

LlamaIndex Integration

  • Embedding Model: bge-base-en-v1.5 for high-quality text embeddings with semantic splitting
  • Vector Store: Redis-backed vector storage with metadata fields for source tracking and page numbering
  • Document Processing: Semantic-aware text chunking with intelligent overlap and page boundary preservation
  • Document Store: Redis document store with duplicate detection and ingestion state tracking
  • Query Engine: LlamaIndex conversation engine for contextual responses with source attribution
  • Ingestion Pipeline: Advanced pipeline with caching, error handling, and automatic retry mechanisms

Streamlit Frontend

  • Interactive File Upload: Multi-file upload with progress tracking
  • Real-time Chat: Chat-style interface for natural Q&A interactions
  • Session Management: Persistent conversation state across interactions
  • Responsive Design: Modern, mobile-friendly interface

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose
  • 4GB+ RAM recommended
  • Internet connection for model downloads

1. Clone the Repository

git clone https://github.com/rigvedrs/RAGIndex.git
cd RAGIndex

2. Environment Setup

Create a .env file with your OpenAI API key:

echo "OPENAI_API_KEY=your_openai_api_key_here" > .env

3. Launch with Docker

docker compose up --build

4. Access the Application

Open your browser and navigate to: http://localhost:8501

📖 How to Use

  1. Upload Documents: Use the sidebar to upload PDF, DOCX, or TXT files
  2. Process Documents: Click "Analyze" to process and index your documents
  3. Ask Questions: Type your questions in the chat interface
  4. Get Answers: Receive contextual answers based on your documents

🔍 Advanced PDF Ingestion Features

RAGIndex implements a sophisticated PDF processing pipeline that goes far beyond basic text extraction:

📄 Intelligent Document Processing

  • Page-Level Tracking: Each page is individually processed with embedded page numbers (PAGE_NUM=1, PAGE_NUM=2, etc.) for precise source attribution
  • Metadata Preservation: Complete document metadata including source filename, page numbers, and processing timestamps
  • Semantic Chunking: Uses LlamaIndex's SemanticSplitterNodeParser for intelligent content-aware splitting rather than naive character limits
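The page-level tracking described above can be sketched in a few lines. Note that `tag_pages` is a hypothetical helper name for illustration, not a function from the RAGIndex codebase; only the `PAGE_NUM=<n>` marker convention comes from the README itself.

```python
def tag_pages(pages):
    """Prefix each extracted page's text with an embedded page marker.

    `pages` is a list of raw page strings; the marker format
    (PAGE_NUM=1, PAGE_NUM=2, ...) follows the convention described above,
    so every downstream chunk can be traced back to its page.
    """
    return [f"PAGE_NUM={i}\n{text}" for i, text in enumerate(pages, start=1)]

# Two extracted pages become individually attributable units.
tagged = tag_pages(["First page text.", "Second page text."])
```

Embedding the marker directly in the text (in addition to metadata) means the page reference survives any chunking step that splits or merges text.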

🔄 Multi-Stage Processing Pipeline

  1. Primary Extraction: PyPDF2-based text extraction for standard PDFs
  2. OCR Fallback: Automatic detection of image-based PDFs with Tesseract OCR processing
  3. Format Conversion: DOCX and TXT files automatically converted to PDF format for consistent processing
  4. Quality Validation: Empty document detection with automatic fallback to OCR processing
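The extraction-then-OCR-fallback logic in steps 1, 2, and 4 can be sketched as below. The extractors are injected as callables to keep the sketch library-agnostic (the project itself uses PyPDF2 for the text layer and Tesseract for OCR); `extract_with_fallback` and `min_chars` are illustrative names, not the project's API.

```python
def extract_with_fallback(pdf_bytes, primary_extract, ocr_extract, min_chars=20):
    """Try text-layer extraction first; fall back to OCR when the result
    is empty or suspiciously short, as happens with image-only PDFs."""
    text = primary_extract(pdf_bytes) or ""
    if len(text.strip()) >= min_chars:
        return text, "text-layer"
    return ocr_extract(pdf_bytes), "ocr"

# Simulate an image-based PDF whose text layer comes back empty.
text, method = extract_with_fallback(
    b"%PDF-...",
    primary_extract=lambda b: "",              # text-layer extraction finds nothing
    ocr_extract=lambda b: "OCR recovered text",
)
```

The `min_chars` threshold is what turns step 4's "empty document detection" into a concrete rule: a near-empty extraction is treated the same as a failed one.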

🛡️ Robust Error Handling & Recovery

  • Automatic Retry Logic: Failed document ingestion automatically triggers cleanup and retry mechanisms
  • Memory Management: OutOfMemoryError handling with graceful degradation
  • Document Store Cleanup: Automatic removal of partially processed documents to maintain data integrity
  • Progress Tracking: Real-time feedback with detailed statistics on node generation and ingestion success
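The cleanup-then-retry behavior above can be sketched as follows. All names here (`ingest_with_retry`, `cleanup`) are illustrative; in the real app this logic is wired into the Redis-backed document store rather than passed as callables.

```python
import time

def ingest_with_retry(ingest, cleanup, doc_id, retries=2, delay=0.0):
    """Run ingest(doc_id); on failure, remove any partially written
    state via cleanup(doc_id), then retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return ingest(doc_id)
        except (MemoryError, RuntimeError):
            cleanup(doc_id)          # keep the store free of partial documents
            if attempt == retries:
                raise                # out of retries: surface the error
            time.sleep(delay)

calls = []
def flaky_ingest(doc_id):
    calls.append("ingest")
    if len(calls) < 2:               # the first attempt fails
        raise RuntimeError("transient failure")
    return f"{doc_id}: ok"

result = ingest_with_retry(flaky_ingest, lambda d: calls.append("cleanup"), "doc1")
```

Running cleanup before every retry is what preserves the data-integrity guarantee: a retried ingest always starts from a clean slate rather than appending to half-written state.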

📊 Advanced Document Store Management

  • Duplicate Detection: DocstoreStrategy.DUPLICATES_ONLY prevents re-processing of identical documents
  • Redis-Backed Storage: High-performance document storage with persistence and scalability
  • Ingestion Caching: Intelligent caching system to speed up repeated operations
  • Metadata Indexing: Searchable metadata fields including source attribution and page references
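Conceptually, duplicate detection amounts to fingerprinting each document by content hash and skipping known fingerprints. The sketch below illustrates the idea with stdlib hashing only; in RAGIndex the actual mechanism is LlamaIndex's `DocstoreStrategy.DUPLICATES_ONLY` backed by Redis, and `DedupStore` is a hypothetical class, not part of the project.

```python
import hashlib

class DedupStore:
    """Minimal duplicate-detection sketch: one SHA-256 fingerprint
    per document, duplicates are skipped instead of re-ingested."""

    def __init__(self):
        self._seen = set()

    def add(self, content: bytes) -> bool:
        """Return True if the document was new and accepted."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return False             # duplicate: skip re-processing
        self._seen.add(digest)
        return True

store = DedupStore()
first = store.add(b"quarterly report v1")
dup = store.add(b"quarterly report v1")   # identical content, rejected
```

Because the fingerprint is derived from content rather than filename, re-uploading the same file under a different name is still caught as a duplicate.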

🎯 Precision Source Attribution

When you ask questions, RAGIndex doesn't just provide answers: it tells you exactly which document and page the information came from, enabling:

  • Citation Accuracy: Precise page-level source references
  • Content Verification: Easy verification of AI responses against source documents
  • Context Preservation: Maintains document structure and page relationships
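Turning chunk metadata into page-level citations can look like the sketch below. The metadata keys (`source`, `page_num`) and the function name are illustrative assumptions about the stored fields, not the project's exact schema.

```python
def format_citations(chunks):
    """Collapse retrieved chunks into unique 'file, p. N' citations,
    preserving retrieval order so the top source is listed first."""
    seen, out = set(), []
    for meta in chunks:
        ref = (meta["source"], meta["page_num"])
        if ref not in seen:          # the same page may be retrieved twice
            seen.add(ref)
            out.append(f'{meta["source"]}, p. {meta["page_num"]}')
    return out

cites = format_citations([
    {"source": "report.pdf", "page_num": 3},
    {"source": "report.pdf", "page_num": 3},   # duplicate retrieval, collapsed
    {"source": "notes.txt", "page_num": 1},
])
```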

🛠️ Technology Stack

Core Technologies

  • LlamaIndex: Advanced RAG framework for document indexing and retrieval
  • Streamlit: Modern web app framework for data science and AI
  • Redis: In-memory vector database for high-performance search
  • HuggingFace Transformers: Pre-trained embedding models

Document Processing & Ingestion

  • PyPDF2: Primary PDF text extraction with page-level tracking
  • Tesseract OCR: Intelligent OCR fallback for image-based PDFs with custom configuration
  • pdf2image: High-quality PDF to image conversion for OCR processing
  • python-docx: DOCX file processing with automatic PDF conversion
  • PyMuPDF: Advanced PDF processing and metadata extraction
  • Document Tracking: Page number injection, source attribution, and metadata preservation
  • Error Handling: Robust fallback mechanisms and automatic retry logic
  • Deduplication: Document fingerprinting and duplicate prevention system

AI & ML

  • OpenAI API: Large language model integration
  • sentence-transformers: Text embedding generation
  • NLTK: Natural language processing utilities

⚙️ Configuration

Embedding Model Settings

[embed_model]
model_name = "BAAI/bge-base-en-v1.5"
cache_folder = "/RAGIndex/store/models"
embed_batch_size = 1

Document Chunking

[transformations]
chunk_size = 1000
chunk_overlap = 100
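The two chunking parameters above are easiest to see in a naive character-window splitter: each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so adjacent chunks share `chunk_overlap` characters. Note this sketch is only an illustration of the parameters; RAGIndex's actual splitter is LlamaIndex's semantic, content-aware one.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Naive sliding-window chunking: windows of chunk_size characters,
    each advanced by chunk_size - chunk_overlap from the last."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# 2500 characters with the config values above yields three chunks,
# each sharing its first 100 characters with the previous chunk's tail.
chunks = chunk_text("a" * 2500, chunk_size=1000, chunk_overlap=100)
```

The overlap exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk, which improves retrieval recall.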

Redis Configuration

[redis]
host_name = 'redis'
port_no = 6379
doc_store_name = "DocStore_v1"
vector_index_name = "VecStore_v1"

🔧 Advanced Usage

Custom Embedding Models

Replace the embedding model in config.toml:

model_name = "your-custom-huggingface-model"

Scaling with Docker

For production deployment with multiple instances:

docker compose up -d --scale docqna=3

Note: Streamlit apps are single-threaded. Multiple instances can be run behind a load balancer (e.g., nginx) for better concurrency handling.

API Integration

The application can be extended with REST API endpoints for programmatic access.

🧪 Development

Local Development Setup

# Install dependencies
pip install -r requirements.txt

# Start Redis
docker run -d -p 6379:6379 redis/redis-stack-server:latest

# Run Streamlit app
streamlit run src/app.py

Project Structure

RAGIndex/
├── src/
│   ├── app.py                 # Main Streamlit application
│   └── RAGIndex/
│       ├── chat/              # LlamaIndex conversation engine
│       ├── pipeline/          # Document processing pipeline
│       ├── pdf_ingest/        # PDF processing utilities
│       └── stcomp/            # Streamlit components
├── config.toml                # Application configuration
├── requirements.txt           # Python dependencies
├── docker-compose.yml         # Docker deployment
└── Dockerfile                 # Container definition

🤝 Contributing

We welcome contributions!

Development Workflow

  1. Fork the repository
  2. C
No findings