# 👁️👁️ VARAG

## Vision Augmented Retrieval and Generation

VARAG (Vision-Augmented Retrieval and Generation) is a vision-first RAG engine that emphasizes vision-based retrieval techniques. It enhances traditional Retrieval-Augmented Generation (RAG) systems by integrating both visual and textual data through Vision-Language models.
## Supported Retrieval Techniques
VARAG supports a wide range of retrieval techniques, optimized for different use cases, including text, image, and multimodal document retrieval. Below are the primary techniques supported:
<details>
<summary>Simple RAG (with OCR)</summary>

Simple RAG (Retrieval-Augmented Generation) is a straightforward and efficient approach to extracting text from documents and feeding it into a retrieval pipeline. VARAG incorporates Optical Character Recognition (OCR) through Docling, making it possible to process and index scanned PDFs or images. Once the text is extracted and indexed, queries can be matched to relevant passages in the document, providing a strong foundation for generating responses grounded in the extracted information. This technique is ideal for text-heavy documents such as scanned books, contracts, and research papers, and can be paired with Large Language Models (LLMs) to produce contextually aware outputs.

</details>

<details>
<summary>Vision RAG</summary>

Vision RAG extends traditional RAG techniques by retrieving visual information, bridging the gap between text and images. Using a powerful cross-modal embedding model such as JinaCLIP (a CLIP variant developed by Jina AI), both text and images are encoded into a shared vector space, enabling similarity search across modalities: images can be queried alongside text. Vision RAG is particularly useful for document analysis tasks where visual components (e.g., figures, diagrams, images) are as important as the textual content. It is also effective for tasks like image captioning or product-description generation, where understanding and correlating text with visual elements is critical.

</details>

<details>
<summary>ColPali RAG</summary>

ColPali RAG is a cutting-edge approach that simplifies the traditional retrieval pipeline by embedding document pages directly as images rather than converting them into text. It leverages PaliGemma, a Vision-Language Model (VLM) from the Google Zürich team, which encodes entire document pages into vector embeddings, treating page layout and visual elements as part of the retrieval process. Using a late interaction mechanism inspired by ColBERT (Contextualized Late Interaction over BERT), ColPali RAG enables token-level matching between user queries and document patches. This ensures high retrieval accuracy while maintaining reasonable indexing and querying speeds. It is particularly beneficial for visually rich documents, such as infographics, tables, and complex layouts, where conventional text-based retrieval struggles.

</details>

<details>
<summary>Hybrid ColPali RAG</summary>

Hybrid ColPali RAG combines the strengths of image embeddings and ColPali's late interaction mechanism. The system first performs a coarse retrieval step using image embeddings (e.g., from a model like JinaCLIP) to retrieve the top-k relevant document pages. In a second pass, it re-ranks these k pages using ColPali's late interaction mechanism to identify the final set of most relevant pages based on both visual and textual information. This hybrid approach is particularly useful when documents contain a mixture of complex visuals and detailed text, allowing the system to leverage both content types for highly accurate document retrieval.

</details>

## 🚀 Getting Started with VARAG
Follow these steps to set up VARAG:
### 1. Clone the Repository

```shell
git clone https://github.com/adithya-s-k/VARAG
cd VARAG
```
### 2. Set Up Environment

Create and activate a virtual environment using Conda:

```shell
conda create -n varag-venv python=3.10
conda activate varag-venv
```
### 3. Install Dependencies

Install the required packages using pip:

```shell
pip install -e .
# or
poetry install
```

To install OCR dependencies:

```shell
pip install -e ".[ocr]"
```
## Try Out VARAG

Explore VARAG with our interactive playground! It lets you seamlessly compare various RAG (Retrieval-Augmented Generation) solutions, from data ingestion to retrieval.

You can run it locally or on Google Colab:

```shell
python demo.py --share
```

This makes it easy to test and experiment with different approaches in real time.
## 🚀 Cloud Deployment with Modal

VARAG provides ready-to-use Modal deployment configurations for running ColPali comparison demos in the cloud with GPU acceleration.

### Available Deployment Options
| App | Description | Command |
|-----|-------------|---------|
| `modal_demo_heatmaps_comparing_colpali_models` | ColPali model comparison with heatmaps and similarity analysis on Modal GPU | `python -m modal run examples/inference_colpali/modal_demo_heatmaps_comparing_colpali_models.py::comparision_demo` |
### Prerequisites

1. Install Modal:
   ```shell
   pip install modal
   ```
2. Set up your Modal account:
   ```shell
   modal setup
   ```
3. Configure secrets: set up `hf-secret` in the Modal dashboard with your HuggingFace token.
### Quick Deploy

```shell
# Deploy the ColPali model comparison demo with heatmaps
python -m modal run examples/inference_colpali/modal_demo_heatmaps_comparing_colpali_models.py::comparision_demo
```
### Features
- GPU Acceleration: Automatic L4 GPU provisioning
- Model Caching: Persistent volume for fast model loading
- Memory Optimization: Configured for T4/L4 GPU constraints
- Auto-scaling: Pay-per-use with automatic scaling to zero
- Public Access: Generates shareable Gradio URLs
### Environment Variables

The deployment uses a `.env` file for configuration. Key variables:

```shell
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here  # Optional
HF_TOKEN=your_huggingface_token_here     # Set in Modal secrets
```
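For reference, the `KEY=value` format above is the standard `.env` convention (the project most likely loads it with a library such as python-dotenv). A minimal pure-Python sketch of how such a file is parsed, with inline `#` comments stripped:

```python
import os
import tempfile

def load_env(path: str) -> dict:
    """Minimal .env loader sketch (not VARAG's actual loader):
    parses KEY=value lines, ignores blank lines and '#' comments,
    and exports the pairs without overriding existing variables."""
    parsed = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments
            if "=" in line:
                key, _, value = line.partition("=")
                parsed[key.strip()] = value.strip()
                os.environ.setdefault(key.strip(), value.strip())
    return parsed

# Demo with a throwaway file mirroring the variables above.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("GEMINI_API_KEY=your_gemini_api_key_here\n")
    f.write("# comments are ignored\n")
    f.write("HF_TOKEN=tok  # set in Modal secrets\n")
    env_path = f.name

env = load_env(env_path)
```

The `setdefault` call mirrors the common convention that variables already set in the shell take precedence over the `.env` file.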
## How VARAG is structured

Each RAG technique is structured as a class, abstracting all components and offering the following methods:

```python
from varag.rag import {{RAGTechnique}}

ragTechnique = RAGTechnique()
ragTechnique.index(
    "/path_to_data_source",
    other_relevant_data
)
results = ragTechnique.search("query", top_k=5)
# These results can be passed into the LLM / VLM of your choice
```
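To make the shared `index`/`search` interface concrete, here is a self-contained sketch of what such a class hierarchy could look like. The names and the toy keyword-overlap scoring are illustrative assumptions, not VARAG's actual internals (which use embeddings and LanceDB):

```python
from abc import ABC, abstractmethod

class RAGTechnique(ABC):
    """Illustrative sketch of the shared interface; not VARAG's real code."""

    @abstractmethod
    def index(self, data_source: str, **kwargs) -> None:
        """Ingest documents from a data source."""

    @abstractmethod
    def search(self, query: str, top_k: int = 5) -> list:
        """Return the top_k most relevant results for a query."""

class KeywordRAG(RAGTechnique):
    """Toy technique: naive keyword overlap instead of vector embeddings."""

    def __init__(self):
        self.docs = []

    def index(self, data_source: str, **kwargs) -> None:
        # A real technique would read files from data_source; this demo
        # accepts pre-loaded passages via a keyword argument instead.
        self.docs.extend(kwargs.get("passages", []))

    def search(self, query: str, top_k: int = 5) -> list:
        terms = set(query.lower().split())
        scored = [
            {"text": d, "score": len(terms & set(d.lower().split()))}
            for d in self.docs
        ]
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]

rag = KeywordRAG()
rag.index("/path_to_data_source", passages=[
    "ColPali embeds document pages as images",
    "OCR extracts text from scanned PDFs",
])
results = rag.search("scanned PDF OCR", top_k=1)
```

Because every technique exposes the same two methods, swapping Simple RAG for ColPali RAG is a one-line change in calling code.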
## Why Abstract So Much?
I initially set out to rapidly test and evaluate different Vision-based RAG (Retrieval-Augmented Generation) systems to determine which one best fits my use case. I wasn’t aiming to create a framework or library, but it naturally evolved into one.
The abstraction is designed to simplify the process of experimenting with different RAG paradigms without complicating compatibility between components. To keep things straightforward, LanceDB was chosen as the vector store due to its ease of use and high customizability.
This paradigm is inspired by the Byaldi repo by Answer.ai.
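As a numerical illustration of the ColBERT-style late interaction (MaxSim) scoring that ColPali RAG and Hybrid ColPali RAG rely on, here is a toy sketch with hand-picked 2-d vectors. Real models use high-dimensional normalized tensors; the function name and data are assumptions for the example only:

```python
def maxsim_score(query_tokens, doc_patches):
    """Late interaction (MaxSim): for each query-token embedding, take
    its best dot product against every document-patch embedding, then
    sum those maxima over all query tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in doc_patches) for q in query_tokens)

# Two toy "pages", each represented by a bag of 2-d patch embeddings.
page_a = [[1.0, 0.0], [0.0, 1.0]]
page_b = [[-1.0, 0.0], [0.0, -1.0]]
query = [[0.9, 0.1], [0.2, 0.8]]  # two query-token embeddings

scores = {name: maxsim_score(query, page)
          for name, page in [("page_a", page_a), ("page_b", page_b)]}
best = max(scores, key=scores.get)
```

In the hybrid variant, this scoring is applied only to the top-k pages returned by a cheaper image-embedding search, keeping the expensive token-level pass small.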
## Techniques and Notebooks

| Technique | Notebook | Demo |
|-----------|----------|------|
| Simple RAG | | `simpleRAG.py` |
| Vision RAG | | `visionDemo.py` |
| ColPali RAG | | `colpaliDemo.py` |
| Hybrid ColPali RAG | | |