🧠 MultiModal Agentic RAG
A Multimodal, Agentic RAG System for Intelligent Document Processing

📘 Overview
The MultiModal Agentic RAG is designed to intelligently process and retrieve information from complex, unstructured documents by leveraging a Retrieval-Augmented Generation (RAG) pipeline with multimodal (text, table, and image) understanding.
It integrates advanced models for document structure analysis, semantic chunking, and agentic reasoning, enabling context-aware, explainable question answering.
This system is demonstrated using the Oxford Textbook of Medicine (6th Edition), chosen for its diverse complexity including detailed tables, diagnostic diagrams, medical images, and multi-level content organization.
🎯 Project Goals
- Convert unstructured PDFs into structured and searchable formats (Markdown & JSON).
- Preserve document hierarchy, semantic relationships, and contextual information.
- Analyze multimodal content—text, tables, and diagnostic images—through specialized models.
- Build an intelligent retrieval system using RAG with hierarchical context preservation.
- Enable an accurate conversational chatbot that answers using information retrieved from text, tables, and images.
- Implement modular, maintainable system architecture following Clean Architecture principles.
🏗️ System Architecture
This project follows Clean Architecture to ensure high scalability, maintainability, and testability. Dependencies flow inward—from frameworks toward the core business logic.
Layers
| Layer | Description |
| --- | --- |
| Entities | Core data models and validation, independent of frameworks. |
| Use Cases | Implements the application's business logic (indexing and inference pipelines). |
| Interface Adapters | Bridges entities and external interfaces (e.g., APIs, vector stores). |
| Frameworks & Drivers | Outer layer containing frameworks (PTB, GenAI, etc.). |
Key Design Principles
- Dependency Rule: Inner layers never depend on outer layers.
- Dependency Inversion: Interfaces are defined at the core, implementations in outer layers.
- Dependency Injection: Managed via the `dependency-injector` library for modular configuration.
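These principles can be sketched in plain Python. This is a hedged illustration, not the project's actual classes: the `Retriever` and `AnswerQuestion` names are hypothetical. The core defines an interface, an outer layer implements it, and the composition root injects the implementation inward.

```python
from typing import Protocol

# Core layer: the interface is defined here, with no framework imports.
class Retriever(Protocol):
    def search(self, query: str, k: int) -> list[str]: ...

# Use-case layer: depends only on the Retriever abstraction.
class AnswerQuestion:
    def __init__(self, retriever: Retriever):
        self.retriever = retriever  # injected, never constructed here

    def run(self, question: str) -> list[str]:
        return self.retriever.search(question, k=3)

# Outer layer: a concrete implementation (in-memory stub standing in for Elasticsearch).
class InMemoryRetriever:
    def __init__(self, docs: list[str]):
        self.docs = docs

    def search(self, query: str, k: int) -> list[str]:
        hits = [d for d in self.docs if query.lower() in d.lower()]
        return hits[:k]

# Composition root: wire the concrete class into the use case.
use_case = AnswerQuestion(
    InMemoryRetriever(["Diabetes symptoms include thirst.", "Unrelated text."])
)
print(use_case.run("diabetes"))  # → ['Diabetes symptoms include thirst.']
```

Swapping `InMemoryRetriever` for a real Elasticsearch adapter changes nothing in the core, which is exactly what the Dependency Rule buys.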
🧩 Core Components
🧱 1. Docling-Based Document Processing
The system uses Docling, an open-source library developed by IBM Research, to parse and structure PDFs.
Key Features of the Indexing Pipeline:
- Hierarchical structure recognition (titles, sections, paragraphs).
- Accurate table extraction via TableFormer.
- Layout analysis using Heron (RT-DETR + ResNet-50).
- Optical Character Recognition (OCR) with EasyOCR (CRAFT + CRNN).
- Image captioning with Gemini 2.5-Flash to enable content-based image retrieval. For example, when text discusses a skin condition and includes a diagnostic image, the image description enables retrieval based on visual content, enriching responses with related diagnostic information.
- Intelligent contextual chunking that preserves semantic relations.
- Structured output in Markdown and JSON for downstream processing.
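As an illustration of what a hierarchy-preserving chunk might look like downstream (the class and field names below are hypothetical, not Docling's actual output schema), each chunk keeps its section path so the embedder sees the chunk in context:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Chunk:
    """One semantic chunk plus the hierarchy metadata that keeps it retrievable in context."""
    text: str
    section_path: list[str]     # e.g. ["Endocrinology", "Type 2 diabetes"]
    page: int
    modality: str = "text"      # "text" | "table" | "image"
    caption: str | None = None  # image/table caption, if any

    def contextualized(self) -> str:
        # Prefix the chunk with its section path before embedding/indexing.
        return " > ".join(self.section_path) + "\n" + self.text

chunk = Chunk(
    text="Polyuria and polydipsia are common presenting symptoms.",
    section_path=["Endocrinology", "Type 2 diabetes", "Clinical features"],
    page=1842,
)
print(chunk.contextualized())
```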
🧠 2. Agentic RAG System

The RAG system uses LangGraph to create an agentic reasoning workflow, where each node represents a step in the retrieval and generation process.
Key Capabilities:
- Intelligent decision-making to determine when to retrieve or directly answer.
- Context-aware document retrieval with Elasticsearch.
- Query rewriting for unanswerable or ambiguous questions (e.g., "symptoms of type 2 diabetes mellitus" → "symptoms of type 2 diabetes mellitus diagnosis criteria").
- Answer generation with detailed source attribution.
- Visual grounding: retrieves and attaches related images and metadata to answers.
- Context preservation across dialogue sessions.
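The control flow above can be sketched as a small state machine. This is a hedged stand-in for the actual LangGraph graph: the node functions, the keyword-overlap retrieval stub, and the one-retry rewrite policy are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    docs: list = field(default_factory=list)
    rewrites: int = 0

def retrieve(state, index):
    # Stub retrieval: keyword overlap stands in for Elasticsearch scoring.
    words = state.question.lower().split()
    state.docs = [d for d in index if any(w in d.lower() for w in words)]

def rewrite(state):
    # Placeholder reformulation; the real system asks an LLM to rewrite the query.
    state.question += " diagnosis criteria"
    state.rewrites += 1

def generate(state):
    return f"Answer based on {len(state.docs)} source(s)."

def run_agent(question, index, max_rewrites=1):
    state = AgentState(question=question)
    while True:
        retrieve(state, index)
        if state.docs or state.rewrites >= max_rewrites:
            return generate(state)  # answer, with sources when retrieval succeeded
        rewrite(state)              # no hits: reformulate the query and loop

print(run_agent("diabetes symptoms", ["Diabetes symptoms include thirst."]))
# → Answer based on 1 source(s).
```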
🖼️ 3. Multimodal Integration
The system supports multimodal inputs:
- Text: Paragraphs, summaries, and semantic relationships.
- Tables: Extracted with TableFormer and stored with relational metadata.
- Images: Described by Gemini 2.5-Flash and linked to related textual content.
Each chunk is semantically enriched with metadata (e.g., section, page number, caption, and hierarchy).
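One way such an enriched chunk might be serialized for indexing (the field names and sample values are illustrative, not the project's actual Elasticsearch mapping):

```python
import json

def to_index_doc(modality, content, *, section, page, caption=None):
    """Flatten a chunk into the kind of document a keyword/vector index expects."""
    doc = {
        "modality": modality,   # "text" | "table" | "image"
        "content": content,     # raw text, linearized table, or image description
        "metadata": {"section": section, "page": page},
    }
    if caption:
        doc["metadata"]["caption"] = caption
    return doc

# An image chunk: the Gemini-generated description becomes the searchable content.
image_doc = to_index_doc(
    "image",
    "Erythematous plaques with silvery scale on the elbow, consistent with psoriasis.",
    section="Dermatology > Psoriasis",
    page=4311,
    caption="Chronic plaque psoriasis.",
)
print(json.dumps(image_doc, indent=2))
```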
📚 References
- Oxford Textbook of Medicine, 6th Edition
- Docling: https://github.com/IBM/Docling
- LangGraph: https://github.com/langchain-ai/langgraph
- TableFormer, Heron, EasyOCR, Gemini 2.5-Flash documentation
- Comparison study: Docling vs GPT-5
Quick Start
Installation
```shell
uv pip install torch --index-url https://download.pytorch.org/whl/cpu
uv pip install -e .
```
Launch
```shell
# Copy and customize configuration
cp config.yaml.example config.yaml
# Edit config.yaml with your API tokens and settings

# Run the application
python multimodal_rag
```
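A config for a system like this would typically carry model and index credentials; the keys below are hypothetical placeholders, not the project's actual schema — consult `config.yaml.example` for the real fields.

```yaml
# Hypothetical example — check config.yaml.example for the real keys.
gemini:
  api_key: "YOUR_GEMINI_API_KEY"
elasticsearch:
  url: "http://localhost:9200"
  index: "multimodal_rag"
telegram:
  bot_token: "YOUR_BOT_TOKEN"
```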
📄 License
This project is licensed under the MIT License.
