
MultiModalRag

A Multi-Modal Agentic RAG pipeline designed to handle unstructured documents containing tables, charts, and images. It integrates Docling and Elasticsearch for structured indexing, and leverages LangGraph for agent-based reasoning and dynamic query reformulation.

Install / Use

/learn @Alijanloo/MultiModalRag

🧠 MultiModal Agentic RAG

A Multimodal, Agentic RAG System for Intelligent Document Processing

Demo


📘 Overview

The MultiModal Agentic RAG system is designed to intelligently process and retrieve information from complex, unstructured documents by leveraging a Retrieval-Augmented Generation (RAG) pipeline with multimodal (text, table, and image) understanding.

It integrates advanced models for document structure analysis, semantic chunking, and agentic reasoning, enabling context-aware, explainable question answering.

This system is demonstrated using the Oxford Textbook of Medicine (6th Edition), chosen for its diverse complexity including detailed tables, diagnostic diagrams, medical images, and multi-level content organization.


🎯 Project Goals

  • Convert unstructured PDFs into structured and searchable formats (Markdown & JSON).
  • Preserve document hierarchy, semantic relationships, and contextual information.
  • Analyze multimodal content—text, tables, and diagnostic images—through specialized models.
  • Build an intelligent retrieval system using RAG with hierarchical context preservation.
  • Enable an accurate conversational chatbot grounded in information retrieved from text, tables, and images.
  • Implement modular, maintainable system architecture following Clean Architecture principles.

🏗️ System Architecture

This project follows Clean Architecture to ensure high scalability, maintainability, and testability. Dependencies flow inward—from frameworks toward the core business logic.

Layers

| Layer | Description |
| ------------------------ | ---------------------------------------------------------------------------------- |
| Entities | Core data models and validation, independent of frameworks. |
| Use Cases | Implements the application's business logic (indexing and inference pipelines). |
| Interface Adapters | Bridges entities and external interfaces (e.g., APIs, vector stores). |
| Frameworks & Drivers | Outer layer containing frameworks (PTB, GenAI, etc.). |

Key Design Principles

  • Dependency Rule: Inner layers never depend on outer layers.
  • Dependency Inversion: Interfaces are defined at the core, implementations in outer layers.
  • Dependency Injection: Managed via the dependency-injector library for modular configuration.
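The Dependency Inversion and Injection principles above can be sketched in plain Python: the core defines a retriever abstraction, and an outer-layer adapter implements it and is injected from a composition root. All class and function names here are illustrative stand-ins, not the project's actual code (which wires this up with the dependency-injector library and an Elasticsearch-backed retriever).

```python
from typing import Protocol

# Core layer: the use case depends only on this abstraction.
class DocumentRetriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...

class AnswerQuestionUseCase:
    """Business logic; knows nothing about Elasticsearch."""
    def __init__(self, retriever: DocumentRetriever) -> None:
        self.retriever = retriever

    def run(self, question: str) -> str:
        docs = self.retriever.retrieve(question, k=2)
        return f"Answer based on {len(docs)} retrieved chunks."

# Outer layer: a concrete adapter (an in-memory stand-in here;
# the real project would back this with Elasticsearch).
class InMemoryRetriever:
    def __init__(self, corpus: list[str]) -> None:
        self.corpus = corpus

    def retrieve(self, query: str, k: int) -> list[str]:
        return [d for d in self.corpus if query.lower() in d.lower()][:k]

# Composition root: dependencies are injected from the outside in.
use_case = AnswerQuestionUseCase(InMemoryRetriever(
    ["Diabetes symptoms include thirst.", "Tables describe dosages."]))
print(use_case.run("diabetes"))  # Answer based on 1 retrieved chunks.
```

Because the use case only sees the `DocumentRetriever` interface, the Elasticsearch implementation can be swapped for a fake in tests without touching the core.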

🧩 Core Components

🧱 1. Docling-Based Document Processing

The system uses Docling, an open-source library developed by IBM Research, to parse and structure PDFs.

Key features of the indexing pipeline:

  • Hierarchical structure recognition (titles, sections, paragraphs).
  • Accurate table extraction via TableFormer.
  • Layout analysis using Heron (RT-DETR + ResNet-50).
  • Optical Character Recognition (OCR) with EasyOCR (CRAFT + CRNN).
  • Image captioning with Gemini 2.5-Flash to enable content-based image retrieval. For example, when text discusses a skin condition and includes a diagnostic image, the image description enables retrieval based on visual content, enriching responses with related diagnostic information.
  • Intelligent contextual chunking that preserves semantic relations.
  • Structured output in Markdown and JSON for downstream processing.
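As a rough illustration of the hierarchy-preserving chunking idea (a minimal sketch, not Docling's actual implementation), each body chunk can be emitted together with the breadcrumb of headings above it, so a retrieved chunk still "knows" where it sits in the document:

```python
# Minimal sketch of hierarchy-aware chunking: body text is emitted
# with the heading breadcrumb above it, preserving section context.
def contextual_chunks(blocks: list[tuple[int, str]]) -> list[str]:
    """blocks: (heading_level, text) pairs; level 0 means body text."""
    path: list[str] = []       # current heading breadcrumb
    chunks: list[str] = []
    for level, text in blocks:
        if level > 0:                  # a heading: update the path
            path = path[: level - 1] + [text]
        else:                          # body text: emit with context
            chunks.append(" > ".join(path) + ": " + text)
    return chunks

doc = [
    (1, "Endocrinology"),
    (2, "Diabetes mellitus"),
    (0, "Type 2 diabetes presents with polyuria and thirst."),
]
print(contextual_chunks(doc))
# ['Endocrinology > Diabetes mellitus: Type 2 diabetes presents with polyuria and thirst.']
```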

🧠 2. Agentic RAG System

Agentic RAG Workflow

The RAG system uses LangGraph to create an agentic reasoning workflow, where each node represents a step in the retrieval and generation process.

Key Capabilities:

  • Intelligent decision-making to determine when to retrieve or directly answer.
  • Context-aware document retrieval with Elasticsearch.
  • Query rewriting for unanswerable or ambiguous questions (e.g., "symptoms of type 2 diabetes mellitus" → "symptoms of type 2 diabetes mellitus diagnosis criteria").
  • Answer generation with detailed source attribution.
  • Visual grounding: retrieves and attaches related images and metadata to answers.
  • Context preservation across dialogue sessions.
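The project builds this workflow with LangGraph; as a framework-agnostic sketch, the decide → retrieve → grade → (rewrite | generate) loop looks roughly like the following. The retriever, grader, rewriter, and generator below are toy stand-ins for the Elasticsearch and LLM calls, and all names are illustrative:

```python
# Framework-agnostic sketch of the agentic RAG control flow; the
# real project wires equivalent nodes together as a LangGraph graph.
def agentic_rag(question: str, retrieve, grade, rewrite, generate,
                max_rewrites: int = 2) -> str:
    query = question
    for _ in range(max_rewrites + 1):
        docs = retrieve(query)
        if grade(question, docs):      # are the docs relevant enough?
            return generate(question, docs)
        query = rewrite(query)         # reformulate and retry
    return generate(question, [])      # fall back to answering directly

# Toy components; the rewrite mirrors the example query above.
corpus = {"symptoms of type 2 diabetes mellitus diagnosis criteria":
          ["HbA1c >= 6.5%"]}
answer = agentic_rag(
    "symptoms of type 2 diabetes mellitus",
    retrieve=lambda q: corpus.get(q, []),
    grade=lambda q, d: bool(d),
    rewrite=lambda q: q + " diagnosis criteria",
    generate=lambda q, d: f"Answer from {len(d)} sources.",
)
print(answer)  # Answer from 1 sources.
```

The first retrieval misses, the query is rewritten, and the second attempt succeeds, mirroring the query-rewriting behaviour described above.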

🖼️ 3. Multimodal Integration

The system supports multimodal inputs:

  • Text: Paragraphs, summaries, and semantic relationships.
  • Tables: Extracted with TableFormer and stored with relational metadata.
  • Images: Described by Gemini 2.5-Flash and linked to related textual content.

Each chunk is semantically enriched with metadata (e.g., section, page number, caption, and hierarchy).



Quick Start

Installation

```shell
uv pip install torch --index-url https://download.pytorch.org/whl/cpu
uv pip install -e .
```

Launch

```shell
# Copy and customize configuration
cp config.yaml.example config.yaml
# Edit config.yaml with your API tokens and settings

# Run the application
python multimodal_rag
```

📄 License

This project is licensed under the MIT License.
