ColPali: Efficient Document Retrieval with Vision Language Models 👀


[Model card] [ViDoRe Leaderboard] [Demo] [Blog Post]


This repository contains the code used for training the vision retrievers in the ColPali: Efficient Document Retrieval with Vision Language Models paper. In particular, it contains the code for training the ColPali model, which is a vision retriever based on the ColBERT architecture and the PaliGemma model.

Introduction

With our new model ColPali, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.
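The multi-vector construction described above (ViT output patches passed through a linear projection) can be sketched as a single matrix multiply. The dimensions below are illustrative stand-ins, not the exact PaliGemma-3B shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n_patches ViT output patches of width vit_dim,
# projected down to proj_dim-wide multi-vectors (ColBERT-style).
n_patches, vit_dim, proj_dim = 1024, 2048, 128

patch_embeddings = rng.standard_normal((n_patches, vit_dim))
projection = rng.standard_normal((vit_dim, proj_dim)) / np.sqrt(vit_dim)

# One page becomes a bag of n_patches low-dimensional embeddings.
doc_multivector = patch_embeddings @ projection
print(doc_multivector.shape)  # (1024, 128)
```

Each page is thus represented by a *set* of embeddings rather than a single pooled vector, which is what enables the late-interaction scoring below.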

Using ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, ...) of a document.
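The ColBERT-style similarity used to train and score these embeddings is the late-interaction "MaxSim" operator: for each query token embedding, take the maximum dot product over all document embeddings, then sum over query tokens. A minimal numpy sketch (the library computes this for you via `processor.score_multi_vector`):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score.

    query_emb: (n_query_tokens, dim)
    doc_emb:   (n_doc_tokens, dim)
    """
    sim = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_tokens) dot products
    return float(sim.max(axis=1).sum())  # best document match per query token, summed

q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 2.0]])
print(maxsim_score(q, d))  # 3.0
```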

[Figure: ColPali architecture]

List of ColVision models

| Model | Score on ViDoRe 🏆 | License | Comments | Currently supported |
|-------|--------------------|---------|----------|---------------------|
| vidore/colpali | 81.3 | Gemma | • Based on google/paligemma-3b-mix-448.<br />• Checkpoint used in the ColPali paper. | ❌ |
| vidore/colpali-v1.1 | 81.5 | Gemma | • Based on google/paligemma-3b-mix-448.<br />• Fix right padding for queries. | ✅ |
| vidore/colpali-v1.2 | 83.9 | Gemma | • Similar to vidore/colpali-v1.1. | ✅ |
| vidore/colpali-v1.3 | 84.8 | Gemma | • Similar to vidore/colpali-v1.2.<br />• Trained with a larger effective batch size of 256 for 3 epochs. | ✅ |
| vidore/colqwen2-v0.1 | 87.3 | Apache 2.0 | • Based on Qwen/Qwen2-VL-2B-Instruct.<br />• Supports dynamic resolution.<br />• Trained using 768 image patches per page and an effective batch size of 32. | ✅ |
| vidore/colqwen2-v1.0 | 89.3 | Apache 2.0 | • Similar to vidore/colqwen2-v0.1, but trained with more powerful GPUs and a larger effective batch size (256). | ✅ |
| vidore/colqwen2.5-v0.1 | 88.8 | Apache 2.0 | • Based on Qwen/Qwen2.5-VL-3B-Instruct.<br />• Supports dynamic resolution.<br />• Trained using 768 image patches per page and an effective batch size of 32. | ✅ |
| vidore/colqwen2.5-v0.2 | 89.4 | Apache 2.0 | • Similar to vidore/colqwen2.5-v0.1, but trained with slightly different hyperparameters. | ✅ |
| TomoroAI/tomoro-colqwen3-embed-4b | 90.6 | Apache 2.0 | • Based on the Qwen3-VL backbone.<br />• 320-dim ColBERT-style embeddings with dynamic resolution.<br />• Trained for multi-vector document retrieval. | ✅ |
| athrael-soju/colqwen3.5-4.5B-v3 | 90.9 | Apache 2.0 | • Based on Qwen/Qwen3.5-4B (hybrid GatedDeltaNet + full-attention).<br />• 320-dim ColBERT-style embeddings.<br />• 4.5B params, LoRA-trained. | ✅ |
| vidore/colSmol-256M | 80.1 | Apache 2.0 | • Based on HuggingFaceTB/SmolVLM-256M-Instruct. | ✅ |
| vidore/colSmol-500M | 82.3 | Apache 2.0 | • Based on HuggingFaceTB/SmolVLM-500M-Instruct. | ✅ |
| Cognitive-Lab/ColNetraEmbed | 86.4 | Gemma | • Based on google/gemma-3-4b-it.<br />• Multi-vector late interaction retrieval model.<br />• Multilingual support across 22 languages. | ✅ |
| Cognitive-Lab/NetraEmbed | 81.0 | Gemma | • Based on google/gemma-3-4b-it.<br />• Bi-encoder retrieval model.<br />• Supports Matryoshka embeddings (768, 1536, 2560).<br />• Multilingual support across 22 languages. | ✅ |

Setup

We used Python 3.11.6 and PyTorch 2.4 to train and test our models, but the codebase is compatible with Python >=3.10 and recent PyTorch versions. To install the package, run:

```bash
pip install colpali-engine                              # from PyPI
pip install git+https://github.com/illuin-tech/colpali  # from source
```

Mac users using MPS with the ColQwen models have reported errors with torch 2.6.0. These errors are fixed by downgrading to torch 2.5.1.

> [!WARNING]
> For ColPali versions above v1.0, make sure to install the `colpali-engine` package from source or with a version above v0.2.0.

Usage

Quick start

```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0"

model = ColQwen2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
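The `scores` object holds one score per (query, image) pair, so ranking the corpus for each query is a simple argsort. A minimal sketch with dummy scores standing in for the model's output:

```python
import numpy as np

# Dummy score matrix: 2 queries x 3 images (in practice produced by
# processor.score_multi_vector, one row per query).
scores = np.array([[0.2, 0.9, 0.4],
                   [0.7, 0.1, 0.3]])

# Indices of images for each query, best match first.
ranking = np.argsort(-scores, axis=1)
print(ranking[0])  # [1 2 0]
print(ranking[1])  # [0 2 1]
```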

We now have experimental support for fast-plaid to speed up matching over larger corpora:

# !pip install --no