ChatSearch: A Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
A unified image retrieval system based on large multimodal models, supporting general conversational image retrieval tasks.
Introduction
UniChatIR is the official implementation of the ChatSearch paper: a generative image retrieval system built on the Emu/LLaVA architecture. It leverages large multimodal models (LMMs) for high-quality text-to-image retrieval and conversational image retrieval. The system follows a generative retrieval paradigm, producing unified image representations through a large language model, and supports a range of retrieval tasks and datasets, including:
- Standard Image Retrieval: Flickr30K, COCO, etc.
- Composed Image Retrieval: CIRR (Composed Image Retrieval on Real-life images)
- Fashion Image Retrieval: Fashion-IQ
- Visual Story Retrieval: VIST (Visual Storytelling)
- Conversational Image Retrieval: multi-turn conversational context understanding (see the toy illustration after this list)
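In the conversational setting, the retrieval query carries the full dialogue context rather than a single caption. A toy illustration of the idea in Python; the prompt format shown is an assumption for illustration only, the real format is defined by the dataset processors:

# Hypothetical multi-turn context; flattening it into one query string
# only illustrates the idea of conditioning retrieval on the dialogue.
dialogue = [
    ("user", "Show me a photo of a dog on the beach."),
    ("assistant", "Here is one."),
    ("user", "Now one where the dog is running."),
]
query = " ".join(f"{role}: {text}" for role, text in dialogue)
print(query)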
Key Features
- 🎯 Multi-task Support: Supports various image retrieval tasks and datasets
- 🚀 Generative Retrieval: Adopts a generative retrieval paradigm, leveraging large language models to generate unified image representations
- 💬 Conversational Retrieval: Supports multi-turn conversational context understanding for general conversational image retrieval
- 🔧 Easy to Use: Provides simple command-line interfaces and a Gradio demo interface
- 📊 Flexible Configuration: Supports various model configurations and evaluation metrics
- 🎨 Multimodal Fusion: Unified architecture based on CLIP visual encoder and LLaMA language model
Installation
Requirements
- Python >= 3.8
- PyTorch >= 1.12.0
- CUDA >= 11.0 (recommended)
Installation Steps
- Clone the repository:
git clone https://github.com/CASIA-IVA-Lab/ChatSearch.git
cd ChatSearch
- Create a virtual environment and install dependencies:
conda create -n unichatir python=3.10 -y
conda activate unichatir
pip install --upgrade pip
pip install -e .
- Install training-related dependencies (optional):
pip install ninja
pip install flash-attn --no-build-isolation
Quick Start
1. Prepare Data
First, you need to prepare the following data (a minimal loading sketch follows the list):
- Image Feature Files: Pre-computed image features (.pt format)
- Annotation Files: JSON files containing image IDs and metadata
- Image Directory: Root directory of image files
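As a sanity check, the feature and annotation files can be inspected directly. A minimal sketch, assuming the .pt file holds a feature tensor or a dict of tensors; the exact layout depends on the extraction script you used:

import json
import torch

# Assumed layout: either a [num_images, feat_dim] tensor or a dict
# mapping image IDs to feature tensors; check the extraction scripts.
image_feats = torch.load("image_features.pt", map_location="cpu")
print(type(image_feats))

# Annotation files are plain JSON containing image IDs and metadata.
with open("annotations.json") as f:
    annotations = json.load(f)
print(len(annotations))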
2. Run Demo
Use the Gradio interface for interactive image retrieval:
python demo.py \
--model-cfg emu_models/Emu-8B_frozenvis_cliploss.json \
--checkpoint /path/to/checkpoint.pth \
--image-feat-path /path/to/image_features.pt \
--annotation-path /path/to/annotations.json \
--image-root /path/to/images
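Under the hood, demo.py exposes the retrieval pipeline through Gradio. The sketch below shows only the general wiring pattern; the retrieve function body is a hypothetical placeholder, not the actual demo code:

import gradio as gr

def retrieve(query: str):
    # Hypothetical stand-in for the real pipeline: encode the query with
    # the LMM, score it against the precomputed image features, and
    # return the top-ranked image paths for the gallery.
    results = []  # e.g. ["/path/to/images/0001.jpg", ...]
    return results

demo = gr.Interface(fn=retrieve, inputs="text", outputs=gr.Gallery())
demo.launch()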
3. Evaluate Model
Evaluate model performance on standard datasets:
python utils/retrieval_new.py \
--checkpoint /path/to/checkpoint.pth \
--model-cfg emu_models/Emu-8B_frozenvis_cliploss_vitl.json \
--vis-roots /path/to/images1,/path/to/images2 \
--ann-paths /path/to/ann1.json,/path/to/ann2.json \
--bs 16 \
--evaluate
Project Structure
unichatir/
├── demo.py # Gradio demo interface
├── utils/
│ ├── retrieval_new.py # Standard image retrieval evaluation
│ ├── retrieval_new_cirr.py # CIRR dataset evaluation
│ ├── retrieval_new_fashion.py # Fashion-IQ dataset evaluation
│ ├── retrieval_new_vist.py # VIST dataset evaluation
│ └── extract_vitfeat_*.py # Image feature extraction scripts
├── emu_models/ # Model definitions
│ ├── modeling_uniir.py # Unified image retrieval model (Emu_clip_VIT)
│ ├── modeling_llama.py # LLaMA language model (supports classification and regression)
│ ├── eva_vit.py # EVA ViT visual encoder
│ └── ...
├── llava/ # LLaVA related code
│ ├── dataset_finetune.py # Dataset definitions
│ ├── dataset_cirr.py # CIRR dataset
│ ├── processors/ # Data processors
│ └── train/ # Training scripts
└── scripts/ # Training and evaluation scripts
Usage
Image Feature Extraction
Before running retrieval, you need to extract image features. The system provides extraction scripts for several datasets:
# Flickr30K dataset
python utils/extract_vitfeat_flickr.py \
--data-dir /path/to/images \
--save-pt-path /path/to/features.pt \
--save-url-path /path/to/urls.json
# COCO dataset
python utils/extract_vitfeat_coco.py \
--data-dir /path/to/images \
--save-pt-path /path/to/features.pt \
--save-url-path /path/to/urls.json
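Once features are saved, text-to-image retrieval reduces to a similarity search over the feature tensor. A minimal sketch, assuming the .pt file holds an [N, D] float tensor; the query encoder is replaced by a random placeholder here:

import torch
import torch.nn.functional as F

image_feats = torch.load("features.pt", map_location="cpu")  # assumed [N, D]
image_feats = F.normalize(image_feats, dim=-1)

# In the real system the query embedding comes from the LMM; a random
# vector of the same dimensionality stands in for it here.
query_feat = F.normalize(torch.randn(image_feats.size(-1)), dim=-1)

scores = image_feats @ query_feat   # cosine similarities, shape [N]
topk = scores.topk(k=5)
print(topk.indices.tolist())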
Model Training
Model training supports various configurations and dataset combinations. Main training scripts are located in the scripts/ directory:
- run_frozenvis_cliploss.sh: Pre-training script using CLIP loss
- run_uniir.sh: Unified image retrieval training script
- run-ft.sh: Fine-tuning script
The training process supports:
- Multi-node distributed training
- Mixed precision training (bf16)
- Gradient checkpointing
- Flash Attention acceleration
Please refer to each script file for detailed configurations.
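For reference, the two memory-saving mechanisms look roughly like this in plain PyTorch (a generic sketch, not the project's training loop):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x, target = torch.randn(8, 512), torch.randn(8, 512)

optimizer.zero_grad()
# Mixed precision: run the forward pass in bfloat16 via autocast.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Gradient checkpointing: recompute activations during backward
    # instead of storing them, trading compute for memory.
    out = checkpoint(model, x, use_reentrant=False)
    loss = nn.functional.mse_loss(out.float(), target)
loss.backward()
optimizer.step()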
Evaluation Metrics
The system supports the following evaluation metrics (a sketch of how they are computed follows the list):
- Recall@K (R@K): proportion of queries whose ground-truth image appears in the top K results
- Median Rank: median rank of the ground-truth image across queries
- Mean Rank: mean rank of the ground-truth image across queries
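All three metrics derive from the rank of the ground-truth image in the sorted score list. A minimal sketch, assuming one ground-truth image per query; names are illustrative, not the project's evaluation code:

import torch

def retrieval_metrics(scores, gt, ks=(1, 5, 10)):
    # scores: [num_queries, num_images]; gt: [num_queries] ground-truth indices.
    order = scores.argsort(dim=1, descending=True)
    # 1-indexed rank of the ground-truth image for each query.
    ranks = (order == gt[:, None]).float().argmax(dim=1) + 1
    metrics = {f"R@{k}": (ranks <= k).float().mean().item() for k in ks}
    metrics["MedianRank"] = ranks.median().item()
    metrics["MeanRank"] = ranks.float().mean().item()
    return metrics

# Toy example: 3 queries scored against 100 images.
scores = torch.randn(3, 100)
print(retrieval_metrics(scores, torch.tensor([4, 17, 63])))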
Dataset Support
- Flickr30K: ~31K images, each paired with 5 captions
- COCO: Microsoft COCO dataset
- CIRR: Composed Image Retrieval on Real-life images dataset
- Fashion-IQ: Fashion image retrieval dataset
- VIST: Visual Storytelling dataset
Model Architecture
The system adopts a generative retrieval architecture with the following main components (a flow sketch follows the list):
- Visual Encoder: Frozen visual encoder based on CLIP ViT
- Language Model: LLaMA-based decoder supporting generative retrieval
- Projection Layer: Projects visual features into language model space
- Retrieval Heads: Text head and vision head for computing similarity
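Put together, the retrieval flow looks roughly like the sketch below. Module names and dimensions are illustrative assumptions, not the actual modeling_uniir.py code:

import torch
import torch.nn as nn
import torch.nn.functional as F

vis_dim, llm_dim, emb_dim = 1024, 4096, 256  # illustrative sizes only

projection = nn.Linear(vis_dim, llm_dim)   # visual features -> LM space
vision_head = nn.Linear(llm_dim, emb_dim)  # image-side retrieval embedding
text_head = nn.Linear(llm_dim, emb_dim)    # query-side retrieval embedding

vis_feats = torch.randn(10, vis_dim)   # features of 10 candidate images
lm_state = torch.randn(1, llm_dim)     # LM hidden state for the query

img_emb = F.normalize(vision_head(projection(vis_feats)), dim=-1)
txt_emb = F.normalize(text_head(lm_state), dim=-1)
scores = txt_emb @ img_emb.T           # cosine similarities, shape [1, 10]
print(scores.argmax(dim=-1))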
Model Configuration
The project supports various model configurations. Configuration files are located in the emu_models/ directory:
- Emu-8B_frozenvis_cliploss.json: base configuration (ViT-B visual encoder)
- Emu-8B_frozenvis_cliploss_vitl.json: configuration with a ViT-L visual encoder
Acknowledgments
This project is based on the following open-source projects:
- LLaVA: Large Language and Vision Assistant
- Emu: Multimodal foundation model
- CLIP: Vision-language pre-trained model
- LLaMA: Large language model
Citation
If you use this project, please cite the following paper:
@article{zhao2025chatsearch,
title={Chatsearch: A dataset and a generative retrieval model for general conversational image retrieval},
author={Zhao, Zijia and Guo, Longteng and Yue, Tongtian and Hu, Erdong and Shao, Shuai and Yuan, Zehuan and Huang, Hua and Liu, Jing},
journal={Pattern Recognition},
year={2025},
publisher={Elsevier}
}