ChatSearch: A Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
A unified image retrieval system based on large multimodal models, supporting general conversational image retrieval tasks.
Introduction
UniChatIR is the official implementation of the ChatSearch paper: a generative image retrieval system built on the Emu/LLaVA architecture. It leverages large multimodal models (LMMs) for high-quality text-to-image retrieval and conversational image retrieval. The system follows a generative retrieval paradigm, producing unified image representations through a large language model, and supports a range of retrieval tasks and datasets, including:
- Standard Image Retrieval: Flickr30K, COCO, etc.
- Composed Image Retrieval: CIRR (Composed Image Retrieval on Real-life images)
- Fashion Image Retrieval: Fashion-IQ
- Visual Story Retrieval: VIST (Visual Storytelling)
- Conversational Image Retrieval: multi-turn conversational context understanding (see the toy illustration after this list)
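In the conversational setting, the retrieval query carries the full dialogue context rather than a single caption. A toy illustration of the idea in Python; the prompt format shown is an assumption for illustration only, the real format is defined by the dataset processors:

# Hypothetical multi-turn context; flattening it into one query string
# only illustrates the idea of conditioning retrieval on the dialogue.
dialogue = [
    ("user", "Show me a photo of a dog on the beach."),
    ("assistant", "Here is one."),
    ("user", "Now one where the dog is running."),
]
query = " ".join(f"{role}: {text}" for role, text in dialogue)
print(query)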
Key Features
- 🎯 Multi-task Support: Supports various image retrieval tasks and datasets
- 🚀 Generative Retrieval: Adopts a generative retrieval paradigm, leveraging large language models to generate unified image representations
- 💬 Conversational Retrieval: Supports multi-turn conversational context understanding for general conversational image retrieval
- 🔧 Easy to Use: Provides simple command-line interfaces and a Gradio demo interface
- 📊 Flexible Configuration: Supports various model configurations and evaluation metrics
- 🎨 Multimodal Fusion: Unified architecture based on CLIP visual encoder and LLaMA language model
Installation
Requirements
- Python >= 3.8
- PyTorch >= 1.12.0
- CUDA >= 11.0 (recommended)
Installation Steps
- Clone the repository:
git clone https://github.com/CASIA-IVA-Lab/ChatSearch.git
cd ChatSearch
- Create a virtual environment and install dependencies:
conda create -n unichatir python=3.10 -y
conda activate unichatir
pip install --upgrade pip
pip install -e .
- Install training-related dependencies (optional):
pip install ninja
pip install flash-attn --no-build-isolation
Quick Start
1. Prepare Data
First, you need to prepare the following data (a minimal loading sketch follows the list):
- Image Feature Files: Pre-computed image features (.pt format)
- Annotation Files: JSON files containing image IDs and metadata
- Image Directory: Root directory of image files
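As a sanity check, the feature and annotation files can be inspected directly. A minimal sketch, assuming the .pt file holds a feature tensor or a dict of tensors; the exact layout depends on the extraction script you used:

import json
import torch

# Assumed layout: either a [num_images, feat_dim] tensor or a dict
# mapping image IDs to feature tensors; check the extraction scripts.
image_feats = torch.load("image_features.pt", map_location="cpu")
print(type(image_feats))

# Annotation files are plain JSON containing image IDs and metadata.
with open("annotations.json") as f:
    annotations = json.load(f)
print(len(annotations))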
2. Run Demo
Use the Gradio interface for interactive image retrieval:
python demo.py \
--model-cfg emu_models/Emu-8B_frozenvis_cliploss.json \
--checkpoint /path/to/checkpoint.pth \
--image-feat-path /path/to/image_features.pt \
--annotation-path /path/to/annotations.json \
--image-root /path/to/images
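Under the hood, demo.py exposes the retrieval pipeline through Gradio. The sketch below shows only the general wiring pattern; the retrieve function body is a hypothetical placeholder, not the actual demo code:

import gradio as gr

def retrieve(query: str):
    # Hypothetical stand-in for the real pipeline: encode the query with
    # the LMM, score it against the precomputed image features, and
    # return the top-ranked image paths for the gallery.
    results = []  # e.g. ["/path/to/images/0001.jpg", ...]
    return results

demo = gr.Interface(fn=retrieve, inputs="text", outputs=gr.Gallery())
demo.launch()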
3. Evaluate Model
Evaluate model performance on standard datasets:
python utils/retrieval_new.py \
--checkpoint /path/to/checkpoint.pth \
--model-cfg emu_models/Emu-8B_frozenvis_cliploss_vitl.json \
--vis-roots /path/to/images1,/path/to/images2 \
--ann-paths /path/to/ann1.json,/path/to/ann2.json \
--bs 16 \
--evaluate
Project Structure
unichatir/
├── demo.py # Gradio demo interface
├── utils/
│ ├── retrieval_new.py # Standard image retrieval evaluation
│ ├── retrieval_new_cirr.py # CIRR dataset evaluation
│ ├── retrieval_new_fashion.py # Fashion-IQ dataset evaluation
│ ├── retrieval_new_vist.py # VIST dataset evaluation
│ └── extract_vitfeat_*.py # Image feature extraction scripts
├── emu_models/ # Model definitions
│ ├── modeling_uniir.py # Unified image retrieval model (Emu_clip_VIT)
│ ├── modeling_llama.py # LLaMA language model (supports classification and regression)
│ ├── eva_vit.py # EVA ViT visual encoder
│ └── ...
├── llava/ # LLaVA related code
│ ├── dataset_finetune.py # Dataset definitions
│ ├── dataset_cirr.py # CIRR dataset
│ ├── processors/ # Data processors
│ └── train/ # Training scripts
└── scripts/ # Training and evaluation scripts
Usage
Image Feature Extraction
Before running retrieval, you need to extract image features. The system provides extraction scripts for several datasets:
# Flickr30K dataset
python utils/extract_vitfeat_flickr.py \
--data-dir /path/to/images \
--save-pt-path /path/to/features.pt \
--save-url-path /path/to/urls.json
# COCO dataset
python utils/extract_vitfeat_coco.py \
--data-dir /path/to/images \
--save-pt-path /path/to/features.pt \
--save-url-path /path/to/urls.json
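Once features are saved, text-to-image retrieval reduces to a similarity search over the feature tensor. A minimal sketch, assuming the .pt file holds an [N, D] float tensor; the query encoder is replaced by a random placeholder here:

import torch
import torch.nn.functional as F

image_feats = torch.load("features.pt", map_location="cpu")  # assumed [N, D]
image_feats = F.normalize(image_feats, dim=-1)

# In the real system the query embedding comes from the LMM; a random
# vector of the same dimensionality stands in for it here.
query_feat = F.normalize(torch.randn(image_feats.size(-1)), dim=-1)

scores = image_feats @ query_feat   # cosine similarities, shape [N]
topk = scores.topk(k=5)
print(topk.indices.tolist())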
Model Training
Model training supports various configurations and dataset combinations. Main training scripts are located in the scripts/ directory:
- run_frozenvis_cliploss.sh: Pre-training script using CLIP loss
- run_uniir.sh: Unified image retrieval training script
- run-ft.sh: Fine-tuning script
The training process supports:
- Multi-node distributed training
- Mixed precision training (bf16)
- Gradient checkpointing
- Flash Attention acceleration
Please refer to each script file for detailed configurations.
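For reference, the two memory-saving mechanisms look roughly like this in plain PyTorch (a generic sketch, not the project's training loop):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x, target = torch.randn(8, 512), torch.randn(8, 512)

optimizer.zero_grad()
# Mixed precision: run the forward pass in bfloat16 via autocast.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Gradient checkpointing: recompute activations during backward
    # instead of storing them, trading compute for memory.
    out = checkpoint(model, x, use_reentrant=False)
    loss = nn.functional.mse_loss(out.float(), target)
loss.backward()
optimizer.step()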
Evaluation Metrics
The system supports the following evaluation metrics (a sketch of how they are computed follows the list):
- Recall@K (R@K): proportion of queries whose ground-truth image appears in the top K results
- Median Rank: median rank of the ground-truth image across queries
- Mean Rank: mean rank of the ground-truth image across queries
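All three metrics derive from the rank of the ground-truth image in the sorted score list. A minimal sketch, assuming one ground-truth image per query; names are illustrative, not the project's evaluation code:

import torch

def retrieval_metrics(scores, gt, ks=(1, 5, 10)):
    # scores: [num_queries, num_images]; gt: [num_queries] ground-truth indices.
    order = scores.argsort(dim=1, descending=True)
    # 1-indexed rank of the ground-truth image for each query.
    ranks = (order == gt[:, None]).float().argmax(dim=1) + 1
    metrics = {f"R@{k}": (ranks <= k).float().mean().item() for k in ks}
    metrics["MedianRank"] = ranks.median().item()
    metrics["MeanRank"] = ranks.float().mean().item()
    return metrics

# Toy example: 3 queries scored against 100 images.
scores = torch.randn(3, 100)
print(retrieval_metrics(scores, torch.tensor([4, 17, 63])))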
Dataset Support
- Flickr30K: ~31K images, each paired with 5 captions
- COCO: Microsoft COCO dataset
- CIRR: Composed Image Retrieval on Real-life images dataset
- Fashion-IQ: Fashion image retrieval dataset
- VIST: Visual Storytelling dataset
Model Architecture
The system adopts a generative retrieval architecture with the following main components (a flow sketch follows the list):
- Visual Encoder: Frozen visual encoder based on CLIP ViT
- Language Model: LLaMA-based decoder supporting generative retrieval
- Projection Layer: Projects visual features into language model space
- Retrieval Heads: Text head and vision head for computing similarity
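Put together, the retrieval flow looks roughly like the sketch below. Module names and dimensions are illustrative assumptions, not the actual modeling_uniir.py code:

import torch
import torch.nn as nn
import torch.nn.functional as F

vis_dim, llm_dim, emb_dim = 1024, 4096, 256  # illustrative sizes only

projection = nn.Linear(vis_dim, llm_dim)   # visual features -> LM space
vision_head = nn.Linear(llm_dim, emb_dim)  # image-side retrieval embedding
text_head = nn.Linear(llm_dim, emb_dim)    # query-side retrieval embedding

vis_feats = torch.randn(10, vis_dim)   # features of 10 candidate images
lm_state = torch.randn(1, llm_dim)     # LM hidden state for the query

img_emb = F.normalize(vision_head(projection(vis_feats)), dim=-1)
txt_emb = F.normalize(text_head(lm_state), dim=-1)
scores = txt_emb @ img_emb.T           # cosine similarities, shape [1, 10]
print(scores.argmax(dim=-1))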
Model Configuration
The project supports various model configurations. Configuration files are located in the emu_models/ directory:
- Emu-8B_frozenvis_cliploss.json: base configuration (ViT-B visual encoder)
- Emu-8B_frozenvis_cliploss_vitl.json: configuration with a ViT-L visual encoder
Acknowledgments
This project is based on the following open-source projects:
- LLaVA: Large Language and Vision Assistant
- Emu: Multimodal foundation model
- CLIP: Vision-language pre-trained model
- LLaMA: Large language model
Citation
If you use this project, please cite the following paper:
@article{zhao2025chatsearch,
title={Chatsearch: A dataset and a generative retrieval model for general conversational image retrieval},
author={Zhao, Zijia and Guo, Longteng and Yue, Tongtian and Hu, Erdong and Shao, Shuai and Yuan, Zehuan and Huang, Hua and Liu, Jing},
journal={Pattern Recognition},
year={2025},
publisher={Elsevier}
}