DepictQA: Depicted Image Quality Assessment with Vision Language Models

<p align="center"> <img src="docs/logo.png" width="600"> </p> <p align="center"> <font size='4'> <a href="https://depictqa.github.io/" target="_blank">🌏 Project Page</a> • 📀 Datasets ( <a href="https://huggingface.co/datasets/zhiyuanyou/DataDepictQA" target="_blank">huggingface</a> / <a href="https://modelscope.cn/datasets/zhiyuanyou/DataDepictQA" target="_blank">modelscope</a> ) </font> </p>

Official PyTorch implementation of the papers:

  • DepictQA-Wild (DepictQA-v2), also named Enhanced DepictQA (EDQA): paper, project page.

    Zhiyuan You, Jinjin Gu, Xin Cai, Zheyuan Li, Kaiwen Zhu, Chao Dong, Tianfan Xue, "Enhancing Descriptive Image Quality Assessment with A Large-scale Multi-modal Dataset," TIP, 2025.

  • DepictQA-v1: paper, project page.

    Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, Chao Dong, "Depicting beyond scores: Advancing image quality assessment through multi-modal language models," ECCV, 2024.

<p align="center"> <img src="docs/res.png"> </p>

Update

📆 [2025.11] DepictQA-Wild (DepictQA-v2), also named Enhanced DepictQA (EDQA), was accepted to TIP.

📆 [2025.02] DeQA-Score was accepted to CVPR 2025.

📆 [2025.01] We released DeQA-Score, a distribution-based depicted image quality assessment model for score regression. Datasets, code, and model weights (full tuning / LoRA tuning) are available.

📆 [2024.07] DepictQA datasets were released in <a href="https://huggingface.co/datasets/zhiyuanyou/DataDepictQA" target="_blank">huggingface</a> / <a href="https://modelscope.cn/datasets/zhiyuanyou/DataDepictQA" target="_blank">modelscope</a>.

📆 [2024.07] DepictQA-v1 was accepted to ECCV 2024.

📆 [2024.05] We released DepictQA-Wild (DepictQA-v2): a multi-functional in-the-wild descriptive image quality assessment model.

📆 [2023.12] We released DepictQA-v1, a multi-modal image quality assessment model based on vision language models.

Installation

  • Create environment.

    # clone this repo
    git clone https://github.com/XPixelGroup/DepictQA.git
    cd DepictQA
    
    # create environment
    conda create -n depictqa python=3.10
    conda activate depictqa
    pip install -r requirements.txt
    
  • Download pretrained models.

    • CLIP-ViT-L-14. Required.
    • Vicuna-v1.5-7B. Required.
    • All-MiniLM-L6-v2. Required only for confidence estimation of detailed reasoning responses.
    • Our pretrained delta checkpoint (see Models). Optional for training. Required for demo and inference.
  • Ensure that all downloaded models are placed in the designated directories as follows.

    |-- DepictQA
    |-- ModelZoo
        |-- CLIP
            |-- clip
                |-- ViT-L-14.pt
        |-- LLM
            |-- vicuna
                |-- vicuna-7b-v1.5
        |-- SentenceTransformers
            |-- all-MiniLM-L6-v2
    

    If the models are stored in different directories, revise config.model.vision_encoder_path, config.model.llm_path, and config.model.sentence_model in config.yaml (under the experiments directory) to set the new paths.

  • Move our pretrained delta checkpoint to a specific experiment directory (e.g., DQ495K, DQ495K_QPath) as follows.

    |-- DepictQA
        |-- experiments
            |-- a_specific_experiment_directory
                |-- ckpt
                    |-- ckpt.pt
    

    If the delta checkpoint is stored in another directory, revise config.model.delta_path in config.yaml (under the experiments directory) to set the new path.
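As a quick sanity check before training or inference, you can verify that every configured model path actually exists on disk. A minimal sketch, assuming the ModelZoo layout above (the `check_model_paths` helper and the example paths are illustrative, not part of the repo):

```python
from pathlib import Path

def check_model_paths(paths: dict) -> list:
    """Return the config keys whose path does not exist on disk."""
    return [key for key, path in paths.items() if not Path(path).exists()]

# Example paths mirroring the ModelZoo layout above; adjust to your setup.
paths = {
    "vision_encoder_path": "../ModelZoo/CLIP/clip/ViT-L-14.pt",
    "llm_path": "../ModelZoo/LLM/vicuna/vicuna-7b-v1.5",
    "sentence_model": "../ModelZoo/SentenceTransformers/all-MiniLM-L6-v2",
    "delta_path": "ckpt/ckpt.pt",
}

missing = check_model_paths(paths)
if missing:
    print(f"Missing model paths: {missing}")
```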

Models

| Training Data | Tune | Hugging Face | Description |
| -------- | -------- | -------- | -------- |
| DQ-495K + KonIQ + SPAQ | Abstractor, LoRA | download | Vision abstractor to reduce the number of tokens. Trained on the DQ-495K, KonIQ, and SPAQ datasets. Able to handle images with resolutions above 1000, and to compare images with different contents. |
| DQ-495K + Q-Instruct | Projector, LoRA | download | Trained on the DQ-495K and Q-Instruct (see paper) datasets. Able to answer multiple-choice, yes-or-no, what, and how questions, but degrades on assessment and comparison tasks. |
| DQ-495K + Q-Pathway | Projector, LoRA | download | Trained on the DQ-495K and Q-Pathway (see paper) datasets. Performs well on real images, but degrades on comparison tasks. |
| DQ-495K | Projector, LoRA | download | Trained on the DQ-495K dataset. Used in our paper. |

Demos

<p align="center"> <img src="docs/demo.png"> </p>

Online Demo

We provide an online demo (coming soon) deployed on Hugging Face Spaces.

Gradio Demo

We provide a gradio demo for local test.

  • cd a specific experiment directory: cd experiments/a_specific_experiment_directory

  • Check Installation to make sure (1) the environment is installed, (2) CLIP-ViT-L-14, Vicuna-v1.5-7B, and the pretrained delta checkpoint are downloaded and (3) their paths are set in config.yaml.

  • Launch controller: sh launch_controller.sh

  • Launch gradio server: sh launch_gradio.sh

  • Launch DepictQA worker: sh launch_worker.sh id_of_one_gpu

You can revise the server config in serve.yaml. The URL of the deployed demo will be http://{serve.gradio.host}:{serve.gradio.port}; the default is http://0.0.0.0:12345 if you do not revise serve.yaml.

Note that multiple workers can be launched simultaneously. For each worker, serve.worker.host, serve.worker.port, serve.worker.worker_url, and serve.worker.model_name should be unique.
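Before launching multiple workers, a small helper can catch conflicting configs. A sketch of the uniqueness check described above (the `validate_workers` function and the sample values are illustrative, not part of the repo):

```python
def validate_workers(workers):
    """Raise ValueError if a per-worker field repeats across workers."""
    # Per the note above, these four serve.worker fields must be unique.
    for field in ("host", "port", "worker_url", "model_name"):
        values = [worker[field] for worker in workers]
        if len(set(values)) != len(values):
            raise ValueError(f"Duplicate serve.worker.{field}: {values}")

# Two hypothetical workers on different machines.
workers = [
    {"host": "192.168.0.10", "port": 21002,
     "worker_url": "http://192.168.0.10:21002", "model_name": "depictqa_a"},
    {"host": "192.168.0.11", "port": 21003,
     "worker_url": "http://192.168.0.11:21003", "model_name": "depictqa_b"},
]
validate_workers(workers)  # all fields unique, no error raised
```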

Datasets

  • Source code for DQ-495K (used in DepictQA-v2) dataset construction is provided here.

  • Download the MBAPPS (used in DepictQA-v1) and DQ-495K (used in DepictQA-v2) datasets from <a href="https://huggingface.co/datasets/zhiyuanyou/DataDepictQA" target="_blank">huggingface</a> / <a href="https://modelscope.cn/datasets/zhiyuanyou/DataDepictQA" target="_blank">modelscope</a>. Move the dataset to the same parent directory as this repository, as follows.

    |-- DataDepictQA
    |-- DepictQA
    

    If the dataset is stored in another directory, revise config.data.root_dir in config.yaml (under the experiments directory) to set the new path.

Training

  • cd a specific experiment directory: cd experiments/a_specific_experiment_directory

  • Check Installation to make sure (1) the environment is installed, (2) CLIP-ViT-L-14 and Vicuna-v1.5-7B are downloaded and (3) their paths are set in config.yaml.

  • Run training: sh train.sh ids_of_gpus.

Inference

Inference on Our Benchmark

  • cd a specific experiment directory: cd experiments/a_specific_experiment_directory

  • Check Installation to make sure (1) the environment is installed, (2) CLIP-ViT-L-14, Vicuna-v1.5-7B, and the pretrained delta checkpoint are downloaded and (3) their paths are set in config.yaml.

  • Run a specific inference script (e.g., infer_A_brief.sh): sh infer_A_brief.sh id_of_one_gpu.

Inference on Custom Dataset

  • Construct *.json file for your dataset as follows.

    [
        {
            "id": unique id of each sample, required, 
            "image_ref": reference image, null if not applicable, 
            "image_A": image A, null if not applicable, 
            "image_B": image B, null if not applicable, 
            "query": input question, required, 
        }, 
        ...
    ]
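A meta file in this format can be generated with a short script; a minimal sketch (file names and queries are placeholder values, not part of the repo):

```python
import json

# Each record follows the schema above; unused image slots are None (-> null).
samples = [
    {
        "id": "sample_0001",
        "image_ref": "refs/0001.png",
        "image_A": "dist/0001_a.png",
        "image_B": "dist/0001_b.png",
        "query": "Which image, A or B, has better quality, and why?",
    },
    {
        "id": "sample_0002",
        "image_ref": None,  # non-reference assessment
        "image_A": "dist/0002_a.png",
        "image_B": None,
        "query": "Describe the quality of image A.",
    },
]

with open("custom_meta.json", "w") as f:
    json.dump(samples, f, indent=4)
```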
    
  • cd your experiment directory: cd your_experiment_directory

  • Check Installation to make sure (1) the environment is installed, (2) CLIP-ViT-L-14, Vicuna-v1.5-7B, and the pretrained delta checkpoint are downloaded and (3) their paths are set in config.yaml.

  • Construct your inference script as follows.

    #!/bin/bash
    src_dir=directory_of_src
    export PYTHONPATH=$src_dir:$PYTHONPATH
    export CUDA_VISIBLE_DEVICES=$1
    
    python $src_dir/infer.py \
        --meta_path json_path_1_of_your_dataset \
                    json_path_2_of_your_dataset \
        --dataset_name your_dataset_name_1 \
                       your_dataset_name_2 \
        --task_name task_name \
        --batch_size batch_size
    

    --task_name can be set as follows.

| Task Name | Description |
| -------- | -------- |
| quality_compare | A/B comparison in full-reference |
| quality_compare_noref | A/B comparison in non-reference |
| quality_single_A | Image A assessment in full-reference |
| quality_single_A_noref | Image A assessment in non-reference |
| quality_single_B | Image B assessment in full-reference |
| quality_single_B_noref | Image B assessment in non-reference |
