
[ICCV 2025] HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets 🚀

Paper PDF · Project Page · Demo · Dataset: VLM-150M · Dataset: VLM-1B

Authors
Zhixiang Wei<sup>1,2</sup>, Guangting Wang<sup>2</sup>, et al.
<sup>1</sup> University of Science and Technology of China
<sup>2</sup> WeChat Vision, Tencent Inc.


🔍 Key Contributions

  • 🏭 Efficient Data Generation Pipeline
    Multi-grained annotation pipeline using Large Vision-Language Models (LVLMs)
  • 🗂️ High-Quality Image-Text Datasets
    Generated by state-of-the-art LVLMs, with positive/negative examples and rich text descriptions.
  • 🧠 HQ-CLIP Training Framework
    Novel CLIP training paradigm extending contrastive learning with:
    • Negative description supervision
    • Short tag augmentation

Model Overview


Model Zoo

| Model | Pretrained | ImageNet Top-1 | DataComp Score |
|--|--|--|--|
| CLIP-B-16 | VLM-150M-Medium | 70.6 | 58.6 |
| CLIP-L-14-CLIPA | VLM-1B | 78.6 | 63.8 |
| CLIP-L-14-OPENAI | VLM-1B | 76.5 | 63.7 |

Recaption Model: Qwen2-VL

Datasets

| Dataset | Samples | URL |
|--|--|--|
| VLM-150M | 147M | https://huggingface.co/datasets/zhixiangwei/VLM-150M |
| VLM-1B | 1.37B | https://huggingface.co/datasets/zhixiangwei/VLM-1B |

Dataset Usage Guide

Preparation Steps

  1. (Optional) Download CommonPool Foundation Datasets
    Access CommonPool Large and XLarge versions via:
    CommonPool GitHub Repository

  2. Acquire DFN Base Datasets
    Download DFN Large and XLarge from:
    DFN Hugging Face Datasets

  3. Download HQ-CLIP Datasets
    Obtain our enhanced datasets (a download sketch follows this list):

    • VLM-150M
    • VLM-1B
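
For step 3, a minimal download sketch using the `huggingface_hub` client (our tooling choice here, not a requirement; the repo IDs come from the Datasets table above):

```python
# A minimal sketch, assuming the huggingface_hub client is installed.
# Repo IDs are taken from the Datasets table above.
from huggingface_hub import snapshot_download

# Fetch the VLM-150M annotations (swap in "zhixiangwei/VLM-1B" for the 1B set).
local_dir = snapshot_download(repo_id="zhixiangwei/VLM-150M", repo_type="dataset")
print(f"Annotations downloaded to {local_dir}")
```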

Integration Instructions

Each JSON entry in VLM-150M and VLM-1B corresponds directly to a DFN dataset UID through matching filenames. To utilize our enhanced annotations:

  • Option 1: Direct Caption Replacement
    Substitute the original DFN captions with our generated annotations (a sketch follows this list)

  • Option 2: Dynamic Data Loading
    Modify the Open CLIP dataloader to load our annotations during training runtime

🔜 Detailed implementation guidance will be published in future releases.
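
Until that guidance is released, here is a hypothetical sketch of Option 1. The field names (`uid`, `caption`) and the one-JSON-file-per-shard layout are illustrative assumptions, not the released schema:

```python
# Hypothetical sketch of Option 1 (direct caption replacement).
# The "uid" and "caption" field names are assumptions, not the released schema.
import json
from pathlib import Path

def replace_captions(dfn_json: Path, hqclip_json: Path, out_json: Path) -> None:
    # Index the HQ-CLIP annotations by UID (assumed field name).
    with hqclip_json.open() as f:
        enhanced = {e["uid"]: e["caption"] for e in json.load(f)}

    # Swap in the enhanced caption wherever a matching UID exists.
    with dfn_json.open() as f:
        entries = json.load(f)
    for entry in entries:
        if entry["uid"] in enhanced:
            entry["caption"] = enhanced[entry["uid"]]

    with out_json.open("w") as f:
        json.dump(entries, f)
```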

Model Loading Instructions

Our uploaded weights are compatible with both open_clip and Hugging Face Transformers.

For open_clip users:

```python
import open_clip

# Initialize the model and image transforms from the Hugging Face Hub.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
tokenizer = open_clip.get_tokenizer(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
```
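
A short usage sketch with the model loaded above, scoring one image against two text prompts (the image path and prompts are placeholders):

```python
# Zero-shot image-text similarity with the model loaded above.
import torch
from PIL import Image

image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)  # placeholder path
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize before taking cosine similarities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability over the two prompts
```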

For Hugging Face Transformers users:

```python
from transformers import AutoModel

# Load the model directly from the Hugging Face Hub.
model = AutoModel.from_pretrained(
    'zhixiangwei/vlm150m-hqclip-large-vitb16'
)
```
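
A matching usage sketch for the Transformers path. It assumes the Hub repo ships a processor config so `AutoProcessor` resolves (our assumption, not something this README states):

```python
# Assumes the Hub repo includes a processor config; "example.jpg" is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("zhixiangwei/vlm150m-hqclip-large-vitb16")
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=Image.open("example.jpg"),
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))  # probability over the two prompts
```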

📝 Citation

```bibtex
@misc{hqclip,
      title={HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models},
      author={Zhixiang Wei and Guangting Wang and Xiaoxiao Ma and Ke Mei and Huaian Chen and Yi Jin and Fengyun Rao},
      year={2025},
      eprint={2507.22431},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.22431},
}
```

🙏 Acknowledgments

We thank the authors of the prior works that inspired this project and provided us with codebases, data, and support!
