# HQ-CLIP

[ICCV 2025] HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets
## Authors

Zhixiang Wei<sup>1,2</sup>, Guangting Wang<sup>2</sup>, et al.

<sup>1</sup> University of Science and Technology of China
<sup>2</sup> WeChat Vision, Tencent Inc.
## 🔍 Key Contributions

- 🏭 **Efficient Data Generation Pipeline**: a multi-grained annotation pipeline built on Large Vision-Language Models (LVLMs).
- 🗂️ **High-Quality Image-Text Datasets**: generated by state-of-the-art LVLMs, with positive/negative examples and rich text descriptions.
- 🧠 **HQ-CLIP Training Framework**: a novel CLIP training paradigm that extends contrastive learning with:
  - Negative description supervision
  - Short tag augmentation
## Model Zoo

| Model | Pretrained | ImageNet Top-1 | DataComp Score |
|--|--|--|--|
| CLIP-B-16 | VLM-150M-Medium | 70.6 | 58.6 |
| CLIP-L-14-CLIPA | VLM-1B | 78.6 | 63.8 |
| CLIP-L-14-OPENAI | VLM-1B | 76.5 | 63.7 |
Recaption Model: Qwen2VL
## Datasets

| Dataset | Samples | URL |
|--|--|--|
| VLM-150M | 147M | https://huggingface.co/datasets/zhixiangwei/VLM-150M |
| VLM-1B | 1.37B | https://huggingface.co/datasets/zhixiangwei/VLM-1B |
## Dataset Usage Guide

### Preparation Steps

1. **(Optional) Download the CommonPool foundation datasets.** Access the CommonPool Large and XLarge versions via the CommonPool GitHub repository.
2. **Acquire the DFN base datasets.** Download DFN Large and XLarge from the DFN Hugging Face datasets.
3. **Download the HQ-CLIP datasets.** Obtain our enhanced datasets:
   - VLM-150M
   - VLM-1B
### Integration Instructions

Each JSON entry in VLM-150M and VLM-1B corresponds directly to a DFN dataset UID through matching filenames. To use our enhanced annotations, either:

1. **Option 1: Direct Caption Replacement.** Substitute the original DFN captions with our generated annotations.
2. **Option 2: Dynamic Data Loading.** Modify the open_clip dataloader to load our annotations at training time.
🔜 Detailed implementation guidance will be published in future releases.
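In the meantime, the UID-by-filename correspondence can be sketched in a few lines. This is a minimal sketch, assuming each annotation JSON file is named `<uid>.json` and stores its enhanced caption under a `caption` key; both the file layout and the field names here are illustrative placeholders, not the released schema:

```python
import json
from pathlib import Path


def build_caption_map(annotation_dir: str) -> dict:
    """Index HQ-CLIP annotation entries by UID (taken from the filename stem)."""
    caption_map = {}
    for json_file in Path(annotation_dir).glob("*.json"):
        entry = json.loads(json_file.read_text())
        # Assumed field name: the enhanced caption lives under "caption".
        caption_map[json_file.stem] = entry["caption"]
    return caption_map


def replace_captions(samples: list, caption_map: dict) -> list:
    """Option 1: swap each DFN sample's caption for the HQ-CLIP one, matched by UID.

    Samples without a matching UID keep their original caption.
    """
    return [
        {**s, "caption": caption_map.get(s["uid"], s["caption"])}
        for s in samples
    ]
```

Option 2 would apply the same `caption_map` lookup inside the dataloader's sample-decoding step instead of rewriting the dataset on disk.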
## Model Loading Instructions

Our uploaded weights are compatible with both open_clip and Hugging Face Transformers.

For open_clip users:

```python
import open_clip

# Initialize the model together with its train/val transforms
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
tokenizer = open_clip.get_tokenizer(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
```
For Hugging Face Transformers users:

```python
from transformers import AutoModel

# Load the model directly from the Hugging Face Hub
model = AutoModel.from_pretrained(
    'zhixiangwei/vlm150m-hqclip-large-vitb16'
)
```
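With either loading path, zero-shot matching reduces to cosine similarity between L2-normalized image and text embeddings. A minimal sketch, with NumPy arrays standing in for the outputs of the model's image and text encoders (real usage would tokenize the captions and forward both modalities through the loaded model):

```python
import numpy as np


def cosine_similarity(image_feats: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
    """Cosine similarity between every image row and every text row."""
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return img @ txt.T


# Toy stand-ins for encoded image/text features (2 images, 2 captions).
image_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
text_feats = np.array([[1.0, 0.0], [0.7, 0.7]])

scores = cosine_similarity(image_feats, text_feats)
best = scores.argmax(axis=-1)  # index of the best-matching caption per image
```

The same similarity matrix is what CLIP-style contrastive training scores, so this is also the shape of the logits inside the training loop.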
## 📝 Citation

```bibtex
@misc{hqclip,
      title={HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models},
      author={Zhixiang Wei and Guangting Wang and Xiaoxiao Ma and Ke Mei and Huaian Chen and Yi Jin and Fengyun Rao},
      year={2025},
      eprint={2507.22431},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.22431},
}
```
## 🙏 Acknowledgments

These works greatly inspired us and provided codebases, data, and support; we thank their authors!