
[ICCV 2025] HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets 🚀

Paper PDF · Project Page · Demo · Dataset: VLM-150M · Dataset: VLM-1B

Authors
Zhixiang Wei<sup>1,2</sup>, Guangting Wang<sup>2</sup>, et al.
<sup>1</sup> University of Science and Technology of China
<sup>2</sup> WeChat Vision, Tencent Inc.


🔍 Key Contributions

  • 🏭 Efficient Data Generation Pipeline
    Multi-grained annotation pipeline using Large Vision-Language Models (LVLMs)
  • 🗂️ High-Quality Image-Text Datasets
    Generated by state-of-the-art LVLMs, with positive/negative examples and rich text descriptions.
  • 🧠 HQ-CLIP Training Framework
    Novel CLIP training paradigm extending contrastive learning with:
    • Negative description supervision
    • Short tag augmentation

Model Overview


Model Zoo

| Model | Pretrained | ImageNet Top-1 | DataComp Score |
|--|--|--|--|
| CLIP-B-16 | VLM-150M-Medium | 70.6 | 58.6 |
| CLIP-L-14-CLIPA | VLM-1B | 78.6 | 63.8 |
| CLIP-L-14-OPENAI | VLM-1B | 76.5 | 63.7 |

Recaption Model: Qwen2-VL

Datasets

| Dataset | Samples | URL |
|--|--|--|
| VLM-150M | 147M | https://huggingface.co/datasets/zhixiangwei/VLM-150M |
| VLM-1B | 1.37B | https://huggingface.co/datasets/zhixiangwei/VLM-1B |

Dataset Usage Guide

Preparation Steps

  1. (Optional) Download CommonPool Foundation Datasets
    Access CommonPool Large and XLarge versions via:
    CommonPool GitHub Repository

  2. Acquire DFN Base Datasets
    Download DFN Large and XLarge from:
    DFN Hugging Face Datasets

  3. Download HQ-CLIP Datasets
    Obtain our enhanced datasets (a download sketch follows this list):

    • VLM-150M
    • VLM-1B
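
For step 3, a minimal download sketch using the `huggingface_hub` client (our tooling choice here, not a requirement; the repo IDs come from the Datasets table above):

```python
# A minimal sketch, assuming the huggingface_hub client is installed.
# Repo IDs are taken from the Datasets table above.
from huggingface_hub import snapshot_download

# Fetch the VLM-150M annotations (swap in "zhixiangwei/VLM-1B" for the 1B set).
local_dir = snapshot_download(repo_id="zhixiangwei/VLM-150M", repo_type="dataset")
print(f"Annotations downloaded to {local_dir}")
```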

Integration Instructions

Each JSON entry in VLM-150M and VLM-1B corresponds directly to a DFN dataset UID through matching filenames. To utilize our enhanced annotations:

  • Option 1: Direct Caption Replacement
    Substitute the original DFN captions with our generated annotations (a sketch follows this list)

  • Option 2: Dynamic Data Loading
    Modify the Open CLIP dataloader to load our annotations during training runtime

🔜 Detailed implementation guidance will be published in future releases.
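
Until that guidance is released, here is a hypothetical sketch of Option 1. The field names (`uid`, `caption`) and the one-JSON-file-per-shard layout are illustrative assumptions, not the released schema:

```python
# Hypothetical sketch of Option 1 (direct caption replacement).
# The "uid" and "caption" field names are assumptions, not the released schema.
import json
from pathlib import Path

def replace_captions(dfn_json: Path, hqclip_json: Path, out_json: Path) -> None:
    # Index the HQ-CLIP annotations by UID (assumed field name).
    with hqclip_json.open() as f:
        enhanced = {e["uid"]: e["caption"] for e in json.load(f)}

    # Swap in the enhanced caption wherever a matching UID exists.
    with dfn_json.open() as f:
        entries = json.load(f)
    for entry in entries:
        if entry["uid"] in enhanced:
            entry["caption"] = enhanced[entry["uid"]]

    with out_json.open("w") as f:
        json.dump(entries, f)
```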

Model Loading Instructions

Our uploaded weights are compatible with both open_clip and Hugging Face Transformers.

For open_clip users:

```python
import open_clip

# Initialize the model and image transforms from the Hugging Face Hub.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
tokenizer = open_clip.get_tokenizer(
    'hf-hub:zhixiangwei/vlm150m-hqclip-large-vitb16'
)
```
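
A short usage sketch with the model loaded above, scoring one image against two text prompts (the image path and prompts are placeholders):

```python
# Zero-shot image-text similarity with the model loaded above.
import torch
from PIL import Image

image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)  # placeholder path
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize before taking cosine similarities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability over the two prompts
```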

For Hugging Face Transformers users:

```python
from transformers import AutoModel

# Load the model directly from the Hugging Face Hub.
model = AutoModel.from_pretrained(
    'zhixiangwei/vlm150m-hqclip-large-vitb16'
)
```
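
A matching usage sketch for the Transformers path. It assumes the Hub repo ships a processor config so `AutoProcessor` resolves (our assumption, not something this README states):

```python
# Assumes the Hub repo includes a processor config; "example.jpg" is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("zhixiangwei/vlm150m-hqclip-large-vitb16")
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=Image.open("example.jpg"),
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))  # probability over the two prompts
```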

📝 Citation

```bibtex
@misc{hqclip,
      title={HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models},
      author={Zhixiang Wei and Guangting Wang and Xiaoxiao Ma and Ke Mei and Huaian Chen and Yi Jin and Fengyun Rao},
      year={2025},
      eprint={2507.22431},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.22431},
}
```

🙏 Acknowledgments

We thank the authors of the prior works that inspired this project and provided us with codebases, data, and support!
