SemiCLIP

This is the PyTorch implementation of "Semi-Supervised CLIP Adaptation by Enforcing Semantic and Trapezoidal Consistency" (ICLR 2025).

[ICLR 2025] Semi-Supervised CLIP Adaptation by Enforcing Semantic and Trapezoidal Consistency

Abstract

Vision-language pre-training models, such as CLIP, have demonstrated strong capability in rapidly adapting to downstream tasks through fine-tuning, and have been widely applied across various tasks. However, when the downstream tasks are constrained by limited image-text paired data, CLIP struggles to effectively address the domain gap between the pre-training and the target tasks. To address this limitation, we propose a novel semi-supervised CLIP training method coined SEMICLIP that leverages a small amount of image-text pairs alongside a large volume of images without text descriptions to enhance CLIP’s cross-modal alignment. To effectively utilize unlabeled images, we introduce semantic concept mining to improve task-specific visual representations by matching images with relevant concepts mined from labeled data. Leveraging matched semantic concepts, we construct learnable surrogate captions for unlabeled images and optimize a trapezoidal consistency to regulate the geometric structure of image-text pairs in the representation space. Experimental results demonstrate that our approach significantly improves the adaptability of CLIP in target tasks with limited labeled data, achieving gains ranging from 1.72% – 6.58% for zero-shot classification accuracy and 2.32% – 3.23% for image-text retrieval performance on standard benchmarks.

Method

[Figure: overview of the SemiCLIP method]
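
The abstract describes matching unlabeled images with concepts mined from labeled data. The sketch below illustrates only the general idea of such a matching step. It assumes that images and concepts are compared by cosine similarity between embeddings and that the top-k concepts are kept; this README does not state the actual matching rule, and the names `cosine`, `match_concepts`, and the toy 2-D vectors are hypothetical, not taken from the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_concepts(image_emb, concept_embs, k=2):
    """Return the names of the k concepts most similar to the image embedding.

    Assumption: similarity is cosine similarity and selection is top-k.
    """
    ranked = sorted(
        concept_embs,
        key=lambda name: cosine(image_emb, concept_embs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D embeddings for illustration; real embeddings would come from
# CLIP's image and text encoders.
concepts = {"airport": [1.0, 0.0], "river": [0.0, 1.0], "runway": [0.9, 0.1]}
print(match_concepts([1.0, 0.05], concepts, k=2))
```

In this toy setup the matched concept names could then serve as the starting point for a surrogate caption of the unlabeled image.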

Environment

conda create -n semiclip python=3.7
conda activate semiclip
pip install -r requirements.txt

Dataset

The directory structure for datasets looks like:

Path/To/Dataset
├─ aerial
│  ├─ RSICD
│  ├─ UCM_captions
│  ├─ Sydney_captions
│  ├─......
├─ fashion
│  ├─ fashion200k
│  ├─ FashionGen
│  └─ PolyvoreOutfits
└─ ......
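
Before training, it may help to confirm that the dataset root matches the layout above. The following is a minimal sketch, not part of the repository: `EXPECTED_DIRS` and `missing_dataset_dirs` are hypothetical names, and the list covers only the folders named in the tree, since the entries shown as `......` are elided in this README.

```python
from pathlib import Path

# Sub-directories named in the tree above. The "......" entries are elided
# in the README, so this list is deliberately incomplete.
EXPECTED_DIRS = [
    "aerial/RSICD",
    "aerial/UCM_captions",
    "aerial/Sydney_captions",
    "fashion/fashion200k",
    "fashion/FashionGen",
    "fashion/PolyvoreOutfits",
]


def missing_dataset_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Replace with the actual dataset root.
    for rel in missing_dataset_dirs("Path/To/Dataset"):
        print("missing:", rel)
```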

Start Training

Train our proposed SemiCLIP for different settings.

For remote sensing datasets:

# For supervised pre-training
bash scripts/train_rs_stage1.sh

# For semi-supervised fine-tuning
bash scripts/train_rs_stage2.sh

For fashion datasets:

# For supervised pre-training
bash scripts/train_fashion_stage1.sh

# For semi-supervised fine-tuning
bash scripts/train_fashion_stage2.sh
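
The two stages can also be chained from Python. This is a sketch under stated assumptions: it assumes stage 2 is meant to run after stage 1 has finished successfully, which the ordering above suggests but this README does not state explicitly. `stage_commands` and `run_stages` are hypothetical helpers; only the script paths come from the commands listed above.

```python
import subprocess

# Domain tags used in the script names above ("rs" = remote sensing).
DOMAINS = ("rs", "fashion")


def stage_commands(domain):
    """Build the stage 1 and stage 2 commands for one domain, in order."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    return [
        ["bash", f"scripts/train_{domain}_stage{stage}.sh"]
        for stage in (1, 2)
    ]


def run_stages(domain):
    """Run both stages in order, stopping if a stage exits with an error."""
    for cmd in stage_commands(domain):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so stage 2 is skipped when stage 1 fails.
        subprocess.run(cmd, check=True)
```

Calling `run_stages("rs")` from the repository root would run the two remote sensing scripts back to back.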

Acknowledgement

The SemiCLIP code is based on the implementation of S-CLIP. We thank the authors of S-CLIP for making their code publicly available.
