MyCLIP
Handcraft a multimodal CLIP model from scratch, simulating OpenAI's CLIP.
MyCLIP: A Lightweight Image-Text Contrastive Learning Model
███╗ ███╗██╗ ██╗ ██████╗██╗ ██╗██████╗
████╗ ████║╚██╗ ██╔╝██╔════╝██║ ██║██╔══██╗
██╔████╔██║ ╚████╔╝ ██║ ██║ ██║██████╔╝
██║╚██╔╝██║ ╚██╔╝ ██║ ██║ ██║██╔═══╝
██║ ╚═╝ ██║ ██║ ╚██████╗███████╗██║██║
╚═╝ ╚═╝ ╚═╝ ╚═════╝╚══════╝╚═╝╚═╝
Project Overview
This project implements a simplified version of the CLIP model that combines a Vision Transformer (ViT) image encoder with a Transformer text encoder for contrastive learning between images and text. It is trained and evaluated on the CIFAR-10 dataset, with multi-GPU acceleration and mixed precision supported through accelerate and DeepSpeed.
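The core of any CLIP-style model is a symmetric contrastive (InfoNCE) objective over a batch of paired image/text embeddings. The sketch below shows the standard formulation; the repo's actual implementation lives in utils/loss_utils.py and may differ in details such as the temperature value or whether it is learnable.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product becomes cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix: entry (i, j) = sim(image_i, text_j)
    logits = image_emb @ text_emb.t() / temperature
    # The matching pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

Averaging the two directions makes the loss symmetric, so neither modality's encoder is favored during training.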
Project Structure
MyCLIP/
├── ViT/                        # Standalone ViT implementation
│   └── MyViT.py                # ViT built from scratch; a future update will swap it in as MyCLIP's own vision encoder, replacing the current one
├── checkpoints/                # Saved model checkpoints
├── encoders/                   # Encoder modules
│   ├── vision_encoder.py
│   └── text_encoder.py
├── models/                     # CLIP model definition
│   └── clip_model.py
├── data/                       # Dataset and loading
│   └── dataset.py
├── utils/                      # Utility functions
│   ├── tokenizer_utils.py
│   └── loss_utils.py
├── configs/
│   └── accelerate_config.yaml  # Accelerate/DeepSpeed configuration
├── train.py                    # Training script
├── evaluate.py                 # Evaluation script
├── requirements.txt            # Dependencies
└── README.md                   # Project documentation
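A minimal sketch of what models/clip_model.py plausibly assembles: the two encoders projected into a shared embedding space with a learnable temperature, as in OpenAI's CLIP. The encoder arguments and dimensions here are placeholders, not the repo's actual modules.

```python
import torch
import torch.nn as nn

class TinyCLIP(nn.Module):
    """Illustrative CLIP skeleton: project both encoders into one space.

    `vision_encoder` and `text_encoder` are stand-ins for the repo's
    encoders/vision_encoder.py and encoders/text_encoder.py modules.
    """
    def __init__(self, vision_encoder, text_encoder, embed_dim=512, proj_dim=256):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder
        self.image_proj = nn.Linear(embed_dim, proj_dim)
        self.text_proj = nn.Linear(embed_dim, proj_dim)
        # Learnable log-temperature, initialized to log(1/0.07) as in CLIP
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, tokens):
        img = self.image_proj(self.vision_encoder(images))
        txt = self.text_proj(self.text_encoder(tokens))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # (batch, batch) similarity logits for the contrastive loss
        return self.logit_scale.exp() * img @ txt.t()
```

Normalizing before the dot product keeps the logits bounded by the temperature alone, which stabilizes the contrastive objective.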
Installation
pip install -r requirements.txt
Training
Launch with Accelerate, which supports multi-GPU training and mixed precision (run accelerate config beforehand, or adjust the command-line arguments):
accelerate launch train.py
For DeepSpeed support, prepare a DeepSpeed configuration file first and launch with the command below (the provided accelerate_config.yaml is set up for 1 machine with 2 GPUs and uses DeepSpeed ZeRO-2 acceleration by default):
accelerate launch --config_file ./configs/accelerate_config.yaml train.py
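For reference, an Accelerate config matching the setup described above (1 machine, 2 GPUs, ZeRO-2, mixed precision) would look roughly like the sketch below. This is an assumption about the file's contents based on standard Accelerate config keys; check the repo's configs/accelerate_config.yaml for the authoritative version.

```yaml
# Hypothetical sketch of configs/accelerate_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                  # DeepSpeed ZeRO-2: shards optimizer state and gradients
  gradient_accumulation_steps: 1
mixed_precision: fp16
num_machines: 1                  # single node
num_processes: 2                 # one process per GPU
```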
Evaluation
After training completes, run:
python evaluate.py
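Since training pairs CIFAR-10 images with their class-name texts, evaluation amounts to zero-shot classification: embed the ten class names once, then assign each image to its nearest text embedding by cosine similarity. The helper below sketches that step; how evaluate.py obtains the embeddings from the trained model is not shown here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(image_emb, text_emb, labels):
    """Classify each image by its most similar class-name embedding.

    image_emb: (N, D) image embeddings
    text_emb:  (C, D) one embedding per class name (C=10 for CIFAR-10)
    labels:    (N,) ground-truth class indices
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    preds = (image_emb @ text_emb.t()).argmax(dim=-1)
    return (preds == labels).float().mean().item()
```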
Notes:
Training uses CIFAR-10 images and their corresponding class name text pairs for contrastive learning.
The text tokenizer is based on the Hugging Face BERT tokenizer.
The image encoder is based on the torchvision ViT-Base model (randomly initialized, not pre-trained).
Model checkpoints are automatically saved to the checkpoints/ directory during training.
During evaluation, the final model weights (default: clip_epoch_10.pt) are loaded.
Feedback and discussions are warmly welcome!
