# MetaCLIP

NeurIPS 2025 Spotlight; ICLR 2024 Spotlight; CVPR 2024; EMNLP 2024
<img src="docs/metaclip2_scaling.gif" style="width: 50%; margin: 0 auto; display: block;" />
<img src="docs/metaclip2_teaser.png" style="width: 80%; margin: 0 auto; display: block;" />

After years of advancements in English-centric CLIP development, Meta CLIP 2 takes the next step: scaling CLIP to worldwide data. The effort addresses two long-standing challenges:
- large-scale non-English data curation pipelines are largely undeveloped;
- the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP.
With a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can mutually benefit and elevate each other, achieving SoTA multilingual performance.
## Updates
- 11/21/2025: distilled models and training/eval code for Meta CLIP 2 are out.
- 09/18/2025: 🔥 paper Meta CLIP 2 (worldwide) accepted by NeurIPS 2025 as a spotlight presentation.
- 08/25/2025: Meta CLIP 2 (worldwide) is on open_clip and Huggingface.
- 07/29/2025: paper Meta CLIP 2: A Worldwide Scaling Recipe (aka Meta CLIP 2 worldwide) is released.
- 12/10/2024: Meta CLIP 1.2 (ViT-H/14) trained with Altogether synthetic captions is released.
- 10/09/2024: Altogether: Image Captioning via Re-aligning Alt-text (aka Meta CLIP 1.2) is accepted by EMNLP 2024 with code released.
- 08/15/2024: v0.1 released.
- 04/25/2024: paper MoDE: CLIP Data Experts via Clustering is accepted by CVPR 2024 with code released.
- 01/18/2024: add code for building metadata.
- 01/16/2024: paper Demystifying CLIP Data accepted by ICLR 2024 as a spotlight presentation.
- 12/25/2023: Huggingface Space demo and Colab released.
- 12/21/2023: Meta CLIP 1.1 (ViT-G/14) released.
- 09/28/2023: initial release.
## Quick Start
The pre-trained MetaCLIP models can be used in the following ways:
<details>
<summary>mini_clip (this repo)</summary>

```python
import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer

model, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenize = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

</details>
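The last three lines of the snippet are CLIP's standard zero-shot scoring: cosine similarity between L2-normalized embeddings, multiplied by the logit scale (about 100 at convergence) and passed through a softmax. A dependency-light sketch of just that scoring step, using NumPy and random stand-in features instead of real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: 1 image and 3 candidate captions, 512-dim (CLIP-style).
image_features = rng.normal(size=(1, 512))
text_features = rng.normal(size=(3, 512))

# L2-normalize so the dot product below is a cosine similarity in [-1, 1].
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# Scale similarities by the logit scale before the softmax; without it,
# cosine similarities are too flat to yield a peaked distribution.
logits = 100.0 * image_features @ text_features.T

# Numerically stable softmax over the caption axis.
text_probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
text_probs /= text_probs.sum(axis=-1, keepdims=True)

print("Label probs:", text_probs)
```

Each row of `text_probs` sums to 1 and ranks the captions for one image; with real features the highest entry is the zero-shot prediction.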
<details>
<summary>Huggingface</summary>

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Meta CLIP 1
processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

# Meta CLIP 2
# model = AutoModel.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
# processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image  # image-text similarity scores
text_probs = logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
```

</details>
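In practice, zero-shot classification templates each class name into one or more captions before tokenizing; the `["a diagram", "a dog", "a cat"]` list above is the degenerate one-caption-per-class case. A small sketch of that templating step — the templates here are illustrative, not the prompt ensembles used in the Meta CLIP papers:

```python
# Illustrative prompt templates; real zero-shot evaluation ensembles
# are larger and task-specific.
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a close-up photo of a {}.",
]

def build_prompts(class_names, templates=TEMPLATES):
    """Expand each class name into one caption per template.

    Returns a flat list of len(class_names) * len(templates) captions,
    grouped by class, ready to pass to the tokenizer. At scoring time the
    per-template text embeddings of a class are typically averaged (and
    re-normalized) into a single class embedding.
    """
    return [t.format(name) for name in class_names for t in templates]

prompts = build_prompts(["dog", "cat"])
print(prompts)
```

The resulting six captions (three per class, dog first) would replace the three-item list in either quickstart snippet.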
## Pre-trained Models

Meta CLIP closely adheres to the OpenAI CLIP training and model setup (you mostly just need to swap in the weights), to promote rigorous ablation studies and advance scientific understanding, as in the old "era of ImageNet".
### Meta CLIP 2
| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|:--------------------|:-------------|:----------|:---------:|:---------:|:---------:|:--------------:|
| ViT-H-14-quickgelu-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 57.4 |
| ViT-H-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 58.2 |
| ViT-bigG-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 60.7 |
| ViT-bigG-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 62.0 |
### Meta CLIP 2 Distilled
| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|:--------------------|:-------------|:----------|:---------:|:---------:|:---------:|:---------:|
| ViT-S-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 46.9 |
| ViT-S-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 47.4 |
| ViT-S-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 42.8 |
| ViT-M-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.3 |
| ViT-M-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.7 |
| ViT-M-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.7 |
| ViT-B-32-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.1 |
| ViT-B-32-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.0 |
| ViT-B-32-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.4 |
| ViT-B-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 50.9 |
| ViT-B-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 51.5 |
| ViT-L-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 56.5 |
### Meta CLIP 1
| model_name | pretrained | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. |
|:--------------------|:-------------|:---------:|:---------:|:---------:|:---------:|:--------------:|
| ViT-B-32-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 65.5 |
| ViT-B-16-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 70.8 |
| ViT-L-14-quickgelu | metaclip_400m | data card | 12.8B | 224 |
