
NeurIPS 2025 Spotlight; ICLR 2024 Spotlight; CVPR 2024; EMNLP 2024


Meta CLIP

FAIR, Meta


<img src="docs/metaclip2_scaling.gif" style="width: 50%; margin: 0 auto; display: block;" /> <img src="docs/metaclip2_teaser.png" style="width: 80%; margin: 0 auto; display: block;" />

After years of advancements in English-centric CLIP development, Meta CLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing challenges:

  • large-scale non-English data curation pipelines are largely undeveloped;
  • the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP.

With a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can mutually benefit and elevate each other, achieving SoTA multilingual performance.

Updates

Quick Start

The pre-trained MetaCLIP models can be loaded in either of the following ways:

<details> <summary>mini_clip (this repo)</summary>

```python
import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer


model, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenize = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
</details> <details> <summary>Huggingface</summary>

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel


# Meta CLIP 1
processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

# Meta CLIP 2
# model = AutoModel.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
# processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    text_probs = logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
```
</details>
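Regardless of which loader is used, the scoring step in both snippets is the same math: L2-normalize both embeddings so the dot product becomes cosine similarity, scale by CLIP's logit scale (about 100 at convergence), and softmax over the text candidates. A minimal NumPy sketch with random stand-in features (the shapes and the 100.0 scale come from the snippets above; the feature values themselves are dummies, not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs: 1 image embedding, 3 text embeddings.
image_features = rng.normal(size=(1, 512))
text_features = rng.normal(size=(3, 512))

# L2-normalize so the dot product equals cosine similarity.
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# Temperature-scaled logits, as in the snippets above.
logits = 100.0 * image_features @ text_features.T

# Numerically stable softmax over the text candidates.
text_probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
text_probs /= text_probs.sum(axis=-1, keepdims=True)

print("Label probs:", text_probs)  # one row of probabilities summing to 1
```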

Pre-trained Models

Meta CLIP closely adheres to OpenAI CLIP's training and model setup (you mostly just need to replace the weights). This is deliberate: it promotes rigorous ablation studies and advances scientific understanding, as in the old "era of ImageNet".
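One concrete piece of that OpenAI-CLIP compatibility is the `-quickgelu` suffix in the model names below: those checkpoints use OpenAI CLIP's QuickGELU activation, x·sigmoid(1.702x), a fast approximation of the exact GELU used in most newer ViTs. A small illustrative sketch of the difference (pure NumPy for illustration, not the repo's implementation):

```python
import numpy as np
from math import erf


def quick_gelu(x):
    # OpenAI CLIP's QuickGELU: x * sigmoid(1.702 * x).
    return x / (1.0 + np.exp(-1.702 * x))

def gelu(x):
    # Exact GELU via the Gaussian CDF (erf form).
    return np.array([v * 0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in np.atleast_1d(x)])

x = np.linspace(-3, 3, 7)
print(np.round(quick_gelu(x), 3))
print(np.round(gelu(x), 3))  # close to QuickGELU, but not identical
```

The two curves agree to within about 0.02 everywhere, which is why the weights are interchangeable only when the activation matches the one used at training time.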

Meta CLIP 2

| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|:--------------------|:-------------|:----------|:---------:|:---------:|:---------:|:--------------:|
| ViT-H-14-quickgelu-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 57.4 |
| ViT-H-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 58.2 |
| ViT-bigG-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 60.7 |
| ViT-bigG-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 62.0 |

Meta CLIP 2 Distilled

| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|:--------------------|:-------------|:----------|:---------:|:---------:|:---------:|:---------:|
| ViT-S-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 46.9 |
| ViT-S-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 47.4 |
| ViT-S-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 42.8 |
| ViT-M-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.3 |
| ViT-M-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.7 |
| ViT-M-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.7 |
| ViT-B-32-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.1 |
| ViT-B-32-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.0 |
| ViT-B-32-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.4 |
| ViT-B-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 50.9 |
| ViT-B-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 51.5 |
| ViT-L-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 56.5 |
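As the distilled listings show, the tokenizer depends on the model variant: the mT5-based checkpoints (`@mT5WorldWideCLIP`) pair with the multilingual SigLIP tokenizer, while the rest use `facebook/xlm-v-base`. A small hypothetical helper (the function name and the suffix rule are mine, inferred only from the model listings above):

```python
def tokenizer_for(model_name: str) -> str:
    """Pick the Hugging Face tokenizer matching a Meta CLIP 2 model name.

    Mapping inferred from the model tables: '@mT5WorldWideCLIP' variants
    use the multilingual SigLIP tokenizer; all others use XLM-V.
    """
    if model_name.endswith("@mT5WorldWideCLIP"):
        return "google/siglip-so400m-patch16-256-i18n"
    return "facebook/xlm-v-base"


print(tokenizer_for("ViT-B-32-mT5-worldwide@mT5WorldWideCLIP"))
print(tokenizer_for("ViT-H-14-quickgelu-worldwide@WorldWideCLIP"))
```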

Meta CLIP 1

| model_name | pretrained | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. |
|:--------------------|:-------------|:---------:|:---------:|:---------:|:---------:|:--------------:|
| ViT-B-32-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 65.5 |
| ViT-B-16-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 70.8 |
| ViT-L-14-quickgelu | metaclip_400m | data card | 12.8B | 224 | | |
