
NeurIPS 2025 Spotlight; ICLR 2024 Spotlight; CVPR 2024; EMNLP 2024


Meta CLIP

FAIR, Meta


<img src="docs/metaclip2_scaling.gif" style="width: 50%; margin: 0 auto; display: block;" /> <img src="docs/metaclip2_teaser.png" style="width: 80%; margin: 0 auto; display: block;" />

After years of advancements in English-centric CLIP development, Meta CLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing challenges:

  • large-scale non-English data curation pipelines are largely undeveloped;
  • the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP.

With a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can mutually benefit and elevate each other, achieving SoTA multilingual performance.

Updates

Quick Start

The pre-trained MetaCLIP models can be loaded in either of the following ways:

<details> <summary>mini_clip (this repo)</summary>

```python
import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer


model, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenize = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
</details> <details> <summary>Huggingface</summary>

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel


# Meta CLIP 1
processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

# Meta CLIP 2
# model = AutoModel.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
# processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    text_probs = logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
```
</details>
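Regardless of which loader is used, the scoring step in both snippets is the same math: L2-normalize both embeddings so the dot product becomes cosine similarity, scale by CLIP's logit scale (about 100 at convergence), and softmax over the text candidates. A minimal NumPy sketch with random stand-in features (the shapes and the 100.0 scale come from the snippets above; the feature values themselves are dummies, not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs: 1 image embedding, 3 text embeddings.
image_features = rng.normal(size=(1, 512))
text_features = rng.normal(size=(3, 512))

# L2-normalize so the dot product equals cosine similarity.
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# Temperature-scaled logits, as in the snippets above.
logits = 100.0 * image_features @ text_features.T

# Numerically stable softmax over the text candidates.
text_probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
text_probs /= text_probs.sum(axis=-1, keepdims=True)

print("Label probs:", text_probs)  # one row of probabilities summing to 1
```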

Pre-trained Models

Meta CLIP closely adheres to OpenAI CLIP's training and model setup (you mostly just need to replace the weights). This is deliberate: it promotes rigorous ablation studies and advances scientific understanding, as in the old "era of ImageNet".
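One concrete piece of that OpenAI-CLIP compatibility is the `-quickgelu` suffix in the model names below: those checkpoints use OpenAI CLIP's QuickGELU activation, x·sigmoid(1.702x), a fast approximation of the exact GELU used in most newer ViTs. A small illustrative sketch of the difference (pure NumPy for illustration, not the repo's implementation):

```python
import numpy as np
from math import erf


def quick_gelu(x):
    # OpenAI CLIP's QuickGELU: x * sigmoid(1.702 * x).
    return x / (1.0 + np.exp(-1.702 * x))

def gelu(x):
    # Exact GELU via the Gaussian CDF (erf form).
    return np.array([v * 0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in np.atleast_1d(x)])

x = np.linspace(-3, 3, 7)
print(np.round(quick_gelu(x), 3))
print(np.round(gelu(x), 3))  # close to QuickGELU, but not identical
```

The two curves agree to within about 0.02 everywhere, which is why the weights are interchangeable only when the activation matches the one used at training time.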

Meta CLIP 2

| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|:--------------------|:-------------|:----------|:---------:|:---------:|:---------:|:--------------:|
| ViT-H-14-quickgelu-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 57.4 |
| ViT-H-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 58.2 |
| ViT-bigG-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 60.7 |
| ViT-bigG-14-378-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 378 | 62.0 |

Meta CLIP 2 Distilled

| model_name | pretrained | Tokenizer | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|:--------------------|:-------------|:----------|:---------:|:---------:|:---------:|:---------:|
| ViT-S-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 46.9 |
| ViT-S-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 47.4 |
| ViT-S-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 42.8 |
| ViT-M-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.3 |
| ViT-M-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.7 |
| ViT-M-16-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.7 |
| ViT-B-32-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 49.1 |
| ViT-B-32-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 50.0 |
| ViT-B-32-mT5-worldwide@mT5WorldWideCLIP | metaclip2_worldwide | google/siglip-so400m-patch16-256-i18n | Online Curation | 29B | 224 | 48.4 |
| ViT-B-16-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 50.9 |
| ViT-B-16-384-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 384 | 51.5 |
| ViT-L-14-worldwide@WorldWideCLIP | metaclip2_worldwide | facebook/xlm-v-base | Online Curation | 29B | 224 | 56.5 |
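As the distilled listings show, the tokenizer depends on the model variant: the mT5-based checkpoints (`@mT5WorldWideCLIP`) pair with the multilingual SigLIP tokenizer, while the rest use `facebook/xlm-v-base`. A small hypothetical helper (the function name and the suffix rule are mine, inferred only from the model listings above):

```python
def tokenizer_for(model_name: str) -> str:
    """Pick the Hugging Face tokenizer matching a Meta CLIP 2 model name.

    Mapping inferred from the model tables: '@mT5WorldWideCLIP' variants
    use the multilingual SigLIP tokenizer; all others use XLM-V.
    """
    if model_name.endswith("@mT5WorldWideCLIP"):
        return "google/siglip-so400m-patch16-256-i18n"
    return "facebook/xlm-v-base"


print(tokenizer_for("ViT-B-32-mT5-worldwide@mT5WorldWideCLIP"))
print(tokenizer_for("ViT-H-14-quickgelu-worldwide@WorldWideCLIP"))
```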

Meta CLIP 1

| model_name | pretrained | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. |
|:--------------------|:-------------|:---------:|:---------:|:---------:|:---------:|:--------------:|
| ViT-B-32-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 65.5 |
| ViT-B-16-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 70.8 |
| ViT-L-14-quickgelu | metaclip_400m | data card | 12.8B | 224 | | |
