
vjepa2

PyTorch code and models for V-JEPA 2 self-supervised learning from video.


🆕 [2026-03-16]: :fire: V-JEPA 2.1 is released :fire: A new family of models trained with a novel recipe that learns high-quality and temporally consistent dense features!

[2025-06-25]: V-JEPA 2 is released. [Blog]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Meta FAIR

Mahmoud Assran*, Adrien Bardes*, David Fan*, Quentin Garrido*, Russell Howes*, Mojtaba Komeili*, Matthew Muckley*, Ammar Rizvi*, Claire Roberts*, Koustuv Sinha*, Artem Zholus*, Sergio Arnaud*, Abha Gejji*, Ada Martin*, Francois Robert Hogan*, Daniel Dugas*, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier*, Yann LeCun*, Michael Rabbat*, Nicolas Ballas*

*Core Team

[Paper] [Blog] [BibTex]

Official PyTorch codebase for V-JEPA 2, V-JEPA 2-AC, and V-JEPA 2.1.

V-JEPA 2 is a self-supervised approach to training video encoders on internet-scale video data that attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model, post-trained from V-JEPA 2 using a small amount of robot trajectory interaction data, that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.

<p align="center"> <img src="assets/flowchart.png" width=100%> </p>

V-JEPA 2.1 Pre-training

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

[Paper] [BibTex]

V-JEPA 2.1 improves the training recipe to focus on learning high-quality and temporally consistent dense features, as highlighted by PCA visualizations:

<p align="center"> <img src="assets/teaser_screenshot_5dice.png" width=100%> </p>

The V-JEPA 2.1 approach leverages: (1) a Dense Predictive Loss, a masking-based self-supervision objective in which all tokens (both visible/context and masked) contribute to the training loss; (2) Deep Self-Supervision, which applies the self-supervised loss at multiple intermediate representations of the encoder; and (3) Multi-Modal Tokenizers for images and videos. We further show that the approach benefits from (4) model and data scaling.
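As a rough illustration (not the official implementation), the dense predictive loss can be viewed as averaging a per-token regression loss over every token position, masked and visible alike, rather than over masked positions only. The feature vectors and helper functions below are hypothetical:

```python
# Hypothetical sketch of a dense predictive loss: the regression target is
# applied at EVERY token position (visible/context and masked), unlike a
# classic masked-prediction loss that scores masked positions only.

def l1(u, v):
    """Mean absolute error between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

def dense_predictive_loss(predicted, target):
    """Average per-token L1 loss over all token positions."""
    return sum(l1(p, t) for p, t in zip(predicted, target)) / len(predicted)

def masked_only_loss(predicted, target, mask):
    """For contrast: average loss over masked positions only."""
    terms = [l1(p, t) for p, t, m in zip(predicted, target, mask) if m]
    return sum(terms) / len(terms)

pred = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
tgt  = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]
mask = [False, True, False]  # only the middle token is masked

print(dense_predictive_loss(pred, tgt))   # 0.5 (all three tokens contribute)
print(masked_only_loss(pred, tgt, mask))  # 0.0 (only the masked token)
```

Here the dense variant picks up a training signal from the two visible tokens that the masked-only variant would ignore entirely.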

<p align="center"> <img src="assets/architecture_vjepa2_1.jpg" width=100%> </p>

V-JEPA 2.1 performance across dense and global prediction tasks:

<p align="center"> <img src="assets/bars_teaser_tikz-1.png" width=100%> </p>

V-JEPA 2 Pre-training

(Top) The encoder and predictor are pre-trained through self-supervised learning from video using a masked latent feature prediction objective, leveraging abundant natural videos to bootstrap physical world understanding and prediction. (Bottom) Performance of V-JEPA 2 on downstream understanding and prediction tasks.

<img align="left" src="https://github.com/user-attachments/assets/914942d8-6a1e-409d-86ff-ff856b7346ab" width=65%> 

<table> <tr> <th colspan="1">Benchmark</th> <th colspan="1">V-JEPA 2</th> <th colspan="1">Previous Best</th> </tr> <tr> <td>EK100</td> <td>39.7%</td> <td>27.6% (PlausiVL)</td> </tr> <tr> <td>SSv2 (Probe)</td> <td>77.3%</td> <td>69.7% (InternVideo2-1B)</td> </tr> <tr> <td>Diving48 (Probe)</td> <td>90.2%</td> <td>86.4% (InternVideo2-1B)</td> </tr> <tr> <td>MVP (Video QA)</td> <td>44.5%</td> <td>39.9% (InternVL-2.5)</td> </tr> <tr> <td>TempCompass (Video QA)</td> <td>76.9%</td> <td>75.3% (Tarsier 2)</td> </tr> </table>

V-JEPA 2-AC Post-training

(Top) After post-training with a small amount of robot data, we can deploy the model on a robot arm in new environments, and tackle foundational tasks like reaching, grasping, and pick-and-place by planning from image goals. (Bottom) Performance on robot manipulation tasks using a Franka arm, with input provided through a monocular RGB camera.
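To illustrate the idea of planning from goals by minimizing a predicted distance-to-goal (the real system plans in V-JEPA 2-AC's latent space from goal images; everything below is a hypothetical stand-in), here is a toy one-step sampling planner over a one-dimensional dynamics model:

```python
import random

def predict(state, action):
    """Stand-in dynamics model; the real planner rolls out V-JEPA 2-AC latents."""
    return state + action

def energy(state, goal):
    """Distance between a predicted state and the goal
    (a goal-image embedding in the actual system)."""
    return abs(state - goal)

def plan(state, goal, n_samples=500, low=-1.0, high=1.0):
    """Sample candidate actions, score each by predicted distance-to-goal,
    and return the best: a one-step random-shooting planner."""
    candidates = [random.uniform(low, high) for _ in range(n_samples)]
    return min(candidates, key=lambda a: energy(predict(state, a), goal))

random.seed(0)
best = plan(state=0.0, goal=0.3)
print(best)  # an action close to 0.3
```

Scoring imagined outcomes against a goal, rather than learning a task-specific policy, is what lets the same post-trained model tackle reaching, grasping, and pick-and-place without task-specific training.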

<img align="left" src="https://github.com/user-attachments/assets/c5d42221-0102-4216-911d-061a4369a805" width=65%> 

<table> <tr> <th colspan="1"></th> <th colspan="1"></th> <th colspan="2">Grasp</th> <th colspan="2">Pick-and-Place</th> </tr> <tr> <th colspan="1">Method</th> <th colspan="1">Reach</th> <th colspan="1">Cup</th> <th colspan="1">Box</th> <th colspan="1">Cup</th> <th colspan="1">Box</th> </tr> <tr> <td>Octo</td> <td>100%</td> <td>10%</td> <td>0%</td> <td>10%</td> <td>10%</td> </tr> <tr> <td>Cosmos</td> <td>80%</td> <td>0%</td> <td>20%</td> <td>0%</td> <td>0%</td> </tr> <tr> <td>VJEPA 2-AC</td> <td>100%</td> <td>60%</td> <td>20%</td> <td>80%</td> <td>50%</td> </tr> </table>

Models

V-JEPA 2 and V-JEPA 2.1

HuggingFace

See our HuggingFace collection for V-JEPA 2.

V-JEPA 2 Pretrained Checkpoints

<table> <tr> <th colspan="1">Model</th> <th colspan="1">#Parameters</th> <th colspan="1">Resolution</th> <th colspan="1">Download Link</th> <th colspan="1">Pretraining Config</th> </tr> <tr> <td>ViT-L/16</td> <td>300M</td> <td>256</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vitl.pt">checkpoint</a></td> <td><a href="configs/train/vitl16">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>600M</td> <td>256</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vith.pt">checkpoint</a></td> <td><a href="configs/train/vith16/">configs</a></td> </tr> <tr> <td>ViT-g/16</td> <td>1B</td> <td>256</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vitg.pt">checkpoint</a></td> <td><a href="configs/train/vitg16">configs</a></td> </tr> <tr> <td>ViT-g/16<sub>384</sub></td> <td>1B</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vitg-384.pt">checkpoint</a></td> <td><a href="configs/train/vitg16">configs</a></td> </tr> </table>

V-JEPA 2.1 Pretrained Checkpoints

<table> <tr> <th colspan="1">Model</th> <th colspan="1">#Parameters</th> <th colspan="1">Resolution</th> <th colspan="1">Download Link</th> <th colspan="1">Pretraining Config</th> </tr> <tr> <td>ViT-B/16</td> <td>80M</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitb_dist_vitG_384.pt">checkpoint</a></td> <td><a href="configs/train_2_1/vitb16">configs</a></td> </tr> <tr> <td>ViT-L/16</td> <td>300M</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitl_dist_vitG_384.pt">checkpoint</a></td> <td><a href="configs/train_2_1/vitl16">configs</a></td> </tr> <tr> <td>ViT-g/16</td> <td>1B</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitg_384.pt">checkpoint</a></td> <td><a href="configs/train_2_1/vitg16">configs</a></td> </tr> <tr> <td>ViT-G/16</td> <td>2B</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitG_384.pt">checkpoint</a></td> <td><a href="configs/train_2_1/vitG16">configs</a></td> </tr> </table>

Pretrained backbones (via PyTorch Hub)

Please install PyTorch, timm, and einops locally, then run the following to load each model. Installing PyTorch with CUDA support is strongly recommended.

```python
import torch

# Preprocessor
processor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_preprocessor')

# V-JEPA 2 models
vjepa2_vit_large = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_large')
vjepa2_vit_huge = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_huge')
vjepa2_vit_giant = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant')
vjepa2_vit_giant_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant_384')

# V-JEPA 2.1 models
vjepa2_1_vit_base_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_base_384')
vjepa2_1_vit_large_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_large_384')
vjepa2_1_vit_giant_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_giant_384')
vjepa2_1_vit_gigantic_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_gigantic_384')
```
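As a quick sanity check on feature-map sizes, these encoders tokenize a clip into spatiotemporal patches. Assuming a 16×16 spatial patch and a tubelet depth of 2 frames (a common ViT-/16 video setup; verify against the pretraining configs), the token count works out as:

```python
def num_tokens(frames, height, width, patch=16, tubelet=2):
    """Tokens produced by a ViT-style video encoder: one token per
    tubelet (stack of `tubelet` frames) per spatial patch."""
    return (frames // tubelet) * (height // patch) * (width // patch)

# A 64-frame clip at the 256px pretraining resolution:
print(num_tokens(64, 256, 256))  # 8192 tokens
# The same clip at the 384px resolution used by the 384 variants:
print(num_tokens(64, 384, 384))  # 18432 tokens
```

This is useful for budgeting memory before loading the larger checkpoints, since attention cost grows quadratically in the token count.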

Pretrained checkpoints on Hugging Face

You can also use our pretrained V-JEPA 2 checkpoints on Hugging Face.

```python
from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitg-fpc64-256"
# Other available checkpoints:
# facebook/vjepa2-vitl-fpc64-256
# facebook/vjepa2-vith-fpc64-256
# facebook/vjepa2-vitg-fpc64-256
# facebook/vjepa2-vitg-fpc64-384

model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
```
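The repo names above follow a regular pattern: architecture, frames per clip (`fpc`), and input resolution. A small hypothetical helper to pick a name apart:

```python
def parse_repo(repo):
    """Split a checkpoint name like 'facebook/vjepa2-vitg-fpc64-256'
    into organization, model, architecture, frames per clip, and resolution."""
    org, name = repo.split("/")
    model, arch, fpc, res = name.split("-")
    return {
        "org": org,
        "model": model,
        "arch": arch,                                   # vitl / vith / vitg
        "frames_per_clip": int(fpc.removeprefix("fpc")),
        "resolution": int(res),
    }

print(parse_repo("facebook/vjepa2-vitg-fpc64-384"))
# {'org': 'facebook', 'model': 'vjepa2', 'arch': 'vitg',
#  'frames_per_clip': 64, 'resolution': 384}
```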

Evaluation Attentive Probes

We share the trained attentive probes for two of our visual understanding evals (Something-Something v2 and Diving48) and the action anticipation eval EPIC-KITCHENS-100.

<table> <tr> <th colspan="1">Model</th> <th colspan="4">SSv2</th> <th colspan="4">Diving48</th> <th colspan="4">EK100</th> </tr> <tr> <th colspan="1"></th> <th colspan="1">Checkpoint</th> <th colspan="1">Training Config</th> <th colspan="1">Inference Config</th> <th colspan="1">Result</th> <th colspan="1">Checkpoint</th> <th colspan="1">Training Config</th> <th colspan="1">Inference Config</th> <th colspan="1">Result</th> <th colspan="1">Checkpoint</th> <th colspan="1">Training Config</th> <th colspan="1">Inference Config</th> <th colspan="1">Result</th> </tr> </table>