V-JEPA 2
PyTorch code and models for V-JEPA 2 self-supervised learning from video.
🆕 [2026-03-16]: :fire: V-JEPA 2.1 is released :fire: A new family of models trained with a novel recipe that learns high-quality and temporally consistent dense features!
[2025-06-25]: V-JEPA 2 is released. [Blog]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Meta FAIR
Mahmoud Assran∗, Adrien Bardes∗, David Fan∗, Quentin Garrido∗, Russell Howes∗, Mojtaba Komeili∗, Matthew Muckley∗, Ammar Rizvi∗, Claire Roberts∗, Koustuv Sinha∗, Artem Zholus*, Sergio Arnaud*, Abha Gejji*, Ada Martin*, Francois Robert Hogan*, Daniel Dugas*, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier*, Yann LeCun*, Michael Rabbat*, Nicolas Ballas*
*Core Team
Official PyTorch codebase for V-JEPA 2, V-JEPA 2-AC, and V-JEPA 2.1.
V-JEPA 2 is a self-supervised approach to training video encoders, using internet-scale video data, that attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
<p align="center"> <img src="assets/flowchart.png" width=100%> </p>

V-JEPA 2.1 Pre-training
Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes
V-JEPA 2.1 improves the training recipe to focus on learning high-quality and temporally consistent dense features, as highlighted by PCA visualizations:
<p align="center"> <img src="assets/teaser_screenshot_5dice.png" width=100%> </p>

The V-JEPA 2.1 approach leverages: (1) Dense Predictive Loss, a masking-based self-supervision objective in which all tokens (both visible/context and masked tokens) contribute to the self-supervised training loss; (2) Deep Self-Supervision, which applies the self-supervised loss at multiple intermediate representations of the encoder; (3) Multi-Modal Tokenizers for images and videos; and we show that the approach benefits from (4) Model and data scaling.
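To make ingredient (1) concrete, here is a toy, pure-Python illustration of the difference between a loss computed only over masked tokens and a dense loss computed over all token positions. This is not the actual implementation: `pred` and `target` are stand-ins for per-token predicted and target features, and the L1 distance is used purely for illustration.

```python
def masked_only_loss(pred, target, mask):
    """Average L1 error over masked positions only (classic masked-prediction objective)."""
    errs = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs)

def dense_loss(pred, target):
    """Average L1 error over *all* token positions, visible and masked alike."""
    errs = [abs(p - t) for p, t in zip(pred, target)]
    return sum(errs) / len(errs)

# With two tokens, one masked: the dense loss also penalizes the visible token.
print(masked_only_loss([1.0, 2.0], [1.0, 4.0], [False, True]))  # 2.0
print(dense_loss([1.0, 2.0], [1.0, 4.0]))                       # 1.0
```

The dense variant gives every token a training signal, which is the property the recipe exploits to obtain higher-quality dense features.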
<p align="center"> <img src="assets/architecture_vjepa2_1.jpg" width=100%> </p>

V-JEPA 2.1 performance across dense and global prediction tasks:
<p align="center"> <img src="assets/bars_teaser_tikz-1.png" width=100%> </p>

V-JEPA 2 Pre-training
(Top) The encoder and predictor are pre-trained through self-supervised learning from video using a masked latent feature prediction objective, leveraging abundant natural videos to bootstrap physical world understanding and prediction. (Bottom) Performance of V-JEPA 2 on downstream understanding and prediction tasks.
<img align="left" src="https://github.com/user-attachments/assets/914942d8-6a1e-409d-86ff-ff856b7346ab" width=65%>
<table> <tr> <th colspan="1">Benchmark</th> <th colspan="1">V-JEPA 2</th> <th colspan="1">Previous Best</th> </tr> <tr> <td>EK100</td> <td>39.7%</td> <td>27.6% (PlausiVL)</td> </tr> <tr> <td>SSv2 (Probe)</td> <td>77.3%</td> <td>69.7% (InternVideo2-1B)</td> </tr> <tr> <td>Diving48 (Probe)</td> <td>90.2%</td> <td>86.4% (InternVideo2-1B)</td> </tr> <tr> <td>MVP (Video QA)</td> <td>44.5%</td> <td>39.9% (InternVL-2.5)</td> </tr> <tr> <td>TempCompass (Video QA)</td> <td>76.9%</td> <td>75.3% (Tarsier 2)</td> </tr> </table>

V-JEPA 2-AC Post-training
(Top) After post-training with a small amount of robot data, we can deploy the model on a robot arm in new environments, and tackle foundational tasks like reaching, grasping, and pick-and-place by planning from image goals. (Bottom) Performance on robot manipulation tasks using a Franka arm, with input provided through a monocular RGB camera.
<img align="left" src="https://github.com/user-attachments/assets/c5d42221-0102-4216-911d-061a4369a805" width=65%>
<table> <tr> <th colspan="1"></th> <th colspan="1"></th> <th colspan="2">Grasp</th> <th colspan="2">Pick-and-Place</th> </tr> <tr> <th colspan="1">Method</th> <th colspan="1">Reach</th> <th colspan="1">Cup</th> <th colspan="1">Box</th> <th colspan="1">Cup</th> <th colspan="1">Box</th> </tr> <tr> <td>Octo</td> <td>100%</td> <td>10%</td> <td>0%</td> <td>10%</td> <td>10%</td> </tr> <tr> <td>Cosmos</td> <td>80%</td> <td>0%</td> <td>20%</td> <td>0%</td> <td>0%</td> </tr> <tr> <td>V-JEPA 2-AC</td> <td>100%</td> <td>60%</td> <td>20%</td> <td>80%</td> <td>50%</td> </tr> </table>

Models
V-JEPA 2 and V-JEPA 2.1
HuggingFace
See our HuggingFace collection for V-JEPA 2.
V-JEPA 2 Pretrained Checkpoints
<table> <tr> <th colspan="1">Model</th> <th colspan="1">#Parameters</th> <th colspan="1">Resolution</th> <th colspan="1">Download Link</th> <th colspan="1">Pretraining Config</th> </tr> <tr> <td>ViT-L/16</td> <td>300M</td> <td>256</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vitl.pt">checkpoint</a></td> <td><a href="configs/train/vitl16">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>600M</td> <td>256</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vith.pt">checkpoint</a></td> <td><a href="configs/train/vith16/">configs</a></td> </tr> <tr> <td>ViT-g/16</td> <td>1B</td> <td>256</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vitg.pt">checkpoint</a></td> <td><a href="configs/train/vitg16">configs</a></td> </tr> <tr> <td>ViT-g/16<sub>384</sub></td> <td>1B</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vitg-384.pt">checkpoint</a></td> <td><a href="configs/train/vitg16">configs</a></td> </tr> </table>

V-JEPA 2.1 Pretrained Checkpoints
<table> <tr> <th colspan="1">Model</th> <th colspan="1">#Parameters</th> <th colspan="1">Resolution</th> <th colspan="1">Download Link</th> <th colspan="1">Pretraining Config</th> </tr> <tr> <td>ViT-B/16</td> <td>80M</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitb_dist_vitG_384.pt">checkpoint</a></td> <td><a href="configs/train_2_1/vitb16">configs</a></td> </tr> <tr> <td>ViT-L/16</td> <td>300M</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitl_dist_vitG_384.pt">checkpoint</a></td> <td><a href="configs/train_2_1/vitl16">configs</a></td> </tr> <tr> <td>ViT-g/16</td> <td>1B</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitg_384.pt">checkpoint</a></td> <td><a href="configs/train_2_1/vitg16">configs</a></td> </tr> <tr> <td>ViT-G/16</td> <td>2B</td> <td>384</td> <td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitG_384.pt">checkpoint</a></td> <td><a href="configs/train_2_1/vitG16">configs</a></td> </tr> </table>

Pretrained backbones (via PyTorch Hub)
Please install PyTorch, timm, and einops locally, then run the following to load each model. Installing PyTorch with CUDA support is strongly recommended.
```python
import torch

# Preprocessor
processor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_preprocessor')

# V-JEPA 2 models
vjepa2_vit_large = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_large')
vjepa2_vit_huge = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_huge')
vjepa2_vit_giant = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant')
vjepa2_vit_giant_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant_384')

# V-JEPA 2.1 models
vjepa2_1_vit_base_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_base_384')
vjepa2_1_vit_large_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_large_384')
vjepa2_1_vit_giant_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_giant_384')
vjepa2_1_vit_gigantic_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_gigantic_384')
```
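As a quick sanity check when sizing inputs, you can estimate how many spatiotemporal tokens the encoder will produce for a given clip. The helper below is ours, not part of the codebase; it assumes the 16x16 spatial patch size implied by the model names (e.g. ViT-L/16) and non-overlapping tubelets of 2 frames.

```python
def num_patch_tokens(frames, height, width, patch_size=16, tubelet_size=2):
    """Token count for a clip of shape (frames, height, width), assuming
    non-overlapping tubelets of `tubelet_size` frames and square spatial
    patches of `patch_size` pixels."""
    return (frames // tubelet_size) * (height // patch_size) * (width // patch_size)

# A 16-frame clip at 256x256 yields 8 * 16 * 16 = 2048 tokens.
print(num_patch_tokens(16, 256, 256))  # 2048
```

Token count grows quickly with resolution: the same 16-frame clip at 384x384 produces 8 * 24 * 24 = 4608 tokens, which is worth keeping in mind when batching the 384-resolution checkpoints.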
Pretrained checkpoints on Huggingface
You can also use our pretrained checkpoints on Huggingface for V-JEPA 2.
```python
from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitg-fpc64-256"
# Other options:
# facebook/vjepa2-vitl-fpc64-256
# facebook/vjepa2-vith-fpc64-256
# facebook/vjepa2-vitg-fpc64-256
# facebook/vjepa2-vitg-fpc64-384

model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
```
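The repo names encode the model configuration: the architecture tag (e.g. `vitg`), the frames per clip (`fpc64` means 64 frames), and the input resolution (the final number). A small parsing helper, purely illustrative and not part of the library, makes this explicit:

```python
import re

def parse_vjepa2_repo(repo):
    """Parse e.g. "facebook/vjepa2-vitg-fpc64-256" into (arch, frames_per_clip, resolution)."""
    m = re.search(r"vjepa2-(vit[a-z]+)-fpc(\d+)-(\d+)$", repo)
    if m is None:
        raise ValueError(f"unrecognized V-JEPA 2 repo name: {repo}")
    return m.group(1), int(m.group(2)), int(m.group(3))

print(parse_vjepa2_repo("facebook/vjepa2-vitg-fpc64-384"))  # ('vitg', 64, 384)
```

This can be handy for keeping preprocessing parameters (clip length, resize resolution) in sync with whichever checkpoint you load.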
Evaluation Attentive Probes
We share the trained attentive probes for two of our visual understanding evals (Something-Something v2 and Diving48) and the action anticipation eval EPIC-KITCHENS-100.
<table> <tr> <th colspan="1">Model</th> <th colspan="4">SSv2</th> <th colspan="4">Diving48</th> <th colspan="4">EK100</th> </tr> <tr> <th colspan="1"></th> <th colspan="1">Checkpoint</th> <th colspan="1">Training Config</th> <th colspan="1">Inference Config</th> <th colspan="1">Result</th> <th colspan="1">Checkpoint</th> <th colspan="1">Training Config</th> <th colspan=