
Jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

Install / Use

/learn @facebookresearch/Jepa
About this skill

- Quality score: 0/100
- Supported platforms: Universal

README

V-JEPA: Video Joint Embedding Predictive Architecture

Official PyTorch codebase for the video joint-embedding predictive architecture, V-JEPA, a method for self-supervised learning of visual representations from video.

Meta AI Research, FAIR

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran*, Nicolas Ballas*

[Blog] [Paper] [Yannic Kilcher's Video]

V-JEPA models are trained by passively watching video pixels from the VideoMix2M dataset, and produce versatile visual representations that perform well on downstream video and image tasks without adaptation of the model's parameters; e.g., using a frozen backbone and only a light-weight task-specific attentive probe.
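The frozen-backbone evaluation described above can be sketched as follows. This is an illustrative PyTorch sketch, not the repository's actual probe implementation: the `AttentiveProbe` class, its dimensions, and the idea of a single learnable query attending over patch tokens are assumptions about how a "light-weight attentive probe" might look.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Illustrative attentive-pooling head over frozen backbone features.

    A learnable query cross-attends to the encoder's patch tokens, and a
    linear layer maps the pooled vector to class logits. Only this module
    is trained; the backbone stays frozen.
    """
    def __init__(self, embed_dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, embed_dim) from the frozen encoder
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (batch, 1, embed_dim)
        return self.head(pooled.squeeze(1))       # (batch, num_classes)

# Stand-in for frozen ViT-L features on a Kinetics-400-style task.
probe = AttentiveProbe(embed_dim=1024, num_classes=400)
feats = torch.randn(2, 196, 1024)  # hypothetical encoder output
logits = probe(feats)              # shape (2, 400)
```

Because the backbone is frozen, only the probe's (comparatively few) parameters receive gradients during downstream training.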

Method

V-JEPA pretraining is based solely on an unsupervised feature prediction objective, and does not utilize pretrained image encoders, text, negative examples, human annotations, or pixel-level reconstruction.
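The feature-prediction objective can be summarized in a minimal sketch. Everything below is a simplification under stated assumptions: the V-JEPA paper describes predicting target-encoder features in latent space with an L1 loss and an exponential-moving-average (EMA) target encoder, but the toy modules, function names, and momentum value here are illustrative, not the repository's code.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(target, online, momentum=0.999):
    """EMA update of the target encoder from the online encoder."""
    with torch.no_grad():
        for pt, po in zip(target.parameters(), online.parameters()):
            pt.mul_(momentum).add_(po, alpha=1 - momentum)

def feature_prediction_loss(encoder, target_encoder, predictor,
                            context, target_clip):
    """Predict target-region features in latent space (no pixel decoding)."""
    with torch.no_grad():
        target_feats = target_encoder(target_clip)  # stop-gradient targets
    pred = predictor(encoder(context))
    return F.l1_loss(pred, target_feats)            # L1 regression in feature space

# Toy linear modules standing in for the ViT encoder and predictor.
enc = torch.nn.Linear(16, 8)
tgt = copy.deepcopy(enc)
prd = torch.nn.Linear(8, 8)

loss = feature_prediction_loss(enc, tgt, prd,
                               torch.randn(4, 16), torch.randn(4, 16))
loss.backward()       # gradients flow only through encoder + predictor
ema_update(tgt, enc)  # target encoder tracks the online encoder
```

Note the two ingredients that make this non-trivial without negatives or reconstruction: the stop-gradient on the target features and the EMA target encoder.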

<img src="https://github.com/facebookresearch/jepa/assets/7530871/72df7ef0-2ef5-48bb-be46-27963db91f3d" width=40%> &emsp;&emsp;&emsp;&emsp;&emsp; <img src="https://github.com/facebookresearch/jepa/assets/7530871/f26b2e96-0227-44e2-b058-37e7bf1e10db" width=40%>

Visualizations

As opposed to generative methods that have a pixel decoder, V-JEPA has a predictor that makes predictions in latent space. We train a conditional diffusion model to decode the V-JEPA feature-space predictions to interpretable pixels; the pretrained V-JEPA encoder and predictor networks are kept frozen in this process. The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video.

The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal consistency with the unmasked regions of the video.

<img src="https://github.com/facebookresearch/jepa/assets/7530871/8bb5e338-0db8-4532-ba6f-fc62729acc26" width=90%> <br/> <img src="https://github.com/facebookresearch/jepa/assets/7530871/93e15a3b-9119-4149-ac88-4e6288f2043d" width=22%> <img src="https://github.com/facebookresearch/jepa/assets/7530871/7efd2ee2-2aa0-4065-a4a6-12f1d9d0499c" width=22%> <img src="https://github.com/facebookresearch/jepa/assets/7530871/06626018-cd5a-4536-9d0e-de58506ce5ed" width=22%> <img src="https://github.com/facebookresearch/jepa/assets/7530871/766da53a-e6b8-4f94-82c8-9a53b4764358" width=22%> <br/>

MODEL ZOO

Pretrained models

<table>
  <tr>
    <th colspan="1">model</th>
    <th colspan="1">patch size</th>
    <th colspan="1">resolution</th>
    <th colspan="1">iterations</th>
    <th colspan="1">batch size</th>
    <th colspan="1">data</th>
    <th colspan="2">download</th>
  </tr>
  <tr>
    <td>ViT-L</td>
    <td>2x16x16</td>
    <td>224x224</td>
    <td>90K</td>
    <td>3072</td>
    <td>VideoMix2M</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/vitl16.pth.tar">checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/pretrain/vitl16.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H</td>
    <td>2x16x16</td>
    <td>224x224</td>
    <td>90K</td>
    <td>3072</td>
    <td>VideoMix2M</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/vith16.pth.tar">checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/pretrain/vith16.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H</td>
    <td>2x16x16</td>
    <td>384x384</td>
    <td>90K</td>
    <td>2400</td>
    <td>VideoMix2M</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/vith16-384.pth.tar">checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/pretrain/vith16_384.yaml">configs</a></td>
  </tr>
</table>
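Loading one of the pretrained checkpoints above follows the standard PyTorch pattern. A word of caution: the key names inside the checkpoint dictionary (e.g. `"encoder"`) are assumptions; inspect the keys of a real checkpoint before restoring weights. The sketch below builds a dummy checkpoint in the same `.pth.tar` style purely for illustration.

```python
import os
import tempfile
import torch

# For illustration only: save a dummy checkpoint in the same format.
path = os.path.join(tempfile.gettempdir(), "vitl16_demo.pth.tar")
dummy = {"encoder": torch.nn.Linear(4, 4).state_dict(), "epoch": 0}
torch.save(dummy, path)

# Loading a model-zoo checkpoint (here, the dummy one):
ckpt = torch.load(path, map_location="cpu")
print(sorted(ckpt.keys()))  # inspect available keys first

# Then restore the encoder weights into a matching module, e.g.:
encoder = torch.nn.Linear(4, 4)
encoder.load_state_dict(ckpt["encoder"])
```

For the real checkpoints, instantiate the matching ViT architecture from the repository before calling `load_state_dict`.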

K400 Attentive probes

<table>
  <tr>
    <th colspan="1">model</th>
    <th colspan="1">resolution</th>
    <th colspan="1">accuracy (16x8x3)</th>
    <th colspan="2">download</th>
  </tr>
  <tr>
    <td>ViT-L/16</td>
    <td>224x224</td>
    <td>80.8</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/k400-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_k400_16x8x3.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>224x224</td>
    <td>82.0</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/k400-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_k400_16x8x3.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>384x384</td>
    <td>81.9</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/k400-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_k400_16x8x3.yaml">configs</a></td>
  </tr>
</table>

SSv2 Attentive probes

<table>
  <tr>
    <th colspan="1">model</th>
    <th colspan="1">resolution</th>
    <th colspan="1">accuracy (16x2x3)</th>
    <th colspan="2">download</th>
  </tr>
  <tr>
    <td>ViT-L/16</td>
    <td>224x224</td>
    <td>69.5</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/ssv2-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_ssv2_16x2x3.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>224x224</td>
    <td>71.4</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/ssv2-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_ssv2_16x2x3.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>384x384</td>
    <td>72.2</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/ssv2-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_ssv2_16x2x3.yaml">configs</a></td>
  </tr>
</table>

ImageNet1K Attentive probes

<table>
  <tr>
    <th colspan="1">model</th>
    <th colspan="1">resolution</th>
    <th colspan="1">accuracy</th>
    <th colspan="2">download</th>
  </tr>
  <tr>
    <td>ViT-L/16</td>
    <td>224x224</td>
    <td>74.8</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/in1k-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_in1k.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>224x224</td>
    <td>75.9</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/in1k-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_in1k.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>384x384</td>
    <td>77.4</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/in1k-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_in1k.yaml">configs</a></td>
  </tr>
</table>

Places205 Attentive probes

<table>
  <tr>
    <th colspan="1">model</th>
    <th colspan="1">resolution</th>
    <th colspan="1">accuracy</th>
    <th colspan="2">download</th>
  </tr>
  <tr>
    <td>ViT-L/16</td>
    <td>224x224</td>
    <td>60.3</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/places-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_places.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>224x224</td>
    <td>61.7</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/places-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_places.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>384x384</td>
    <td>62.8</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/places-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_places.yaml">configs</a></td>
  </tr>
</table>

iNat21 Attentive probes

<table>
  <tr>
    <th colspan="1">model</th>
    <th colspan="1">resolution</th>
    <th colspan="1">accuracy</th>
    <th colspan="2">download</th>
  </tr>
  <tr>
    <td>ViT-L/16</td>
    <td>224x224</td>
    <td>67.8</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/inat-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_inat.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>224x224</td>
    <td>67.9</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/inat-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_inat.yaml">configs</a></td>
  </tr>
  <tr>
    <td>ViT-H/16</td>
    <td>384x384</td>
    <td>72.6</td>
    <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/inat-probe.pth.tar">attentive probe checkpoint</a></td>
    <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_inat.yaml">configs</a></td>
  </tr>
</table>

Code Structure

Config files: All experiment parameters are specified in config files (as opposed to command-line arguments).
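Since all experiment parameters live in YAML config files (the `configs/` links in the tables above), a run typically starts by parsing one. The fragment and key names below are illustrative, written in the spirit of `configs/pretrain/vitl16.yaml`; the repository's actual config schema may differ.

```python
import yaml  # PyYAML

# Hypothetical fragment in the style of a pretrain config; the real
# keys in configs/pretrain/vitl16.yaml may be organized differently.
cfg_text = """
model:
  arch: vit_large
  patch_size: 16
optimization:
  batch_size: 3072
  iterations: 90000
"""

cfg = yaml.safe_load(cfg_text)
print(cfg["model"]["arch"])             # vit_large
print(cfg["optimization"]["batch_size"])  # 3072
```

Keeping every hyperparameter in a versioned YAML file (rather than command-line flags) makes each run reproducible from a single artifact.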

View on GitHub

- GitHub stars: 3.7k
- Forks: 373
- Category: Content
- Updated: 4h ago
- Languages: Python
- Security score: 80/100 (audited on Apr 3, 2026; no findings)