# V-JEPA: Video Joint Embedding Predictive Architecture
Official PyTorch codebase for the video joint-embedding predictive architecture, V-JEPA, a method for self-supervised learning of visual representations from video.
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran*, Nicolas Ballas*
[Blog] [Paper] [Yannic Kilcher's Video]
V-JEPA models are trained by passively watching video pixels from the VideoMix2M dataset, and produce versatile visual representations that perform well on downstream video and image tasks without adaptation of the model’s parameters; i.e., using a frozen backbone and only a light-weight task-specific attentive probe.
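The frozen-evaluation recipe can be sketched in PyTorch as follows. This is a minimal illustration, not the repo's probe implementation: the class name, dimensions, and single-query attention pooling are assumptions.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """A single learnable query cross-attends over the frozen backbone's
    patch tokens; a linear head then classifies the pooled vector."""
    def __init__(self, embed_dim, num_classes, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                     # tokens: [B, N, D]
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # [B, 1, D]
        return self.head(pooled.squeeze(1))        # [B, num_classes]

# Stand-in features: in practice these come from the pretrained V-JEPA
# encoder run under torch.no_grad(), so only the probe is trained.
feats = torch.randn(4, 1568, 1024)                 # hypothetical ViT-L tokens
probe = AttentiveProbe(embed_dim=1024, num_classes=400)
logits = probe(feats)                              # [4, 400]
```

Because the backbone stays frozen, only the probe's few parameters (one query vector, one attention block, one linear head) receive gradients during downstream training.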
## Method
V-JEPA pretraining is based solely on an unsupervised feature prediction objective, and does not utilize pretrained image encoders, text, negative examples, human annotations, or pixel-level reconstruction.
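To make the contrast with pixel-level reconstruction concrete, the feature-prediction objective can be caricatured in a few lines. This is a hedged sketch: the tensor shapes, the L1 loss, and the stop-gradient on the target are assumptions based on the description above, not the repo's exact training loop.

```python
import torch
import torch.nn.functional as F

# Stand-in tensors: predictor outputs and target-encoder features at the
# masked locations of a video clip.
pred = torch.randn(4, 196, 1024, requires_grad=True)   # predictor output
with torch.no_grad():
    target = torch.randn(4, 196, 1024)                 # target features

loss = F.l1_loss(pred, target)   # regression in feature space, not pixels
loss.backward()                  # gradients flow to the predictor only
```

The key point is that the regression target is another network's representation, so no decoder back to pixels (and no negatives, text, or labels) is needed during pretraining.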
<img src="https://github.com/facebookresearch/jepa/assets/7530871/72df7ef0-2ef5-48bb-be46-27963db91f3d" width=40%>       <img src="https://github.com/facebookresearch/jepa/assets/7530871/f26b2e96-0227-44e2-b058-37e7bf1e10db" width=40%>

## Visualizations
As opposed to generative methods that have a pixel decoder, V-JEPA has a predictor that makes predictions in latent space. We train a conditional diffusion model to decode the V-JEPA feature-space predictions to interpretable pixels; the pretrained V-JEPA encoder and predictor networks are kept frozen in this process. The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video.
The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal consistency with the unmasked regions of the video.
<img src="https://github.com/facebookresearch/jepa/assets/7530871/8bb5e338-0db8-4532-ba6f-fc62729acc26" width=90%> <br/> <img src="https://github.com/facebookresearch/jepa/assets/7530871/93e15a3b-9119-4149-ac88-4e6288f2043d" width=22%> <img src="https://github.com/facebookresearch/jepa/assets/7530871/7efd2ee2-2aa0-4065-a4a6-12f1d9d0499c" width=22%> <img src="https://github.com/facebookresearch/jepa/assets/7530871/06626018-cd5a-4536-9d0e-de58506ce5ed" width=22%> <img src="https://github.com/facebookresearch/jepa/assets/7530871/766da53a-e6b8-4f94-82c8-9a53b4764358" width=22%> <br/>

## MODEL ZOO
### Pretrained models
<table> <tr> <th colspan="1">model</th> <th colspan="1">patch size</th> <th colspan="1">resolution</th> <th colspan="1">iterations</th> <th colspan="1">batch size</th> <th colspan="1">data</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L</td> <td>2x16x16</td> <td>224x224</td> <td>90K</td> <td>3072</td> <td>VideoMix2M</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/vitl16.pth.tar">checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/pretrain/vitl16.yaml">configs</a></td> </tr> <tr> <td>ViT-H</td> <td>2x16x16</td> <td>224x224</td> <td>90K</td> <td>3072</td> <td>VideoMix2M</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/vith16.pth.tar">checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/pretrain/vith16.yaml">configs</a></td> </tr> <tr> <td>ViT-H</td> <td>2x16x16</td> <td>384x384</td> <td>90K</td> <td>2400</td> <td>VideoMix2M</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/vith16-384.pth.tar">checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/pretrain/vith16_384.yaml">configs</a></td> </tr> </table>

### K400 Attentive probes
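A quick way to read the table annotations: the accuracy below is reported as 16x8x3, which (by the usual video-evaluation convention, an assumption here) means 16 frames per clip, 8 temporal clips, and 3 spatial crops per video. Combined with the 2x16x16 patch size from the pretraining table, the per-clip token count follows by simple arithmetic:

```python
# 16x8x3 = frames per clip x temporal clips x spatial crops (assumed convention)
frames_per_clip, temporal_clips, spatial_crops = 16, 8, 3
views_per_video = temporal_clips * spatial_crops            # 24 clip views
frames_per_video = frames_per_clip * views_per_video        # 384 frames total

# 2x16x16 (temporal x height x width) patches over a 16-frame 224x224 clip:
tokens_per_clip = (frames_per_clip // 2) * (224 // 16) * (224 // 16)
print(views_per_video, frames_per_video, tokens_per_clip)   # 24 384 1568
```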
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy (16x8x3)</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>80.8</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/k400-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_k400_16x8x3.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>82.0</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/k400-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_k400_16x8x3.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>81.9</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/k400-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_k400_16x8x3.yaml">configs</a></td> </tr> </table>

### SSv2 Attentive probes
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy (16x2x3)</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>69.5</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/ssv2-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_ssv2_16x2x3.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>71.4</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/ssv2-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_ssv2_16x2x3.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>72.2</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/ssv2-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_ssv2_16x2x3.yaml">configs</a></td> </tr> </table>

### ImageNet1K Attentive probes
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>74.8</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/in1k-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_in1k.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>75.9</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/in1k-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_in1k.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>77.4</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/in1k-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_in1k.yaml">configs</a></td> </tr> </table>

### Places205 Attentive probes
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>60.3</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/places-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_places.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>61.7</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/places-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_places.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>62.8</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/places-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_places.yaml">configs</a></td> </tr> </table>

### iNat21 Attentive probes
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>67.8</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/inat-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_inat.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>67.9</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/inat-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_inat.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>72.6</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/inat-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_inat.yaml">configs</a></td> </tr> </table>

## Code Structure
**Config files:** All experiment parameters are specified in config files (as opposed to command-line arguments).
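As a sketch of how such a YAML config might be consumed in Python (the keys below are hypothetical; see the YAML files linked in the model zoo for the real schema, and note this assumes the PyYAML package is available):

```python
import yaml  # PyYAML

# Hypothetical config fragment, inlined here for illustration only.
cfg_text = """
meta:
  seed: 42
optimization:
  epochs: 100
  lr: 0.000625
"""
cfg = yaml.safe_load(cfg_text)           # parse YAML into nested dicts
print(cfg["optimization"]["lr"])         # 0.000625
```

Keeping every experiment parameter in a versioned YAML file (rather than ad-hoc command-line flags) makes runs reproducible from the config alone.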
