# V-JEPA: Video Joint Embedding Predictive Architecture
Official PyTorch codebase for the video joint-embedding predictive architecture, V-JEPA, a method for self-supervised learning of visual representations from video.
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran*, Nicolas Ballas*
[Blog] [Paper] [Yannic Kilcher's Video]
V-JEPA models are trained by passively watching video pixels from the VideoMix2M dataset, and produce versatile visual representations that perform well on downstream video and image tasks without adaptation of the model’s parameters; i.e., using a frozen backbone and only a light-weight task-specific attentive probe.
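The frozen-evaluation recipe can be sketched in PyTorch as follows. This is a minimal illustration, not the repo's probe implementation: the class name, dimensions, and single-query attention pooling are assumptions.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """A single learnable query cross-attends over the frozen backbone's
    patch tokens; a linear head then classifies the pooled vector."""
    def __init__(self, embed_dim, num_classes, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                     # tokens: [B, N, D]
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # [B, 1, D]
        return self.head(pooled.squeeze(1))        # [B, num_classes]

# Stand-in features: in practice these come from the pretrained V-JEPA
# encoder run under torch.no_grad(), so only the probe is trained.
feats = torch.randn(4, 1568, 1024)                 # hypothetical ViT-L tokens
probe = AttentiveProbe(embed_dim=1024, num_classes=400)
logits = probe(feats)                              # [4, 400]
```

Because the backbone stays frozen, only the probe's few parameters (one query vector, one attention block, one linear head) receive gradients during downstream training.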
## Method
V-JEPA pretraining is based solely on an unsupervised feature prediction objective, and does not utilize pretrained image encoders, text, negative examples, human annotations, or pixel-level reconstruction.
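To make the contrast with pixel-level reconstruction concrete, the feature-prediction objective can be caricatured in a few lines. This is a hedged sketch: the tensor shapes, the L1 loss, and the stop-gradient on the target are assumptions based on the description above, not the repo's exact training loop.

```python
import torch
import torch.nn.functional as F

# Stand-in tensors: predictor outputs and target-encoder features at the
# masked locations of a video clip.
pred = torch.randn(4, 196, 1024, requires_grad=True)   # predictor output
with torch.no_grad():
    target = torch.randn(4, 196, 1024)                 # target features

loss = F.l1_loss(pred, target)   # regression in feature space, not pixels
loss.backward()                  # gradients flow to the predictor only
```

The key point is that the regression target is another network's representation, so no decoder back to pixels (and no negatives, text, or labels) is needed during pretraining.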
<img src="https://github.com/facebookresearch/jepa/assets/7530871/72df7ef0-2ef5-48bb-be46-27963db91f3d" width=40%>       <img src="https://github.com/facebookresearch/jepa/assets/7530871/f26b2e96-0227-44e2-b058-37e7bf1e10db" width=40%>

## Visualizations
As opposed to generative methods that have a pixel decoder, V-JEPA has a predictor that makes predictions in latent space. We train a conditional diffusion model to decode the V-JEPA feature-space predictions to interpretable pixels; the pretrained V-JEPA encoder and predictor networks are kept frozen in this process. The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video.
The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal consistency with the unmasked regions of the video.
<img src="https://github.com/facebookresearch/jepa/assets/7530871/8bb5e338-0db8-4532-ba6f-fc62729acc26" width=90%> <br/> <img src="https://github.com/facebookresearch/jepa/assets/7530871/93e15a3b-9119-4149-ac88-4e6288f2043d" width=22%> <img src="https://github.com/facebookresearch/jepa/assets/7530871/7efd2ee2-2aa0-4065-a4a6-12f1d9d0499c" width=22%> <img src="https://github.com/facebookresearch/jepa/assets/7530871/06626018-cd5a-4536-9d0e-de58506ce5ed" width=22%> <img src="https://github.com/facebookresearch/jepa/assets/7530871/766da53a-e6b8-4f94-82c8-9a53b4764358" width=22%> <br/>

## MODEL ZOO
### Pretrained models
<table> <tr> <th colspan="1">model</th> <th colspan="1">patch size</th> <th colspan="1">resolution</th> <th colspan="1">iterations</th> <th colspan="1">batch size</th> <th colspan="1">data</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L</td> <td>2x16x16</td> <td>224x224</td> <td>90K</td> <td>3072</td> <td>VideoMix2M</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/vitl16.pth.tar">checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/pretrain/vitl16.yaml">configs</a></td> </tr> <tr> <td>ViT-H</td> <td>2x16x16</td> <td>224x224</td> <td>90K</td> <td>3072</td> <td>VideoMix2M</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/vith16.pth.tar">checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/pretrain/vith16.yaml">configs</a></td> </tr> <tr> <td>ViT-H</td> <td>2x16x16</td> <td>384x384</td> <td>90K</td> <td>2400</td> <td>VideoMix2M</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/vith16-384.pth.tar">checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/pretrain/vith16_384.yaml">configs</a></td> </tr> </table>

### K400 Attentive probes
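A quick way to read the table annotations: the accuracy below is reported as 16x8x3, which (by the usual video-evaluation convention, an assumption here) means 16 frames per clip, 8 temporal clips, and 3 spatial crops per video. Combined with the 2x16x16 patch size from the pretraining table, the per-clip token count follows by simple arithmetic:

```python
# 16x8x3 = frames per clip x temporal clips x spatial crops (assumed convention)
frames_per_clip, temporal_clips, spatial_crops = 16, 8, 3
views_per_video = temporal_clips * spatial_crops            # 24 clip views
frames_per_video = frames_per_clip * views_per_video        # 384 frames total

# 2x16x16 (temporal x height x width) patches over a 16-frame 224x224 clip:
tokens_per_clip = (frames_per_clip // 2) * (224 // 16) * (224 // 16)
print(views_per_video, frames_per_video, tokens_per_clip)   # 24 384 1568
```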
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy (16x8x3)</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>80.8</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/k400-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_k400_16x8x3.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>82.0</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/k400-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_k400_16x8x3.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>81.9</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/k400-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_k400_16x8x3.yaml">configs</a></td> </tr> </table>

### SSv2 Attentive probes
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy (16x2x3)</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>69.5</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/ssv2-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_ssv2_16x2x3.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>71.4</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/ssv2-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_ssv2_16x2x3.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>72.2</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/ssv2-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_ssv2_16x2x3.yaml">configs</a></td> </tr> </table>

### ImageNet1K Attentive probes
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>74.8</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/in1k-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_in1k.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>75.9</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/in1k-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_in1k.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>77.4</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/in1k-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_in1k.yaml">configs</a></td> </tr> </table>

### Places205 Attentive probes
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>60.3</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/places-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_places.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>61.7</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/places-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_places.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>62.8</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/places-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_places.yaml">configs</a></td> </tr> </table>

### iNat21 Attentive probes
<table> <tr> <th colspan="1">model</th> <th colspan="1">resolution</th> <th colspan="1">accuracy</th> <th colspan="2">download</th> </tr> <tr> <td>ViT-L/16</td> <td>224x224</td> <td>67.8</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vitl16/inat-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vitl16_inat.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>224x224</td> <td>67.9</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16/inat-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_inat.yaml">configs</a></td> </tr> <tr> <td>ViT-H/16</td> <td>384x384</td> <td>72.6</td> <td><a href="https://dl.fbaipublicfiles.com/jepa/vith16-384/inat-probe.pth.tar">attentive probe checkpoint</a></td> <td><a href="https://github.com/facebookresearch/jepa/blob/master/configs/evals/vith16_384_inat.yaml">configs</a></td> </tr> </table>

## Code Structure
**Config files:** All experiment parameters are specified in config files (as opposed to command-line arguments).
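As a sketch of how such a YAML config might be consumed in Python (the keys below are hypothetical; see the YAML files linked in the model zoo for the real schema, and note this assumes the PyYAML package is available):

```python
import yaml  # PyYAML

# Hypothetical config fragment, inlined here for illustration only.
cfg_text = """
meta:
  seed: 42
optimization:
  epochs: 100
  lr: 0.000625
"""
cfg = yaml.safe_load(cfg_text)           # parse YAML into nested dicts
print(cfg["optimization"]["lr"])         # 0.000625
```

Keeping every experiment parameter in a versioned YAML file (rather than ad-hoc command-line flags) makes runs reproducible from the config alone.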
