Efficient Universal Perception Encoder (EUPE)

Chenchen Zhu, Saksham Suri, Cijo Jose, Maxime Oquab, Marc Szafraniec, Wei Wen, Yunyang Xiong, Patrick Labatut, Piotr Bojanowski, Raghuraman Krishnamoorthi, Vikas Chandra

[ :scroll: Paper] [ 🤗 HF] [ :book: BibTeX]

Reference PyTorch implementation and models for EUPE. For details, see the EUPE paper.

Overview

<i></em><br/>Applying our distillation recipe (EUPE) to ViT-B gives a well-balanced universal encoder that excels at diverse task domains compared to both ViT-B domain experts and existing agglomerative ViT-Bs.</i>

</div> <br/>

An extended family of versatile efficient vision encoders producing high-quality features and achieving outstanding performance on various vision tasks including image understanding, dense prediction and vision-language modeling.

Pretrained models

:information_source: Please follow the link provided below to get access to all the model weights. These URLs can then be used to download the model weights to a local filesystem and point torch.hub.load() to these local weights via the weights parameters.

See the example code snippets below.

ViT models pretrained on web dataset (LVD-1689M):

<table style="margin: auto"> <thead> <tr> <th>Model</th> <th>Parameters</th> <th>Download</th> </tr> </thead> <tbody> <tr> <td>ViT-T/16</td> <td align="right">6M</td> <td align="center"><a href="https://huggingface.co/facebook/EUPE-ViT-T/">[link]</a></td> </tr> <tr> <td>ViT-S/16</td> <td align="right">21M</td> <td align="center"><a href="https://huggingface.co/facebook/EUPE-ViT-S/">[link]</a></td> </tr> <tr> <td>ViT-B/16</td> <td align="right">86M</td> <td align="center"><a href="https://huggingface.co/facebook/EUPE-ViT-B/">[link]</a></td> </tr> </tbody> </table>

ConvNeXt models pretrained on web dataset (LVD-1689M):

<table style="margin: auto"> <thead> <tr> <th>Model</th> <th>Parameters</th> <th>Download</th> </tr> </thead> <tbody> <tr> <td>ConvNeXt Tiny</td> <td align="right">29M</td> <td align="center"><a href="https://huggingface.co/facebook/EUPE-ConvNeXt-T/">[link]</a></td> </tr> <tr> <td>ConvNeXt Small</td> <td align="right">50M</td> <td align="center"><a href="https://huggingface.co/facebook/EUPE-ConvNeXt-S/">[link]</a></td> </tr> <tr> <td>ConvNeXt Base</td> <td align="right">89M</td> <td align="center"><a href="https://huggingface.co/facebook/EUPE-ConvNeXt-B/">[link]</a></td> </tr> </tbody> </table>

Pretrained backbones (via PyTorch Hub)

Please follow the instructions here to install PyTorch (the only required dependency for loading the model). Installing PyTorch with CUDA support is strongly recommended.

import torch

REPO_DIR = <PATH/TO/A/LOCAL/DIRECTORY/WHERE/THE/EUPE/REPO/WAS/CLONED>

# EUPE ViT models pretrained on web images
eupe_vitt16 = torch.hub.load(REPO_DIR, 'eupe_vitt16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
eupe_vits16 = torch.hub.load(REPO_DIR, 'eupe_vits16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
eupe_vitb16 = torch.hub.load(REPO_DIR, 'eupe_vitb16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)

# EUPE ConvNeXt models pretrained on web images
eupe_convnext_tiny = torch.hub.load(REPO_DIR, 'eupe_convnext_tiny', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
eupe_convnext_small = torch.hub.load(REPO_DIR, 'eupe_convnext_small', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
eupe_convnext_base = torch.hub.load(REPO_DIR, 'eupe_convnext_base', source='local', weights=<CHECKPOINT/URL/OR/PATH>)

Image transforms

Please use the following transform (standard ImageNet evaluation transform):

import torchvision
from torchvision.transforms import v2

def make_transform(resize_size: int = 256):
    to_tensor = v2.ToImage()
    resize = v2.Resize((resize_size, resize_size), antialias=True)
    to_float = v2.ToDtype(torch.float32, scale=True)
    normalize = v2.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return v2.Compose([to_tensor, resize, to_float, normalize])

Installation

The training and evaluation code requires PyTorch version >= 2.7.1 as well as a few other 3rd party packages. Note that the code has only been tested with the specified versions and also expects a Linux environment. To setup all the required dependencies for training and evaluation, please follow the instructions below:

micromamba (Recommended) - Clone the repository and then create and activate a eupe conda environment using the provided environment definition:

micromamba env create -f conda.yaml
micromamba activate eupe

Data preparation

ADE20K

Create a folder to host the ADE20K dataset for example:

export DATASETS_ROOT=${HOME}/datasets
mkdir -p ${DATASETS_ROOT}/ADE20K
with-proxy wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip -d ${DATASETS_ROOT}/ADE20K

After untarring the data file, the directory structure should be similar to the following,

the training images:

images/training/ADE_train_00000001.jpg
images/training/ADE_train_00000002.jpg
    ...
images/training/ADE_train_00020210.jpg

the corresponding annotation masks for the training images:

annotations/training/ADE_train_00000001.png
annotations/training/ADE_train_00000002.png
    ...
annotations/training/ADE_train_00020210.png

the validation images:

images/validation/ADE_val_00000001.jpg
images/validation/ADE_val_00000002.jpg
    ...
images/validation/ADE_val_00002000.jpg

the corresponding annotation masks for the validation images:

annotations/validation/ADE_val_00000001.png
annotations/validation/ADE_val_00000002.png
    ...
annotations/validation/ADE_val_00002000.png

Note: annotations masks contain labels ranging from 0 to 150, where 0 refers to "other objects". We do not consider those pixels in our evaluation.

objectInfo150.txt contains the information about the labels of the 150 semantic categories, including indices, pixel ratios and names.

NYUv2 Depth

Create a folder to host the NYU dataset for example:

export DATASETS_ROOT=${HOME}/datasets
mkdir -p ${DATASETS_ROOT}/NYU

We use the NYU subset extracted by BTS from the 120k samples of the original NYU raw dataset.

Option 1 -- Follow BTS's instructions

Please follow BTS instructions to create the dataset:

train set
test set.

Make sure you also download the train and test splits:

wget https://github.com/cleinc/bts/blob/master/train_test_inputs/nyudepthv2_train_files_with_gt.txt -O ${DEPTH_DATASETS_ROOT}/NYU/nyu_train.txt
wget https://github.com/cleinc/bts/blob/master/train_test_inputs/nyudepthv2_test_files_with_gt.txt -O ${DEPTH_DATASETS_ROOT}/NYU/nyu_test.txt

Option 2 (preferred) -- Download the readily availble dataset from BinsFormer

Alternatively, one can download the dataset from the following Google Drive link. If the Google Drive link is not available anymore, try Option 1.

Expected contents:

$DEPTH_DATASETS_ROOT/NYU/basement/[...]
$DEPTH_DATASETS_ROOT/NYU/basement_0001a/[...]
$DEPTH_DATASETS_ROOT/NYU/basement_0001b/[...]
$DEPTH_DATASETS_ROOT/NYU/bathroom/[...]
$DEPTH_DATASETS_ROOT/NYU/[...]
$DEPTH_DATASETS_ROOT/NYU/study_room_0004/[...]
$DEPTH_DATASETS_ROOT/NYU/study_room_0005a/[...]
$DEPTH_DATASETS_ROOT/NYU/study_room_0005b/[...]
$DEPTH_DATASETS_ROOT/NYU/nyu_test.txt
$DEPTH_DATASETS_ROOT/NYU/nyu_train.txt

Note: if data is downloaded with Option 2 make sure to rename nyu into NYU.

ImageNet-1k

The root directory of the dataset should hold the following contents:

<ROOT>/test/ILSVRC2012_test_00000001.JPEG
<ROOT>/test/[..]
<ROOT>/test/ILSVRC2012_test_00100000.JPEG
<ROOT>/train/n01440764/n01440764_10026.JPEG
<ROOT>/train/[...]
<ROOT>/train/n15075141/n15075141_9993.JPEG
<ROOT>/val/n01440764/ILSVRC2012_val_00000293.JPEG
<ROOT>/val/[...]
<ROOT>/val/n15075141/ILSVRC2012_val_00049174.JPEG
<ROOT>/labels.txt

The provided dataset implementation expects a few additional metadata files to be present under the extra directory:

<EXTRA>/class-ids-TRAIN.npy
<EXTRA>/class-ids-VAL.npy
<EXTRA>/class-names-TRAIN.npy
<EXTRA>/class-names-VAL.npy
<EXTRA>/entries-TEST.npy
<EXTRA>/entries-TRAIN.npy
<EXTRA>/entries-VAL.npy

These metadata files can be generated (once) with the following lines of Python code:

from eupe.data.datasets import ImageNet

for split in ImageNet.Split:
    dataset = ImageNet(split=split, root="<ROOT>", extra="<EXTRA>")
    dataset.dump_extra()

Note that the root and extra directories do not have to be distinct directories.

Evaluation

In order to evaluate the model, run the following evaluation on a single node:

Linear segmentation with data augmentation on ADE20K

PYTHONPATH=. python eupe/eva

EUPE

Install / Use

README