# Pixio

**Pixio: a capable vision encoder dedicated to dense prediction, trained simply by pixel reconstruction**
Official implementation of Pixio from the paper *In Pursuit of Pixel Supervision for Visual Pre-training*.
Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu
[BibTeX]
Pixio is largely built on MAE, with three minimal yet critical algorithmic updates:
- deeper decoder
- larger masking granularity
- more class tokens
Pixio also updates MAE's pre-training data from ImageNet-1K to MetaCLIP-2B with a simple self-curation strategy.
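For intuition, the three deltas can be thought of as configuration changes on top of MAE. The sketch below is hypothetical: the MAE baseline values follow the original MAE paper, while the Pixio-side values are illustrative placeholders, not the settings reported in the paper (except the 8 class tokens, which match the inference notes later in this README).

```python
# Hypothetical sketch of Pixio's three deltas relative to MAE's defaults.
# MAE values follow the original MAE paper; the Pixio values here are
# placeholders for illustration, not the paper's actual settings.
mae_config = dict(
    decoder_depth=8,      # MAE's default decoder depth
    mask_block_size=1,    # MAE masks individual patches independently
    num_class_tokens=1,   # a single [CLS] token
)

pixio_config = dict(
    decoder_depth=16,     # deeper decoder (placeholder value)
    mask_block_size=4,    # coarser, block-wise masking (placeholder value)
    num_class_tokens=8,   # matches the 8 class tokens noted in Inference below
)
```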
<p align="left"> <img src="./assets/pixio.png" width=90% height=90% class="center"> </p>Performance
### Monocular depth estimation ($\delta_1 \uparrow$, frozen encoder)

| Method | ViT | #Params | NYUv2 (DPT head) | KITTI (DPT head) | NYUv2 (linear head) | KITTI (linear head) |
| :----- | ----: | ------: | :--------------: | :--------------: | :-----------------: | :-----------------: |
| MAE | H/14 | 631M | 80.8 | 90.9 | 70.3 | 79.4 |
| DINOv2 | g/14 | 1137M | 90.1 | 94.6 | 75.3 | 78.1 |
| DINOv3 | H+/16 | 841M | 93.2 | 95.6 | 76.3 | 73.2 |
| Pixio | H/16 | 631M | 95.5 | 96.7 | 90.8 | 90.3 |
### Feed-forward 3D reconstruction (MapAnything, ScanNet++ v2)

| Method | ViT | #Params | Scale (rel $\downarrow$) | Points (rel $\downarrow$) | Points ($\tau \uparrow$) | Pose (auc5 $\uparrow$) | Depth (rel $\downarrow$) | Depth ($\tau \uparrow$) |
| :----- | ----: | ------: | :----------------------: | :-----------------------: | :----------------------: | :--------------------: | :----------------------: | :---------------------: |
| MAE | H/14 | 631M | 0.050 | 0.057 | 63.3 | 65.6 | 0.058 | 55.4 |
| DINOv2 | L/14 | 304M | 0.041 | 0.052 | 67.6 | 73.2 | 0.052 | 60.6 |
| DINOv3 | H+/16 | 841M | 0.035 | 0.051 | 69.0 | 68.5 | 0.051 | 61.2 |
| Pixio | H/16 | 631M | 0.029 | 0.041 | 78.8 | 80.5 | 0.042 | 72.4 |
### Semantic segmentation (mIoU $\uparrow$, frozen encoder)

| Method | ViT | #Params | ADE20K (DPT) | VOC (DPT) | LoveDA (DPT) | ADE20K (linear) | VOC (linear) | LoveDA (linear) |
| :----- | ----: | ------: | :----------: | :-------: | :----------: | :-------------: | :----------: | :-------------: |
| MAE | H/14 | 631M | 37.6 | 76.0 | 50.2 | 35.2 | 70.8 | 47.6 |
| DINOv2 | g/14 | 1137M | 51.5 | 85.2 | 55.0 | 49.0 | 81.8 | 51.9 |
| DINOv3 | H+/16 | 841M | 52.3 | 85.6 | 55.3 | 50.3 | 82.1 | 52.7 |
| Pixio | H/16 | 631M | 53.6 | 85.9 | 54.7 | 50.2 | 82.2 | 53.9 |
## Installation
This codebase is developed with PyTorch 2.8.0 + CUDA 12.8.
```bash
conda create -n pixio python=3.10.18
conda activate pixio
pip install -r requirements.txt
```
## Inference (may need Hugging Face login)
You can either use the source code from this repo or call the Transformers API.
### Source Code
Pixio ViT models pre-trained on a web-scale dataset (MetaCLIP-2B):
<table style="margin: auto"> <thead> <tr> <th>Model</th> <th>Parameters</th> <th>Pre-training Dataset</th> <th>Download</th> </tr> </thead> <tbody> <tr> <td>Pixio-B/16</td> <td align="right">86M</td> <td align="center">MetaCLIP-2B</td> <td align="center"><a href="https://huggingface.co/facebook/pixio-vitb16/resolve/main/pixio_vitb16.pth">[link]</a></td> </tr> <tr> <td>Pixio-L/16</td> <td align="right">303M</td> <td align="center">MetaCLIP-2B</td> <td align="center"><a href="https://huggingface.co/facebook/pixio-vitl16/resolve/main/pixio_vitl16.pth">[link]</a></td> </tr> <tr> <td>Pixio-H/16</td> <td align="right">631M</td> <td align="center">MetaCLIP-2B</td> <td align="center"><a href="https://huggingface.co/facebook/pixio-vith16/resolve/main/pixio_vith16.pth">[link]</a></td> </tr> <tr> <td>Pixio-1B/16</td> <td align="right">1362M</td> <td align="center">MetaCLIP-2B</td> <td align="center"><a href="https://huggingface.co/facebook/pixio-vit1b16/resolve/main/pixio_vit1b16.pth">[link]</a></td> </tr> <tr> <td>Pixio-5B/16</td> <td align="right">5441M</td> <td align="center">MetaCLIP-2B</td> <td align="center"><a href="https://huggingface.co/facebook/pixio-vit5b16/resolve/main/pixio_vit5b16.pth">[link]</a></td> </tr> </tbody> </table>

First, enter this repo:

```bash
cd pixio
```
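If you prefer to fetch a checkpoint programmatically, here is a minimal sketch using the `huggingface_hub` client; the repo id and filename are taken from the download links in the table above.

```python
# Sketch: download a Pixio checkpoint via huggingface_hub.
# Repo id and filename follow the download links in the table above.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="facebook/pixio-vith16",
    filename="pixio_vith16.pth",
)
print(ckpt_path)  # local cache path, usable as `pretrained=` below
```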
Then run inference as follows:
```python
from PIL import Image
from torchvision import transforms

from pixio import pixio_vith16

model = pixio_vith16(pretrained="your/checkpoint/path")

# you can try a larger resolution, but ensure both sides are divisible by 16
transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=3),  # 3 is bicubic
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

img = Image.open("your/image/path").convert("RGB")
img = transform(img)

# block-wise features containing class tokens and patch tokens
features = model(img.unsqueeze(0))
```
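For dense prediction you typically want a 2D feature map rather than a token sequence. Below is a minimal sketch of that conversion; it assumes the last block's output is a `[B, 8 + N, C]` tensor with the 8 class tokens placed before the patch tokens (this token layout is an assumption, so check the model's actual output format).

```python
# Hypothetical post-processing: split off class tokens and reshape
# patch tokens into a [B, C, H/16, W/16] feature map.
tokens = features[-1]                            # [B, 8 + N, C] (assumed layout)
num_class_tokens = 8
patch_tokens = tokens[:, num_class_tokens:, :]   # [B, N, C]

B, N, C = patch_tokens.shape
h = w = int(N ** 0.5)                            # square input, e.g. 256 / 16 = 16
feat_map = patch_tokens.transpose(1, 2).reshape(B, C, h, w)
```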
### Transformers (may need Hugging Face login)
You can find all HuggingFace paths under this collection.
```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

img = Image.open("your/image/path")

processor = AutoImageProcessor.from_pretrained("facebook/pixio-vith16")
model = AutoModel.from_pretrained("facebook/pixio-vith16")

inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)

features_norm = outputs.last_hidden_state  # 8 class tokens + patch tokens, after the last LayerNorm
features = outputs.hidden_states[-1]       # 8 class tokens + patch tokens, before the last LayerNorm
```
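DPT-style heads, as used in the benchmarks above, consume features from several encoder depths rather than only the last block. A sketch of gathering such multi-level features from `outputs.hidden_states` follows; the layer indices are illustrative, not the configuration used in the paper.

```python
# Sketch: collect multi-level features for a DPT-style head.
# hidden_states[0] is the embedding output; hidden_states[i] is block i.
# The evenly spaced indices below are illustrative, not the paper's setting.
num_layers = len(outputs.hidden_states) - 1
layer_ids = [num_layers // 4, num_layers // 2, 3 * num_layers // 4, num_layers]

multi_level = [outputs.hidden_states[i] for i in layer_ids]
for feat in multi_level:
    print(feat.shape)  # [B, 8 + num_patches, C] per level (assumed layout)
```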
## Pre-training

### Data Preparation
We provide examples using ImageNet-1K and ImageNet-21K, organized as tar files from Hugging Face.
### Launch Pre-training

```bash
cd pretraining
# specify your data path in the script
bash scripts/pretrain_pixio_vith16_imagenet.sh
```
## Evaluation
We provide the evaluation code for monocular depth estimation (NYUv2, KITTI), semantic segmentation (ADE20K, Pascal VOC, LoveDA), and k-NN classification (ImageNet-1K).
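As a refresher on the frozen-encoder k-NN protocol: classification is done directly on frozen features, with no trained head. The sketch below is a generic illustration using a plain majority vote, not the repo's evaluation code (which may use a weighted vote or a different k).

```python
import torch

# Generic sketch of frozen-encoder k-NN classification (not the repo's
# evaluation code). train_feats / test_feats are L2-normalized global
# features from the frozen encoder; train_labels are integer class ids.
def knn_classify(train_feats, train_labels, test_feats, k=20):
    sim = test_feats @ train_feats.T          # cosine similarity [num_test, num_train]
    topk_sim, topk_idx = sim.topk(k, dim=1)   # k nearest training features
    topk_labels = train_labels[topk_idx]      # [num_test, k]
    preds = torch.mode(topk_labels, dim=1).values  # majority vote
    return preds
```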
### Data Preparation

<details>
<summary>Click here for details</summary>

#### Monocular Depth Estimation
We follow ZoeDepth and BTS, preparing the data as follows:
- NYUv2: training set | validation set
- KITTI: images | annotations
Please organize the data as follows:
```
[Your NYUv2 Path]
├── sync
│   ├── basement_0001a
│   ├── bathroom_0001
│   └── ...
└── official_splits
    └── test
        ├── bathroom
        ├── bedroom
        └── ...
```

```
[Your KITTI Path]
├── images
│   ├── 2011_09_26
│   ├── 2011_09_28
│   └── ...
└── annotations    # extracted from data_depth_annotated.zip
    ├── 2011_09_26_drive_0001_sync
    ├── 2011_09_26_drive_0002_sync
    └── ...
```
#### Semantic Segmentation
We mainly follow UniMatch V2, preparing the data as follows:
- ADE20K: images | annotations
- Pascal: images | annotations
- LoveDA: data (run `evaluation/semseg/util/process_loveda.py` to convert the data)
