
Near, far: Patch-ordering enhances vision foundation models' scene understanding

Valentinos Pariza*, Mohammadreza Salehi*, Gertjan J. Burghouts, Francesco Locatello, Yuki M. Asano

ICLR 2025

🌐 Project Page / ⌨️ GitHub Repository / 📄 Read the Paper on arXiv


News

Thank you for using our code. Here we include news about changes in the repository.

  1. The repository has been substantially updated to use more recent libraries that speed up execution and reduce memory usage, especially for the v2 ViT architecture used by Dinov2. The speed-up for that architecture comes from using xformers, just like Dinov2 training.
  2. We updated the table below with new model entries and added post-training config files for dinov2 and dinov2r, with and without registers.
  3. We clarified how each model is trained by explicitly providing a config file next to each post-trained model in the table below.
  4. We added code for linear segmentation on the Cityscapes dataset.
  5. We cleaned up the code and added more flexibility in what can be configured during training via the configuration files. Specifically, we added the following parameters:
    • eval_attn_maps (True/False): whether to evaluate the attention maps during training.
    • num_register_tokens (int, default 0): whether to use registers and how many. Only works with architecture v2.
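For illustration, these options might appear in a post-training config roughly like the hypothetical fragment below; the key names `eval_attn_maps` and `num_register_tokens` are from the list above, but the surrounding structure is an assumption, so consult the files under `experiments/configs` for the exact schema:

```yaml
# Hypothetical fragment -- see experiments/configs for the actual schema.
eval_attn_maps: true     # evaluate attention maps during training
num_register_tokens: 4   # 0 disables registers; only valid for architecture v2
```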

If you are interested in the legacy code, please see our GitHub branch neco-1_x.

Introduction

NeCo introduces a new self-supervised learning technique for enhancing spatial representations in vision transformers. By leveraging Patch Neighbor Consistency, NeCo captures fine-grained details and structural information that are crucial for various downstream tasks, such as semantic segmentation.

<p align="center"> <img src="Images/Neco.jpg" alt="NeCo Overview" width="800"/> </p>

Key features of NeCo include:

  1. Patch-based neighborhood consistency
  2. Improved dense prediction capabilities
  3. Efficient training requiring only 19 GPU hours
  4. Compatibility with existing vision transformer backbones
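To build intuition for patch-neighbor consistency, here is a toy sketch (an illustration only, not the authors' actual differentiable-sorting loss): rank all patches by similarity to a reference patch under both the student and the teacher, and penalize disagreement between the two orderings.

```python
import torch
import torch.nn.functional as F

def neighbor_ordering_consistency(student, teacher, ref_idx=0):
    """Toy penalty on patch-neighbor ordering disagreement.

    student, teacher: (num_patches, dim) patch embeddings.
    NOT the NeCo loss -- a hard-ranking stand-in for intuition only
    (the real method uses a differentiable sorting objective).
    """
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    # Cosine similarity of every patch to the reference patch.
    sim_s = s @ s[ref_idx]
    sim_t = t @ t[ref_idx]
    # Rank of each patch in the nearest-to-farthest ordering.
    rank_s = sim_s.argsort(descending=True).argsort().float()
    rank_t = sim_t.argsort(descending=True).argsort().float()
    # Zero when both models order the neighbors identically.
    return ((rank_s - rank_t) ** 2).mean()
```

When student and teacher agree on which patches are near and which are far, the penalty is zero; the actual method makes this ordering constraint differentiable so it can be trained end to end.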

Below is a table with some of our results on Pascal VOC 2012 based on DINOv2 backbone.

<table> <tr> <th>backbone</th> <th>arch</th> <th>params</th> <th>Overclustering k=500</th> <th>Dense NN Retrieval</th> <th>linear</th> <th colspan="2">download</th> <th>config</th> </tr> <tr> <td>DINOv2</td> <td>ViT-S/14</td> <td>21M</td> <td>57.7</td> <td>78.6</td> <td>81.4</td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dinov2-architectures/neco_on_dinov2_vit14_model.ckpt">student</a></td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dinov2-architectures/neco_on_dinov2_vit14_teacher.ckpt">teacher</a></td> <td><a href="./experiments/configs/models/small/neco_dinov2.yml">config</a></td> </tr> <tr> <td>DINOv2R-XR</td> <td>ViT-S/14</td> <td>21M</td> <td>72.6</td> <td>80.2</td> <td>81.3</td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dinov2-architectures/neco_on_dinov2r_xr_vit14_model.ckpt">student</a></td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dinov2-architectures/neco_on_dinov2r_xr_vit14_teacher.ckpt">teacher</a></td> <td><a href="./experiments/configs/models/small/neco_dinov2r_xr.yml">config</a></td> </tr> <tr> <td>DINOv2R</td> <td>ViT-S/14</td> <td>21M</td> <td>68.9</td> <td>80.7</td> <td>81.5</td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dinov2-architectures/neco_on_dinov2r_vit14_model.ckpt">student</a></td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dinov2-architectures/neco_on_dinov2r_vit14_teacher.ckpt">teacher</a></td> <td><a href="./experiments/configs/models/small/neco_dinov2r.yml">config</a></td> </tr> <tr> <td>DINOv2</td> <td>ViT-B/14</td> <td>85M</td> <td>71.1</td> <td>82.8</td> <td>84.5</td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-base/dinov2-architectures/neco_on_dinov2_vit14_model.ckpt">student</a></td> <td><a 
href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-base/dinov2-architectures/neco_on_dinov2_vit14_teacher.ckpt">teacher</a></td> <td><a href="./experiments/configs/models/base/neco_dinov2.yml">config</a></td> </tr> <tr> <td>DINOv2R-XR</td> <td>ViT-B/14</td> <td>85M</td> <td>71.8</td> <td>83.5</td> <td>83.3</td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-base/dinov2-architectures/neco_on_dinov2r_xr_vit14_model.ckpt">student</a></td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-base/dinov2-architectures/neco_on_dinov2r_xr_vit14_teacher.ckpt">teacher</a></td> <td><a href="./experiments/configs/models/base/neco_dinov2r_xr.yml">config</a></td> </tr> <tr> <td>DINOv2R</td> <td>ViT-B/14</td> <td>85M</td> <td>71.9</td> <td>82.9</td> <td>84.4</td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-base/dinov2-architectures/neco_on_dinov2r_vit14_model.ckpt">student</a></td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-base/dinov2-architectures/neco_on_dinov2r_vit14_teacher.ckpt">teacher</a></td> <td><a href="./experiments/configs/models/base/neco_dinov2r.yml">config</a></td> </tr> <tr> <td>DINO</td> <td>ViT-S/16</td> <td>22M</td> <td>47.9</td> <td>61.3</td> <td>65.8</td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dino-architectures/neco_on_dino_vit16_model.ckpt">student</a></td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dino-architectures/neco_on_dino_vit16_teacher.ckpt">teacher</a></td> <td><a href="./experiments/configs/models/small/neco_dino.yml">config</a></td> </tr> <tr> <td>TimeT</td> <td>ViT-S/16</td> <td>22M</td> <td>53.1</td> <td>66.5</td> <td>68.5</td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dino-architectures/neco_on_timetuning_vit16_model.ckpt">student</a></td> <td><a 
href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dino-architectures/neco_on_timetuning_vit16_teacher.ckpt">teacher</a></td> <td><a href="./experiments/configs/models/small/neco_timetuning.yml">config</a></td> </tr> <tr> <td>Leopart</td> <td>ViT-S/16</td> <td>22M</td> <td>55.3</td> <td>66.2</td> <td>68.3</td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dino-architectures/neco_on_leopart_vit16_model.ckpt">student</a></td> <td><a href="https://huggingface.co/FunAILab/NeCo/resolve/main/vit-small/dino-architectures/neco_on_leopart_vit16_teacher.ckpt">teacher</a></td> <td><a href="./experiments/configs/models/small/neco_leopart.yml">config</a></td> </tr> </table>

In the following sections, we will delve into the training process, evaluation metrics, and provide instructions for using NeCo in your own projects.

GPU Requirements

Training with our model, NeCo, does not require a large GPU budget: our training runs on a single NVIDIA A100 GPU.

Environment Setup

We use conda for dependency management. Use environment.yaml to install the environment needed to run everything from our work:

conda env create -f environment.yaml

Or follow the step-by-step process in the Installation Guide.

Pythonpath

From the repository's parent directory, add the repository to your PYTHONPATH:

export PYTHONPATH="${PYTHONPATH}:$PATH_TO_REPO"

Neptune

We use Neptune for logging experiments. Get your API token for Neptune and insert it in the corresponding run files. Also make sure to adapt the project name when setting up the logger.

Loading pretrained models

To use NeCo models on downstream dense prediction tasks, you only need timm and torch installed. Depending on which checkpoint you use, you can load it as follows.

The models can be downloaded from our NeCo Hugging Face repo.

Models after post-training dinov2 (following dinov2 architecture)

NeCo on Dinov2

import torch

# Change to 'dinov2_vitb14' for the ViT-B backbone, as described in:
#    https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location="cpu")
model.load_state_dict(state_dict, strict=False)
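For dense downstream tasks, the patch tokens produced by the backbone are typically reshaped into a 2D feature map. The sketch below uses a stand-in tensor instead of a real forward pass (the helper name `tokens_to_feature_map` is our own, not from the repo); for ViT-S/14 on a 224x224 input there are (224/14)^2 = 256 patch tokens of dimension 384.

```python
import torch

def tokens_to_feature_map(tokens, grid=16):
    """Reshape ViT patch tokens (batch, grid*grid, dim)
    into a spatial map (batch, dim, grid, grid)."""
    b, n, d = tokens.shape
    assert n == grid * grid, "token count must match the patch grid"
    return tokens.permute(0, 2, 1).reshape(b, d, grid, grid)

# Stand-in for the model's patch-token output on a 224x224 image.
patch_tokens = torch.randn(1, 256, 384)
fmap = tokens_to_feature_map(patch_tokens)  # shape (1, 384, 16, 16)
```

The resulting map can then be fed to a linear segmentation head or used for dense nearest-neighbor retrieval, as in the evaluations above.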

NeCo on Dinov2 with Registers
