Optimal Transport Aggregation for Visual Place Recognition
Sergio Izquierdo, Javier Civera
Code and models for Optimal Transport Aggregation for Visual Place Recognition (DINOv2 SALAD).
Summary
We introduce DINOv2 SALAD, a Visual Place Recognition model that achieves state-of-the-art results on common benchmarks. Our two main contributions are:
- A fine-tuned DINOv2 encoder that yields richer, more discriminative features.
- A new aggregation technique based on optimal transport that builds the global descriptor. This aggregation extends NetVLAD to consider feature-to-cluster as well as cluster-to-feature relations, and includes a dustbin to discard uninformative features.
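The optimal transport assignment at the core of this aggregation can be computed with Sinkhorn iterations. The sketch below is a simplified illustration only: it runs plain (non-log-domain) Sinkhorn scaling between uniform marginals on a small feature-to-cluster cost matrix, without the dustbin or the learned components of SALAD; the cost values are made up for the example.

```python
import math

def sinkhorn(cost, n_iters=100):
    """Entropic optimal transport between uniform marginals via Sinkhorn scaling.

    cost: list of lists, cost[i][j] = cost of assigning feature i to cluster j.
    Returns the transport plan P, whose rows sum to 1/n and columns to 1/m.
    """
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c) for c in row] for row in cost]  # Gibbs kernel
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the uniform marginals.
        for i in range(n):
            u[i] = (1.0 / n) / sum(K[i][j] * v[j] for j in range(m))
        for j in range(m):
            v[j] = (1.0 / m) / sum(K[i][j] * u[i] for i in range(n))
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

In the full model, the resulting soft assignment weights how much each local feature contributes to each cluster of the global descriptor.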
For more details, check the paper at arXiv.

Setup
The code has been tested with PyTorch 2.1.0, CUDA 12.1, and xFormers. Create a ready-to-run environment with:
conda env create -f environment.yml
To quickly test and use our model, you can use Torch Hub:
import torch
model = torch.hub.load("serizba/salad", "dinov2_salad")
model.eval()
model.cuda()
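The model maps each image to a single global descriptor, and place recognition then reduces to nearest-neighbor search over a database of descriptors. The following is a minimal sketch of that retrieval step with plain cosine similarity; the descriptors here are tiny made-up vectors, not real SALAD outputs (which are 8192+256 dimensional for the default model).

```python
import math

def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_database(query, database):
    """Return database indices sorted by descending similarity to the query."""
    sims = [cosine(query, d) for d in database]
    return sorted(range(len(database)), key=lambda i: -sims[i])
```

In practice one would use an approximate nearest-neighbor library over the whole database rather than an exhaustive loop, but the ranking principle is the same.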
Dataset
For training, download the GSV-Cities dataset. For evaluation, download the desired datasets (MSLS, NordLand, SPED, or Pittsburgh).
Train
Training is done on GSV-Cities for 4 complete epochs and takes around 30 minutes on an NVIDIA RTX 3090. To train DINOv2 SALAD, run:
python3 main.py
After training, logs and checkpoints are written to the logs directory.
Evaluation
You can download pretrained DINOv2 SALAD models here:
<table> <thead> <tr> <th>Model Name</th> <th>Descriptor size</th> <th>Download link</th> </tr> </thead> <tbody> <tr> <td>dino_salad</td> <td>8192+256</td> <td> <a href="https://drive.google.com/file/d/1u83Dmqmm1-uikOPr58IIhfIzDYwFxCy1/view?usp=sharing">download</a></td> </tr> <tr> <td>dino_salad_512_32</td> <td>512+32</td> <td> <a href="https://drive.google.com/file/d/18SljgYj0mErBvuMoVYSJpI9BIQ7aDWDB/view?usp=sharing">download</a></td> </tr> <tr> <td>dino_salad_2048_64</td> <td>2048+64</td> <td> <a href="https://drive.google.com/file/d/1g0T5kCHfV6T-V1GWA1BlVGZb2KWzGIty/view?usp=sharing">download</a></td> </tr> </tbody> </table>

To evaluate, run:
python3 eval.py --ckpt_path 'weights/dino_salad.ckpt' --image_size 322 322 --batch_size 256 --val_datasets MSLS Nordland
<table>
<thead>
<tr>
<th colspan="3">MSLS Challenge</th>
<th colspan="3">MSLS Val</th>
<th colspan="3">NordLand</th>
</tr>
<tr>
<th>R@1</th>
<th>R@5</th>
<th>R@10</th>
<th>R@1</th>
<th>R@5</th>
<th>R@10</th>
<th>R@1</th>
<th>R@5</th>
<th>R@10</th>
</tr>
</thead>
<tbody>
<tr>
<td>75.0</td>
<td>88.8</td>
<td>91.3</td>
<td>92.2</td>
<td>96.4</td>
<td>97.0</td>
<td>76.0</td>
<td>89.2</td>
<td>92.0</td>
</tr>
</tbody>
</table>
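Recall@K, the metric reported in the table above, counts a query as correct if any of its top-K retrieved database images is a true positive for that place. A minimal sketch of the computation, with made-up retrieval results:

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of queries whose top-k retrieved indices contain a true positive.

    retrieved: list per query of database indices, best match first.
    ground_truth: list per query of the set of correct database indices.
    """
    hits = 0
    for preds, positives in zip(retrieved, ground_truth):
        if any(p in positives for p in preds[:k]):
            hits += 1
    return hits / len(retrieved)
```

For example, with two queries where only the first has its true match within the top 2, Recall@2 is 0.5.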
Acknowledgements
This code is based on the amazing work of:
Cite
Here is the BibTeX to cite our paper:
@InProceedings{Izquierdo_CVPR_2024_SALAD,
author = {Izquierdo, Sergio and Civera, Javier},
title = {Optimal Transport Aggregation for Visual Place Recognition},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
}
