# VICI: VLM-Instructed Cross-view Image-localisation

**[ACMMM UAVM 2025]**
<p align="middle">
  <a href="https://zxh009123.github.io/">Xiaohan Zhang*</a>
  <a href="https://tavisshore.co.uk/">Tavis Shore*</a>
  <a href="">Chen Chen</a>
  <br>
  <a href="https://cvssp.org/Personal/OscarMendez/index.html">Oscar Mendez</a>
  <a href="https://personalpages.surrey.ac.uk/s.hadfield/biography.html">Simon Hadfield</a>
  <a href="https://www.uvm.edu/cems/cs/profile/safwan-wshah">Safwan Wshah</a>
</p>
<p align="middle">
  <a href="https://www.wshahaigroup.com/">Vermont Artificial Intelligence Laboratory (VaiL)</a>
  <br>
  <a href="https://www.surrey.ac.uk/centre-vision-speech-signal-processing">Centre for Vision, Speech, and Signal Processing (CVSSP)</a>
  <br>
  <a href="https://www.ucf.edu/">University of Central Florida</a>
  <a href="https://locusrobotics.com/">Locus Robotics</a>
</p>

## Description
## Feature Extractors

<div align="center">

| Backbone | Params (M) | FLOPs (G) | Dims | R@1 | R@5 | R@10 |
|:----------:|:----------:|:---------:|:----:|:-----:|:-----:|:-----:|
| ConvNeXt-T | 28 | 4.5 | 768 | 1.36 | 4.34 | 7.95 |
| ConvNeXt-B | 89 | 15.4 | 1024 | 3.14 | 8.14 | 13.22 |
| ViT-B | 86 | 17.6 | 768 | 3.30 | 8.92 | 13.96 |
| ViT-L | 307 | 60.6 | 1024 | 9.62 | 23.42 | 32.73 |
| DINOv2-B | 86 | 152 | 768 | 17.37 | 36.14 | 46.96 |
| DINOv2-L | 304 | 507 | 1024 | 27.49 | 51.96 | 63.13 |
</div>

## Vision-Language Models
<div align="center">

| VLM | R@1 | R@5 | R@10 |
|:---------------------:|-------|-------|-------|
| Without Re-ranking | 27.49 | 51.96 | 63.13 |
| Gemini 2.5 Flash Lite | 23.54 | 48.39 | 63.13 |
| Gemini 2.5 Flash | 30.21 | 53.04 | 63.13 |
</div>

## Drone Augmentation
<div align="center">

| $P$ | R@1 | R@5 | R@10 |
|:---:|:-----:|:-----:|:-----:|
| 0 | 24.47 | 48.16 | 60.99 |
| 0.1 | 26.98 | 51.34 | 61.92 |
| 0.3 | 27.49 | 51.96 | 63.13 |
| 0.5 | 24.89 | 52.03 | 62.66 |
</div>

## Ablation Study and Baseline Comparison
<div align="center">

| Model | R@1 | R@5 | R@10 |
|:---------------------------------:|-------|-------|-------|
| U1652 (Zheng et al., 2020) | 1.20 | - | - |
| LPN w/o drone (Wang et al., 2021) | 0.74 | - | - |
| LPN w/ drone (Wang et al., 2021) | 0.81 | - | - |
| DINOv2-L | 24.66 | 48.00 | 59.02 |
| + Drone Data | 27.49 | 51.96 | 63.13 |
| + VLM Re-rank (Ours) | 30.21 | 53.04 | 63.13 |
</div>

## Evaluation
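The R@K numbers reported in the tables above are standard retrieval recall: the fraction of queries whose ground-truth reference lands in the top-K retrieved candidates. As a minimal sketch (not the repository's evaluation code), R@K can be computed from L2-normalised embeddings like this:

```python
import numpy as np

def recall_at_k(query_emb, ref_emb, gt_idx, ks=(1, 5, 10)):
    """Fraction of queries whose ground-truth reference is in the top-k.

    query_emb: (Q, D) L2-normalised query embeddings
    ref_emb:   (R, D) L2-normalised reference embeddings
    gt_idx:    (Q,) index of each query's true reference
    """
    sims = query_emb @ ref_emb.T            # cosine similarity, shape (Q, R)
    order = np.argsort(-sims, axis=1)       # reference indices, best first
    # Position of the ground-truth reference in each query's ranking.
    ranks = np.argmax(order == np.asarray(gt_idx)[:, None], axis=1)
    return {k: float(np.mean(ranks < k)) for k in ks}
```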
### Environment Setup

```bash
conda env create -n ENV -f requirements.yaml && conda activate ENV
```
### Stage 1 - Image Retrieval
Before running Stage 1, configure your dataset paths:
- Navigate to the `config/` directory.
- Open the `default.yaml` file (or copy it to a new file).
- Replace the placeholder values (e.g., `DATA_ROOT`) with the actual paths to your dataset and related files.
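As a quick sanity check before training, you can load your edited config and confirm that the paths actually exist. This is a hypothetical helper, not part of the repository; it assumes a flat YAML file where path-like keys contain `ROOT` or `PATH` (as in the `DATA_ROOT` placeholder above):

```python
import os
import yaml  # PyYAML

def check_config(path):
    """Load a training config and report placeholder or missing paths."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    problems = []
    for key, value in cfg.items():
        # Only inspect string values whose key looks like a filesystem path.
        if isinstance(value, str) and ("ROOT" in key.upper() or "PATH" in key.upper()):
            if not os.path.exists(value):
                problems.append(f"{key}: '{value}' does not exist")
    return problems

# e.g. check_config("config/default.yaml") returns [] once all paths are set
```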
Once your configuration file is ready, you can train Stage 1 using:
```bash
python stage_1.py --config YOUR_CONFIG_FILE_NAME
```
You can also download our pre-trained weights here.
### Stage 2 - VLM Re-ranking
To run Stage 2, you need to:
- Open the `stage_2.py` file.
- Replace the relevant placeholders (e.g., the path to the answer file from Stage 1 and your Gemini API key).
- Ensure any other required directories or options are correctly set.
Then, simply run:
```bash
python stage_2.py
```
This performs re-ranking with a Vision-Language Model (VLM) on top of the initial retrieval results. The script writes `LLM_re_ranked_answer.txt` to the answer directory, along with a `reasons.json` containing the VLM's reasoning for each re-ranking decision.
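Conceptually, Stage 2 keeps the Stage 1 candidate list and lets the VLM permute only the leading entries, while candidates further down retain their retrieval order (which would be consistent with R@10 staying unchanged in the tables above). A hedged sketch of that control flow, with a `vlm_rank` callable standing in for the actual Gemini prompt in `stage_2.py`:

```python
def rerank_top_k(candidates, vlm_rank, k=5):
    """Re-order the first k retrieval candidates using a VLM judgement.

    candidates: list of reference IDs, best-first (Stage 1 output)
    vlm_rank:   callable taking a list of IDs and returning them reordered,
                e.g. by prompting a VLM with the query and candidate images
    """
    head, tail = candidates[:k], candidates[k:]
    reordered = vlm_rank(head)
    # The VLM may only permute the window, never add or drop candidates.
    assert sorted(reordered) == sorted(head), "VLM must return a permutation"
    return reordered + tail
```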
