DIP
Official implementation of DIP: Unsupervised Dense In-Context Post-training of Visual Representations
Sophia Sirko-Galouchenko, Spyros Gidaris, <a href="https://vobecant.github.io/">Antonin Vobecky</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://thome.isir.upmc.fr">Nicolas Thome</a>
<a href="https://arxiv.org/abs/2506.18463"><img src="https://img.shields.io/badge/arXiv-DIP-b31b1b.svg" height=25em></a>
Abstract
<em> We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. </em>
Environment
git clone https://github.com/sirkosophia/DIP.git
cd DIP
conda create -n dip python=3.10.13 -y -c conda-forge
conda activate dip
pip install -r requirements_dip.txt
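Before moving on, it can help to confirm that the environment sees a GPU, since post-training is designed to run on a single A100. This is only a minimal sanity check and assumes requirements_dip.txt installs PyTorch:

```python
# Minimal environment check (assumes PyTorch was installed by
# requirements_dip.txt; adjust if your setup differs).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```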
Datasets
See Preparing Datasets for DIP for details on how to download the datasets.
Pseudo-labels
Download our COCO dense pseudo-labels by running the following commands:
mkdir masks
cd masks
wget https://huggingface.co/datasets/SophiaSirko/DIP_COCO_pseudolabels/resolve/main/dip_COCO_masks.zip
wget https://huggingface.co/datasets/SophiaSirko/DIP_COCO_pseudolabels/resolve/main/dip_COCO_masks_base.zip
unzip dip_COCO_masks.zip
unzip dip_COCO_masks_base.zip
rm dip_COCO_masks.zip
rm dip_COCO_masks_base.zip
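To sanity-check the download, you can list what was extracted. The snippet below makes no assumption about the mask file format; the masks/ path matches the commands above, so adjust it if you extracted the archives elsewhere:

```python
# List the extracted pseudo-label files (run from the directory that
# contains masks/; the internal layout of the archives may differ).
from pathlib import Path

mask_dir = Path("masks")
entries = sorted(p for p in mask_dir.rglob("*") if p.is_file())
print(f"{len(entries)} files extracted under {mask_dir}/")
for p in entries[:5]:
    print(" ", p)
```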
Post-training
To post-train the DINOv2R ViT-Small backbone on the COCO dataset, execute the following command:
torchrun posttraindip.py --config configs/dip_coco.yaml
To post-train the DINOv2R ViT-Base backbone on the COCO dataset, execute the following command:
torchrun posttraindip.py --config configs/dip_coco_base.yaml
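Both runs are driven entirely by their YAML config. If you want to check or tweak the hyperparameters before launching, you can simply dump the file; this sketch only assumes PyYAML is available and makes no assumption about which keys the config defines:

```python
# Print the training configuration before launching (PyYAML assumed;
# the keys shown are whatever configs/dip_coco.yaml actually defines).
import yaml

with open("configs/dip_coco.yaml") as f:
    cfg = yaml.safe_load(f)
print(yaml.safe_dump(cfg, sort_keys=False))
```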
Post-trained models
| Backbone | Method  | PascalVOC | ADE20K | Link     |
|----------|---------|-----------|--------|----------|
| ViT-S/14 | DINOv2R | 79.4      | 39.3   |          |
| ViT-S/14 | NeCo    | 81.0      | 38.9   |          |
| ViT-S/14 | DIP     | 81.0      | 39.7   | Download |
| ViT-B/14 | DINOv2R | 79.0      | 40.8   |          |
| ViT-B/14 | NeCo    | 82.4      | 41.2   |          |
| ViT-B/14 | DIP     | 82.1      | 42.6   | Download |
Download Post-trained Weights
# Create the output directory if it doesn't exist
mkdir -p output
wget https://github.com/your-username/your-repo/releases/download/v1.0/dip_coco_basecheckpoint-4.pth -O output/dip_coco_basecheckpoint-4.pth
wget https://github.com/your-username/your-repo/releases/download/v1.0/dip_coco_smallcheckpoint-4.pth -O output/dip_coco_smallcheckpoint-4.pth
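The checkpoints contain post-trained weights for the DINOv2R backbones, so one way to experiment with them outside the provided scripts is to load them on top of the matching DINOv2-with-registers model from torch.hub. The key layout inside the .pth file (e.g. a possible "model" wrapper) is an assumption here; inspect the checkpoint and adapt:

```python
# Hedged sketch: apply the post-trained ViT-S/14 weights to the
# DINOv2 (with registers) backbone from torch.hub.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg")
ckpt = torch.load("output/dip_coco_smallcheckpoint-4.pth", map_location="cpu")

# The checkpoint may wrap the weights (e.g. under a "model" key) -- this
# is an assumption; inspect the file and adjust the key if needed.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
missing, unexpected = backbone.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```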
Evaluation
PascalVOC:
python hummingbird/launch_humm.py -n oneshot -ae 2 -dn voc -ms 10240000 -is 504 --beta 0.07 -bs 2 -ib small -mlpout 6144 -mlpr 7 -mw output/dip_coco_smallcheckpoint-4.pth
python hummingbird/launch_humm.py -n oneshot -ae 2 -dn voc -ms 10240000 -is 504 --beta 0.07 -bs 2 -ib base -mlpout 6144 -mlpr 7 -mw output/dip_coco_basecheckpoint-4.pth
ADE20K:
python hummingbird/launch_humm.py -n oneshot -ae 2 -dn ade20k -ms 10240000 -is 504 --beta 0.07 -bs 2 -ib small -mlpout 6144 -mlpr 7 -mw output/dip_coco_smallcheckpoint-4.pth
python hummingbird/launch_humm.py -n oneshot -ae 2 -dn ade20k -ms 10240000 -is 504 --beta 0.07 -bs 2 -ib base -mlpout 6144 -mlpr 7 -mw output/dip_coco_basecheckpoint-4.pth
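These scripts follow the in-context (Hummingbird-style) evaluation protocol: dense patch features from annotated support images are stored in a memory bank whose size is set by -ms, and each query patch is labeled by a softmax-weighted vote over its most similar memory entries, with --beta acting as the temperature. The snippet below is only a conceptual sketch of that label-propagation step with made-up tensors and shapes, not the code in hummingbird/:

```python
# Conceptual sketch of dense nearest-neighbor label propagation used by
# in-context evaluation. All tensors and shapes are illustrative.
import torch
import torch.nn.functional as F

beta = 0.07            # temperature, cf. the --beta flag above
num_classes = 21       # e.g. Pascal VOC

memory_feats = F.normalize(torch.randn(10_000, 384), dim=-1)  # support patch features
memory_labels = torch.randint(0, num_classes, (10_000,))      # their per-patch labels
query_feats = F.normalize(torch.randn(1296, 384), dim=-1)     # 36x36 patches of one query

# Similarity between query patches and the memory bank, turned into
# attention weights by a temperature-scaled softmax.
attn = F.softmax(query_feats @ memory_feats.T / beta, dim=-1)  # (1296, 10000)

# Propagate one-hot labels from the memory to the query patches.
query_probs = attn @ F.one_hot(memory_labels, num_classes).float()
prediction = query_probs.argmax(dim=-1)                        # (1296,)
print(prediction.shape)
```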
Citation
@misc{sirkogalouchenko2025dipunsuperviseddenseincontext,
title={DIP: Unsupervised Dense In-Context Post-training of Visual Representations},
author={Sophia Sirko-Galouchenko and Spyros Gidaris and Antonin Vobecky and Andrei Bursuc and Nicolas Thome},
year={2025},
eprint={2506.18463},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.18463},
}
Acknowledgements
This repo relies on the following projects:
Reproduction of Towards In-context Scene Understanding
CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion
