Grounding Visual Representations with Texts (GVRT)
Grounding Visual Representations with Texts for Domain Generalization
Seonwoo Min, Nokyung Park, Siwon Kim, Seunghyun Park, Jinkyu Kim
ECCV 2022 | Official PyTorch implementation
We advocate leveraging vision-and-language cross-modality supervision for the domain generalization (DG) task.
- We propose two modules that ground visual representations in texts containing the typical reasoning of humans:
- The Visual and Textual Joint Embedder aligns visual representations with the pivot sentence embedding.
- The Textual Explanation Generator generates explanations justifying the rationale behind the model's decision.
- Our method achieves state-of-the-art results on both the CUB-DG and DomainBed benchmarks!
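As a rough illustration of the joint embedder's idea of pulling a visual feature toward its pivot sentence embedding, the sketch below uses a simple cosine-alignment loss. This is our own toy formulation, not the paper's actual objective; the `visual`/`textual` vectors and the loss form are assumptions for illustration only.

```python
import numpy as np

def cosine_alignment_loss(visual, textual, eps=1e-8):
    """Toy alignment loss: 1 - cosine similarity between a visual
    feature and its pivot sentence embedding (both 1-D vectors).
    Illustrative only; see the paper for the actual objective."""
    v = visual / (np.linalg.norm(visual) + eps)
    t = textual / (np.linalg.norm(textual) + eps)
    return 1.0 - float(np.dot(v, t))

# Perfectly aligned embeddings (same direction) give near-zero loss.
v = np.array([1.0, 0.0, 2.0])
print(round(cosine_alignment_loss(v, 2.0 * v), 6))
```

Minimizing such a loss pushes the image feature toward the direction of the sentence embedding, which is the intuition behind grounding visual representations with texts.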

Installation
We recommend creating a conda environment and installing the necessary Python packages as follows:
git clone https://github.com/mswzeus/GVRT.git
cd GVRT
ln -s ../src DomainBed_GVRT/src
conda create -n GVRT python=3.8
conda activate GVRT
conda install pytorch==1.10.2 torchvision==0.11.3 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
CUB-DG Benchmark Dataset
We created CUB-DG to investigate the cross-modality supervision in the DG task (<a href="https://drive.google.com/file/d/1BU8Jy0a1mdNCbIpUUBrQPqQfNXGXfm1f/view?usp=sharing">Download Link</a>).
CUB is an image dataset with photos of 200 bird species. For more information, please see the <a href="http://www.vision.caltech.edu/visipedia/CUB-200.html">original repo</a>.
We used pre-trained style transfer models to obtain images from three other domains, i.e. Art, Paint, and Cartoon.
- Photo-to-Art: <a href="https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix">CycleGAN (Monet)</a>
- Photo-to-Paint: <a href="https://github.com/jiupinjia/stylized-neural-painting">Stylized-Neural-Painting (Watercolor)</a>
- Photo-to-Cartoon: <a href="https://github.com/SystemErrorWang/White-box-Cartoonization">White-Box-Cartoonization model</a>
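Assuming the extracted benchmark keeps one sub-folder per domain (the folder names and layout below are our guess, not verified against the archive), iterating over the four domains might look like:

```python
from pathlib import Path

# Hypothetical layout (an assumption, not verified against the archive):
# CUB-DG/<Domain>/<class folder>/<images>
DOMAINS = ["Photo", "Art", "Paint", "Cartoon"]

def list_images(root, domain):
    """Collect image paths under one domain folder; empty if it is missing."""
    p = Path(root) / domain
    return sorted(p.rglob("*.jpg")) if p.exists() else []

for d in DOMAINS:
    print(d, len(list_images("CUB-DG", d)))
```

Keeping the same 200 class folders in every domain lets a single label mapping be reused across Photo, Art, Paint, and Cartoon.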

Pre-trained Models
We provide the following pre-trained models for three independent runs (<a href="https://drive.google.com/file/d/11CbVRWlSHWd2HPkBkp2ZanUVBFFau8Dx/view?usp=sharing">Download Link</a>).
- Ours trained with PTE (pre-trained textual encoder)
- Ours trained with STE (self-supervised textual encoder)
How to Run
Training a GVRT model
You can use the <code>train_model.py</code> script with the necessary configurations as follows:
CUDA_VISIBLE_DEVICES=0 python train_model.py --algorithm GVRT --test-env 0 --seed 0 --output-path results/PTE_test0_seed0
Evaluating a GVRT model
You can use the <code>evaluate_model.py</code> script with the necessary configurations as follows:
CUDA_VISIBLE_DEVICES=0 python evaluate_model.py --algorithm GVRT --test-env 0 --seed 0 --output-path results/PTE_test0_seed0 --checkpoint pretrained_models/PTE_test0_seed0.pt
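To reproduce the full set of runs, the two scripts above can be looped over every held-out domain and seed. The sketch below only assembles the command lines shown in the examples; the `PTE` tag, the checkpoint naming, and the assumption of four test environments (one per CUB-DG domain) follow the example paths and are not taken from the scripts themselves. Actual launching via `subprocess` is left commented out.

```python
import subprocess  # uncomment the run() call below to actually launch jobs

def gvrt_cmd(script, test_env, seed, tag="PTE"):
    """Build the command line used in the README examples."""
    out = f"results/{tag}_test{test_env}_seed{seed}"
    cmd = ["python", script, "--algorithm", "GVRT",
           "--test-env", str(test_env), "--seed", str(seed),
           "--output-path", out]
    if script == "evaluate_model.py":
        cmd += ["--checkpoint", f"pretrained_models/{tag}_test{test_env}_seed{seed}.pt"]
    return cmd

for env in range(4):       # assumed: four CUB-DG domains as held-out test environments
    for seed in range(3):  # three independent runs, as reported in the results
        print(" ".join(gvrt_cmd("train_model.py", env, seed)))
        # subprocess.run(gvrt_cmd("train_model.py", env, seed), check=True)
```

Prefixing each launch with `CUDA_VISIBLE_DEVICES` (as in the examples above) pins every run to one GPU.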
Experimental Results on CUB-DG
We report averaged results across three independent runs.
<img src="./docs/main_results.png" width="60%">
Citation
If you find our work useful, please kindly cite this paper:
@article{min2022grounding,
author = {Seonwoo Min and Nokyung Park and Siwon Kim and Seunghyun Park and Jinkyu Kim},
title = {Grounding Visual Representations with Texts for Domain Generalization},
journal = {arXiv},
volume = {abs/2207.10285},
year = {2022},
url = {https://arxiv.org/abs/2207.10285}
}