Grounding Visual Representations with Texts (GVRT)
Grounding Visual Representations with Texts for Domain Generalization
Seonwoo Min, Nokyung Park, Siwon Kim, Seunghyun Park, Jinkyu Kim
ECCV 2022 | Official PyTorch implementation
We advocate leveraging vision-and-language cross-modality supervision for the domain generalization (DG) task.
- We propose two modules that ground visual representations in texts containing the typical reasoning of humans:
- The Visual and Textual Joint Embedder aligns visual representations with the pivot sentence embedding.
- The Textual Explanation Generator generates explanations justifying the rationale behind the model's decision.
- Our method achieves state-of-the-art results on both the CUB-DG and DomainBed benchmarks!
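As a rough illustration of the joint embedder's idea of pulling a visual feature toward its pivot sentence embedding, the sketch below uses a simple cosine-alignment loss. This is our own toy formulation, not the paper's actual objective; the `visual`/`textual` vectors and the loss form are assumptions for illustration only.

```python
import numpy as np

def cosine_alignment_loss(visual, textual, eps=1e-8):
    """Toy alignment loss: 1 - cosine similarity between a visual
    feature and its pivot sentence embedding (both 1-D vectors).
    Illustrative only; see the paper for the actual objective."""
    v = visual / (np.linalg.norm(visual) + eps)
    t = textual / (np.linalg.norm(textual) + eps)
    return 1.0 - float(np.dot(v, t))

# Perfectly aligned embeddings (same direction) give near-zero loss.
v = np.array([1.0, 0.0, 2.0])
print(round(cosine_alignment_loss(v, 2.0 * v), 6))
```

Minimizing such a loss pushes the image feature toward the direction of the sentence embedding, which is the intuition behind grounding visual representations with texts.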

Installation
We recommend creating a conda environment and installing the necessary Python packages as follows:
git clone https://github.com/mswzeus/GVRT.git
cd GVRT
ln -s ../src DomainBed_GVRT/src
conda create -n GVRT python=3.8
conda activate GVRT
conda install pytorch==1.10.2 torchvision==0.11.3 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
CUB-DG Benchmark Dataset
We created CUB-DG to investigate the cross-modality supervision in the DG task (<a href="https://drive.google.com/file/d/1BU8Jy0a1mdNCbIpUUBrQPqQfNXGXfm1f/view?usp=sharing">Download Link</a>).
CUB is an image dataset with photos of 200 bird species. For more information, please see the <a href="http://www.vision.caltech.edu/visipedia/CUB-200.html">original repo</a>.
We used pre-trained style transfer models to obtain images from three other domains, i.e. Art, Paint, and Cartoon.
- Photo-to-Art: <a href="https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix">CycleGAN (Monet)</a>
- Photo-to-Paint: <a href="https://github.com/jiupinjia/stylized-neural-painting">Stylized-Neural-Painting (Watercolor)</a>
- Photo-to-Cartoon: <a href="https://github.com/SystemErrorWang/White-box-Cartoonization">White-Box-Cartoonization model</a>
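Assuming the extracted benchmark keeps one sub-folder per domain (the folder names and layout below are our guess, not verified against the archive), iterating over the four domains might look like:

```python
from pathlib import Path

# Hypothetical layout (an assumption, not verified against the archive):
# CUB-DG/<Domain>/<class folder>/<images>
DOMAINS = ["Photo", "Art", "Paint", "Cartoon"]

def list_images(root, domain):
    """Collect image paths under one domain folder; empty if it is missing."""
    p = Path(root) / domain
    return sorted(p.rglob("*.jpg")) if p.exists() else []

for d in DOMAINS:
    print(d, len(list_images("CUB-DG", d)))
```

Keeping the same 200 class folders in every domain lets a single label mapping be reused across Photo, Art, Paint, and Cartoon.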

Pre-trained Models
We provide the following pre-trained models for three independent runs (<a href="https://drive.google.com/file/d/11CbVRWlSHWd2HPkBkp2ZanUVBFFau8Dx/view?usp=sharing">Download Link</a>).
- Ours trained with PTE (pre-trained textual encoder)
- Ours trained with STE (self-supervised textual encoder)
How to Run
Training a GVRT model
You can use the <code>train_model.py</code> script with the necessary configurations as follows:
CUDA_VISIBLE_DEVICES=0 python train_model.py --algorithm GVRT --test-env 0 --seed 0 --output-path results/PTE_test0_seed0
Evaluating a GVRT model
You can use the <code>evaluate_model.py</code> script with the necessary configurations as follows:
CUDA_VISIBLE_DEVICES=0 python evaluate_model.py --algorithm GVRT --test-env 0 --seed 0 --output-path results/PTE_test0_seed0 --checkpoint pretrained_models/PTE_test0_seed0.pt
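To reproduce the full set of runs, the two scripts above can be looped over every held-out domain and seed. The sketch below only assembles the command lines shown in the examples; the `PTE` tag, the checkpoint naming, and the assumption of four test environments (one per CUB-DG domain) follow the example paths and are not taken from the scripts themselves. Actual launching via `subprocess` is left commented out.

```python
import subprocess  # uncomment the run() call below to actually launch jobs

def gvrt_cmd(script, test_env, seed, tag="PTE"):
    """Build the command line used in the README examples."""
    out = f"results/{tag}_test{test_env}_seed{seed}"
    cmd = ["python", script, "--algorithm", "GVRT",
           "--test-env", str(test_env), "--seed", str(seed),
           "--output-path", out]
    if script == "evaluate_model.py":
        cmd += ["--checkpoint", f"pretrained_models/{tag}_test{test_env}_seed{seed}.pt"]
    return cmd

for env in range(4):       # assumed: four CUB-DG domains as held-out test environments
    for seed in range(3):  # three independent runs, as reported in the results
        print(" ".join(gvrt_cmd("train_model.py", env, seed)))
        # subprocess.run(gvrt_cmd("train_model.py", env, seed), check=True)
```

Prefixing each launch with `CUDA_VISIBLE_DEVICES` (as in the examples above) pins every run to one GPU.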
Experimental Results on CUB-DG
We report averaged results across three independent runs.
<img src="./docs/main_results.png" width="60%">
Citation
If you find our work useful, please kindly cite this paper:
@article{min2022grounding,
author = {Seonwoo Min and Nokyung Park and Siwon Kim and Seunghyun Park and Jinkyu Kim},
title = {Grounding Visual Representations with Texts for Domain Generalization},
journal = {arXiv},
volume = {abs/2207.10285},
year = {2022},
url = {https://arxiv.org/abs/2207.10285}
}