Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
This repository provides the official PyTorch implementation of our CVPR 2024 paper:
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models [Paper]
Authors: Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, Lei Zhang
Overview
This repository contains the implementation of DMN for image classification with a pre-trained CLIP model. We consider four task settings:
- Zero-shot classification in a test-time adaptation manner
- Few-shot classification
- Training-free few-shot classification
- Out-of-distribution generalization
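As a rough intuition for the training-free setting, a memory-based classifier can be sketched as a key-value cache of image features: a test prediction blends CLIP's zero-shot text logits with affinity-weighted votes from stored few-shot features. The plain-Python sketch below is a simplified illustration only, not the paper's exact formulation; the function and parameter names (`alpha`, `beta`, etc.) are hypothetical.

```python
import math

def _norm(v):
    """L2-normalize a feature vector given as a list of floats."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dual_memory_logits(test_feat, text_feats, cached_feats, cached_labels,
                       num_classes, alpha=0.5, beta=5.0):
    """Blend zero-shot text logits with a training-free feature cache.

    test_feat:     feature vector of the test image
    text_feats:    one text feature per class (zero-shot classifier weights)
    cached_feats:  stored few-shot image features (a simplified "static memory")
    cached_labels: class index of each cached feature
    alpha:         mixing weight between text and cache logits (hypothetical)
    beta:          sharpness of the cache affinity exp(-beta * (1 - sim))
    """
    q = _norm(test_feat)
    # Zero-shot logits: cosine similarity to each class text feature.
    text_logits = [sum(a * b for a, b in zip(q, _norm(t))) for t in text_feats]
    # Cache logits: affinity-weighted votes from the stored few-shot features.
    cache_logits = [0.0] * num_classes
    for feat, label in zip(cached_feats, cached_labels):
        sim = sum(a * b for a, b in zip(q, _norm(feat)))
        cache_logits[label] += math.exp(-beta * (1.0 - sim))
    return [t + alpha * c for t, c in zip(text_logits, cache_logits)]
```

For example, with two classes and a single cached feature near class 0, a test feature aligned with class 0's text feature receives a boosted class-0 logit from both memories.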
Prerequisites
Hardware
This implementation assumes a single-GPU configuration. All experiments can be reproduced on a GPU with more than 10 GB of memory (e.g., a GTX 1080 Ti).
Environment
The code is tested on PyTorch 1.13.1.
Datasets
We suggest downloading all datasets into a single root directory (${data_root}) and renaming each dataset's directory as specified in ${ID_to_DIRNAME} in ./data/datautils.py. This lets you evaluate multiple datasets in the same run.
If this is not feasible, you can evaluate datasets separately and change ${data_root} accordingly in the bash script.
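To illustrate this convention, the helper below resolves a dataset directory from a root path and an ID-to-directory mapping. The mapping entry here is a hypothetical placeholder; the actual dataset IDs and directory names live in ${ID_to_DIRNAME} in ./data/datautils.py.

```python
import os

# Hypothetical example entry; see ID_to_DIRNAME in ./data/datautils.py
# for the actual dataset IDs and directory names used by the code.
ID_to_DIRNAME = {
    "example_set": "ExampleDataset",
}

def resolve_dataset_dir(data_root, set_id):
    """Map a set_id to its on-disk directory under the dataset root."""
    if set_id not in ID_to_DIRNAME:
        raise KeyError(f"Unknown set_id: {set_id!r}")
    return os.path.join(data_root, ID_to_DIRNAME[set_id])
```

With this layout, pointing ${data_root} at the shared root is the only per-machine change needed.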
For zero/few-shot classification, we consider 11 datasets:
For out-of-distribution generalization, we consider 4 datasets:
Run DMN
We provide a simple bash script under ./scripts/run.sh. You can modify the paths and other arguments in the script. All results can be reproduced with:

```bash
bash ./scripts/run.sh
```
For simplicity, we use set_id to denote different datasets. A complete list of set_id can be found in ${ID_to_DIRNAME} in ./data/datautils.py.
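The per-machine edits to the script typically look like the fragment below. This is illustrative only: the actual variable names and defaults are defined in ./scripts/run.sh, and `example_set` is a hypothetical placeholder for a real set_id from ${ID_to_DIRNAME}.

```shell
# Illustrative fragment; the real variable names live in ./scripts/run.sh.
data_root=/path/to/datasets   # root directory containing all renamed datasets
set_id=example_set            # dataset identifier (see ID_to_DIRNAME)

bash ./scripts/run.sh
```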
Main Results
Zero-shot Classification
<p align="center"> <img src="figures/zero-shot.png"> </p>
Few-shot Classification
<p align="center"> <img src="figures/few-shot.png"> </p>
<p align="center"> Few-shot classification results on 11 datasets with a ViT-B/16 image encoder. </p>
Out-of-Distribution Generalization
<div align="center">

| Method | ImageNet (IN) | IN-A | IN-V2 | IN-R | IN-Sketch | Average | OOD Average |
|------------------|:--------:|:----------:|:-----------:|:----------:|:---------------:|:-------:|:-----------:|
| CLIP-RN50 | 58.16 | 21.83 | 51.41 | 56.15 | 33.37 | 44.18 | 40.69 |
| Ensembled prompt | 59.81 | 23.24 | 52.91 | 60.72 | 35.48 | 46.43 | 43.09 |
| CoOp | 63.33 | 23.06 | 55.40 | 56.60 | 34.67 | 46.61 | 42.43 |
| CoCoOp | 62.81 | 23.32 | 55.72 | 57.74 | 34.48 | 46.81 | 42.82 |
| TPT | 60.74 | 26.67 | 54.70 | 59.11 | 35.09 | 47.26 | 43.89 |
| DMN-ZS | 63.87 | 28.57 | 56.12 | 61.44 | 39.84 | 49.97 | 46.49 |

</div>

Citation
If you find our code useful or our work relevant, please consider citing:
```bibtex
@inproceedings{zhang2024dual,
  title={Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models},
  author={Zhang, Yabin and Zhu, Wenjie and Tang, Hui and Ma, Zhiyuan and Zhou, Kaiyang and Zhang, Lei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```
Acknowledgements
We thank the authors of CoOp/CoCoOp and TPT for their open-source implementations and instructions on data preparation.
