GiT
[ECCV2024 Oral] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"
The first GPT-style general vision model, unifying various vision tasks with only a vanilla ViT. No negative transfer.
This repo is the official implementation of the ECCV2024 <font color=Red>Oral</font> paper GiT: Towards Generalist Vision Transformer through Universal Language Interface, as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, and state-of-the-art, with only minimal dependencies.
<div align="center"> <img src="assets/Figure1.png" width="800"/> </div>

GiT: Towards Generalist Vision Transformer through Universal Language Interface
Haiyang Wang*, Hao Tang*, Li Jiang $^\dagger$, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang $^\dagger$
- Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn ), Hao Tang ( tanghao@stu.pku.edu.cn )
News
- [24-8-12] Our GiT was accepted by ECCV2024 for an <font color=Red>Oral</font> presentation.
- [24-7-01] Our GiT was accepted by ECCV2024.
- [24-3-15] Training and inference code is released.
- [24-3-15] GiT is released on arXiv.
What we want to do
Model architectures across various AI domains are converging towards <font color=Red>Multi-Layer Plain Transformers</font>:
- Language Modeling (GPT)
- 2D Image Modeling (ViT)
- 3D Point Cloud Modeling (DSVT)
- 2D Image and 3D Point Cloud Joint Modeling (UniTR)
- Graph Modeling (Graphormer)
- $\cdot \cdot \cdot$
Reducing Human Bias in Model Architecture Design
We aim to unify the model architecture of vision and language through a plain transformer, reducing human biases such as modality-specific encoders and task-specific heads. A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, inspiring us to reduce human-designed aspects in architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities like point clouds and graphs.
What we achieve
Building a universal computation model across all tasks stands as the cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:
- Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, with no additional vision encoders or adapters.
- Covering all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
- Achieving multi-task ability through a unified language interface: Similar to LLMs, GiT exhibits a task synergy effect in multi-task training. Tasks mutually enhance one another, leading to significant improvements over isolated training, with no negative transfer observed.
- Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.
- Simple one-stage training strategy: GiT uses a very simple one-stage training strategy, fully embracing the training style of current LLM frameworks.
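The README does not spell out how the unified language interface encodes visual outputs. As a rough illustration only, the sketch below assumes a Pix2Seq-style scheme in which box coordinates are quantized into discrete tokens and followed by a class-name token. The bin count, the `<bin_k>` token format, and the function names are all hypothetical and are not taken from the GiT codebase.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # hypothetical number of coordinate bins, not a GiT setting


def box_to_tokens(box: Sequence[float], image_size: Tuple[int, int],
                  label: str, num_bins: int = NUM_BINS) -> List[str]:
    """Encode a pixel box (x1, y1, x2, y2) and its class name as text tokens."""
    width, height = image_size
    extents = [width, height, width, height]
    tokens = []
    for value, extent in zip(box, extents):
        # Clamp to the image, then map [0, extent] to a bin in [0, num_bins - 1].
        ratio = min(max(value / extent, 0.0), 1.0)
        bin_index = min(int(ratio * num_bins), num_bins - 1)
        tokens.append(f"<bin_{bin_index}>")
    return tokens + [label]


def tokens_to_box(tokens: Sequence[str], image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> Tuple[Tuple[float, ...], str]:
    """Decode tokens back to an approximate pixel box (bin centers) and a label."""
    width, height = image_size
    extents = [width, height, width, height]
    bins = [int(tok[len("<bin_"):-1]) for tok in tokens[:4]]
    box = tuple((b + 0.5) / num_bins * e for b, e in zip(bins, extents))
    return box, tokens[4]


tokens = box_to_tokens((160, 120, 480, 360), (640, 480), "person")
decoded_box, decoded_label = tokens_to_box(tokens, (640, 480))
print(tokens)
print(decoded_box, decoded_label)
```

Under this assumed encoding, a detection target becomes an ordinary token sequence, so the same next-token prediction objective used for captioning can in principle cover it. The round trip is lossy by at most one bin width per coordinate.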
Overview
- What we want to do
- Introduction
- Main Results
- Quick Start
- Todo
- Acknowledgments
- Citation
Main Results
Single-Task Benchmark
| Model | Params | Metric | Performance | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|
| GiT-B<sub>detection</sub> | 131M | mAP | 45.1 | ckpt | log | config |
| GiT-B<sub>insseg</sub> | 131M | mAP | 31.4 | ckpt | log | config |
| GiT-B<sub>semseg</sub> | 131M | mIoU | 47.7 | ckpt | log | config |
| GiT-B<sub>caption</sub> | 131M | BLEU-4 | 33.7 | ckpt | log | config |
| GiT-B<sub>grounding</sub> | 131M | Acc@0.5 | 83.3 | ckpt | log | config |
Multi-Tasking Benchmark
| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 | ckpt | log | config |
| GiT-L<sub>multi-task</sub> | 387M | 51.3 | 35.1 | 50.6 | 35.7 | 88.4 | ckpt | log | config |
| GiT-H<sub>multi-task</sub> | 756M | 52.9 | 35.8 | 52.4 | 36.2 | 89.2 | ckpt | log | config |
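As a quick check on the two tables above, the per-task gain of GiT-B<sub>multi-task</sub> over the single-task GiT-B models can be computed directly from the reported scores. The snippet below is only a convenience sketch: the numbers are copied from the tables, while the dictionary and function names are made up for illustration.

```python
# Scores copied from the Single-Task and Multi-Tasking tables in this README
# (metrics: mAP, mAP, mIoU, BLEU-4, Acc@0.5, in that order).
single_task_b = {"Detection": 45.1, "Ins Seg": 31.4, "Sem Seg": 47.7,
                 "Caption": 33.7, "Grounding": 83.3}
multi_task_b = {"Detection": 46.7, "Ins Seg": 31.9, "Sem Seg": 47.8,
                "Caption": 35.3, "Grounding": 85.8}


def task_gains(single: dict, multi: dict) -> dict:
    """Return multi-task minus single-task score per task, rounded to 0.1."""
    return {task: round(multi[task] - single[task], 1) for task in single}


gains = task_gains(single_task_b, multi_task_b)
for task, gain in gains.items():
    print(f"{task:<10} {single_task_b[task]:>5.1f} -> {multi_task_b[task]:>5.1f}  ({gain:+.1f})")
```

Every gain is positive (from +0.1 on semantic segmentation to +2.5 on grounding), which is consistent with the "no negative transfer" claim for the base model.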
Task Synergy in Multi-Tasking Training
| Model |Params| Detection | Ins S
