The first GPT-style general vision model, unifying various vision tasks with only a vanilla ViT. No negative transfer.

</h5>

This repo is the official implementation of the ECCV2024 Oral paper GiT: Towards Generalist Vision Transformer through Universal Language Interface, as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant on only minimal dependencies.

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang*, Hao Tang*, Li Jiang $^\dagger$, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang $^\dagger$

Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn ), Hao Tang ( tanghao@stu.pku.edu.cn )

📣 News

[24-8-12] 🤗 Our GiT was selected for an Oral presentation at ECCV2024.
[24-7-01] 🤗 Our GiT was accepted by ECCV2024.
[24-3-15] 🚀 Training and inference code is released.
[24-3-15] 👀 GiT is released on arXiv.

💫 What we want to do

Model architectures across various AI domains are converging towards multi-layer plain transformers.

Language Modeling (GPT)
2D Image Modeling (ViT)
3D Point Cloud Modeling (DSVT)
2D Image and 3D Point Cloud Joint Modeling (UniTR)
Graph Modeling (Graphormer)
$\cdot \cdot \cdot$

Reducing Human Bias in Model Architecture Design

We aim to unify the model architecture of vision and language through a plain transformer, reducing human biases such as modality-specific encoders and task-specific heads. A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, inspiring us to reduce human-designed aspects in architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities like point clouds and graphs.

🤔 What we achieve

Building a universal computation model across all tasks stands as the cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:

😮 Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, without additional vision encoders or adapters.
🚀 Covering all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
🤗 Achieving multi-task ability through a unified language interface: Similar to LLMs, GiT exhibits a task synergy effect in multi-task training. Tasks mutually enhance one another, leading to significant improvements over isolated (single-task) training. No negative transfer is observed.
🔥 Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.
👍 Simple one-stage training strategy: GiT uses a very simple one-stage training strategy, fully embracing the training style used by current LLM frameworks.
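
As a rough illustration of what a unified language interface could look like, the sketch below serializes a detection target and a caption into the same kind of flat token sequence. It is not taken from the GiT codebase: the bin count `NUM_BINS`, the `<bin_k>` token format, and the helper names `quantize`, `box_to_tokens`, and `caption_to_tokens` are all hypothetical, chosen only to show how box coordinates might be discretized so that a single transformer can emit them like words.

```python
from typing import List, Tuple

NUM_BINS = 1000  # hypothetical number of discrete coordinate bins


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(ratio * num_bins), num_bins - 1)


def box_to_tokens(
    box: Tuple[float, float, float, float],
    label: str,
    image_size: Tuple[int, int],
) -> List[str]:
    """Serialize one (x1, y1, x2, y2) box plus its class label as tokens."""
    x1, y1, x2, y2 = box
    width, height = image_size
    bins = [
        quantize(x1, width),
        quantize(y1, height),
        quantize(x2, width),
        quantize(y2, height),
    ]
    return [f"<bin_{b}>" for b in bins] + [label]


def caption_to_tokens(caption: str) -> List[str]:
    """Toy whitespace tokenizer standing in for a real text tokenizer."""
    return caption.lower().split()


# Both tasks end up as plain token lists that one decoder could produce.
detection_tokens = box_to_tokens((0.0, 0.0, 320.0, 240.0), "dog", (640, 480))
caption_tokens = caption_to_tokens("A dog on the grass")
print(detection_tokens)
print(caption_tokens)
```

Because both outputs share one token space in this sketch, no task-specific head is needed to tell them apart; only the target sequence differs.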

Overview

🚀 Main Results

Single-Task Benchmark

| Model | Params | Metric | Performance | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|
| GiT-B<sub>detection</sub> | 131M | mAP | 45.1 | ckpt | log | config |
| GiT-B<sub>insseg</sub> | 131M | mAP | 31.4 | ckpt | log | config |
| GiT-B<sub>semseg</sub> | 131M | mIoU | 47.7 | ckpt | log | config |
| GiT-B<sub>caption</sub> | 131M | BLEU-4 | 33.7 | ckpt | log | config |
| GiT-B<sub>grounding</sub> | 131M | Acc@0.5 | 83.3 | ckpt | log | config |

Multi-Tasking Benchmark

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 | ckpt | log | config |
| GiT-L<sub>multi-task</sub> | 387M | 51.3 | 35.1 | 50.6 | 35.7 | 88.4 | ckpt | log | config |
| GiT-H<sub>multi-task</sub> | 756M | 52.9 | 35.8 | 52.4 | 36.2 | 89.2 | ckpt | log | config |
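
The gain from multi-task training for GiT-B can be worked out directly from the two benchmark tables. The short script below does that arithmetic, assuming each multi-task column reports the same metric as the matching single-task row (mAP, mAP, mIoU, BLEU-4, Acc@0.5). The dictionary and function names are illustrative only.

```python
# Scores copied from the single-task and multi-task tables (GiT-B, 131M params).
single_task = {
    "Detection": 45.1,
    "Ins Seg": 31.4,
    "Sem Seg": 47.7,
    "Caption": 33.7,
    "Grounding": 83.3,
}
multi_task = {
    "Detection": 46.7,
    "Ins Seg": 31.9,
    "Sem Seg": 47.8,
    "Caption": 35.3,
    "Grounding": 85.8,
}


def synergy_gains(single: dict, multi: dict) -> dict:
    """Per-task difference (multi minus single), rounded to one decimal."""
    return {task: round(multi[task] - single[task], 1) for task in single}


gains = synergy_gains(single_task, multi_task)
for task, gain in gains.items():
    print(f"{task:<10} {single_task[task]:5.1f} -> {multi_task[task]:5.1f} ({gain:+.1f})")
```

Every task improves under joint training (by +1.6, +0.5, +0.1, +1.6, and +2.5 points respectively), which is consistent with the no-negative-transfer claim for the base model.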

Task Synergy in Multi-Tasking Training

| Model |Params| Detection | Ins S
