
GiT

[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"


The first GPT-style general vision model unifies various vision tasks only with a vanilla ViT. No negative transfer.


This repo is the official implementation of the ECCV2024 <font color=Red>Oral</font> paper GiT: Towards Generalist Vision Transformer through Universal Language Interface, as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang*, Hao Tang*, Li Jiang $^\dagger$, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang $^\dagger$

  • Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn ), Hao Tang ( tanghao@stu.pku.edu.cn )
<div align="center"> <img src="assets/Figure1.png" width="800"/> </div>

📣 News

  • [24-8-12] 🤗 Our GiT was accepted by ECCV2024 with an <font color=Red>Oral</font> presentation.
  • [24-7-01] 🤗 Our GiT was accepted by ECCV2024.
  • [24-3-15] 🚀 Training and inference code are released.
  • [24-3-15] 👀 GiT is released on arXiv.

💫 What we want to do

Model architectures across various AI domains are converging toward <font color=Red>Multi-Layer Plain Transformers</font>.

  • Language Modeling (GPT)
  • 2D Image Modeling (ViT)
  • 3D Point Cloud Modeling (DSVT)
  • 2D Image and 3D Point Cloud Joint Modeling (UniTR)
  • Graph Modeling (Graphormer)
  • $\cdots$

Reducing Human Bias in Model Architecture Designing

We aim to unify the model architecture of vision and language through a plain transformer, reducing human biases such as modality-specific encoders and task-specific heads. A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, inspiring us to reduce human-designed aspects in architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities like point clouds and graphs.

🤔 What we achieve

Building a universal computation model across all tasks stands as the cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:

  • 😮 Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, without additional vision encoders or adapters.
  • 🚀 Covering all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
  • 🤗 Achieving multi-task ability via a unified language interface: Similar to LLMs, GiT observes a task-synergy effect in multi-task training: tasks mutually enhance one another, leading to significant improvements over isolated training, with no negative transfer.
  • 🔥 Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.
  • 👍 Simple one-stage training strategy: GiT uses a very simple one-stage training strategy, fully embracing the training style of current LLM frameworks.

Overview

🚀 Main Results

Single-Task Benchmark

| Model | Params | Metric | Performance | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|
| GiT-B<sub>detection</sub> | 131M | mAP | 45.1 | ckpt | log | config |
| GiT-B<sub>insseg</sub> | 131M | mAP | 31.4 | ckpt | log | config |
| GiT-B<sub>semseg</sub> | 131M | mIoU | 47.7 | ckpt | log | config |
| GiT-B<sub>caption</sub> | 131M | BLEU-4 | 33.7 | ckpt | log | config |
| GiT-B<sub>grounding</sub> | 131M | Acc@0.5 | 83.3 | ckpt | log | config |

Multi-Tasking Benchmark

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 | ckpt | log | config |
| GiT-L<sub>multi-task</sub> | 387M | 51.3 | 35.1 | 50.6 | 35.7 | 88.4 | ckpt | log | config |
| GiT-H<sub>multi-task</sub> | 756M | 52.9 | 35.8 | 52.4 | 36.2 | 89.2 | ckpt | log | config |
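The task-synergy claim can be checked directly from the reported numbers: comparing GiT-B trained jointly on all five tasks against the per-task GiT-B models in the single-task table, every metric improves. A few lines of arithmetic over the published scores:

```python
# Deltas between multi-task GiT-B and single-task GiT-B,
# using the scores reported in the two benchmark tables.
single_task = {"Detection": 45.1, "Ins Seg": 31.4, "Sem Seg": 47.7,
               "Caption": 33.7, "Grounding": 83.3}
multi_task = {"Detection": 46.7, "Ins Seg": 31.9, "Sem Seg": 47.8,
              "Caption": 35.3, "Grounding": 85.8}

gains = {task: round(multi_task[task] - single_task[task], 1)
         for task in single_task}
print(gains)
# {'Detection': 1.6, 'Ins Seg': 0.5, 'Sem Seg': 0.1, 'Caption': 1.6, 'Grounding': 2.5}
```

All deltas are non-negative, which is the "no negative transfer" claim in numbers.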


Task Synergy in Multi-Tasking Training

| Model |Params| Detection | Ins S
