SkillAgentSearch skills...

APE

[CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception

Install / Use

/learn @shenyunhang/APE

README

APE: Aligning and Prompting Everything All at Once for Universal Visual Perception

<!-- <a href='https://github.com/shenyunhang/APE'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2312.02153'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/spaces/shenyunhang/APE'><img src='https://img.shields.io/badge/%F0%9F%A4%97-Demo-yellow'></a> <a href='https://huggingface.co/shenyunhang/APE'><img src='https://img.shields.io/badge/%F0%9F%A4%97-Model-yellow'></a> [![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE) --> <p align="center"> <img src="./.asset/ape.png" width="96%" height="96%"> </p>

<font size=7><div align='center' > :grapes: [Read our arXiv Paper]   :apple: [Try our Online Demo] </div></font>


<p align="center"> <img src="./.asset/example_1.png" width="96%" height="96%"> </p>

:bulb: Highlight

  • High Performance. SotA (or competitive) performance on 160 datasets with only one model.
  • Perception in the Wild. Detect and segment everything with thousands of vocabularies or language descriptions all at once.
  • Flexible. Support both foreground objects and background stuff for instance segmentation and semantic segmentation.

:fire: News

  • 2024.04.07 Release checkpoints for APE-Ti with only 6M backbone!
  • 2024.02.27 APE has been accepted to CVPR 2024!
  • 2023.12.05 Release training codes!
  • 2023.12.05 Release checkpoints for APE-L!
  • 2023.12.05 Release inference codes and demo!

:label: TODO

  • [x] Release inference code and demo.
  • [x] Release checkpoints.
  • [x] Release training codes.
  • [ ] Add clean docs.

:hammer_and_wrench: Install

  1. Clone the APE repository from GitHub:
git clone https://github.com/shenyunhang/APE
cd APE
  1. Install the required dependencies and APE:
pip3 install -r requirements.txt
python3 -m pip install -e .

:arrow_forward: Demo Localy

Web UI demo

pip3 install gradio
cd APE/demo
python3 app.py

This demo will detect GPUs and use one GPU if you have GPUs.

Please feel free to try our Online Demo!

<p align="center"> <img src="./.asset/demo.png" width="96%" height="96%"> </p>

:books: Data Prepare

Following here to prepare the following datasets:

| Name | COCO | LVIS | Objects365 | Openimages | VisualGenome | SA-1B | RefCOCO | GQA | PhraseCut | Flickr30k | | |:-----:|:-------:|:-------:|:-----------:|:----------:|:------------:|:-------:|:----------:|:-------:|:---------:|:---------:|:-------:| | Train | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | | Test | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | | | | | | | | | | | | | | | | Name | ODinW | SegInW | Roboflow100 | ADE20k | ADE-full | BDD10k | Cityscapes | PC459 | PC59 | VOC | D3 | | Train | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | | Test | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Noted we do not use coco_2017_train for training.

Instead, we augment lvis_v1_train with annotations from coco, and keep the image set unchanged.

And we register it as lvis_v1_train+coco for instance segmentation and lvis_v1_train+coco_panoptic_separated for panoptic segmentation.

:test_tube: Inference

Infer on 160+ dataset

We provide several scripts to evaluate all models.

It is necessary to adjust the checkpoint location and GPU number in the scripts before running them.

scripts/eval_APE-L_D.sh
scripts/eval_APE-L_C.sh
scripts/eval_APE-L_B.sh
scripts/eval_APE-L_A.sh
scripts/eval_APE-Ti.sh

Infer on images or videos

APE-L_D

python3 demo/demo_lazy.py \
--config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO_GQA_PhraseCut_Flickr30k/ape_deta/ape_deta_vitl_eva02_clip_vlf_lsj1024_cp_16x4_1080k.py \
--input image1.jpg image2.jpg image3.jpg \
--output /path/to/output/dir \
--confidence-threshold 0.1 \
--text-prompt 'person,car,chess piece of horse head' \
--with-box \
--with-mask \
--with-sseg \
--opts \
train.init_checkpoint=/path/to/APE-D/checkpoint \
model.model_language.cache_dir="" \
model.model_vision.select_box_nums_for_evaluation=500 \
model.model_vision.text_feature_bank_reset=True \

To disable xformers, add the following option:

model.model_vision.backbone.net.xattn=False \

To use pytorch version of MultiScaleDeformableAttention, add the following option:

model.model_vision.transformer.encoder.pytorch_attn=True \
model.model_vision.transformer.decoder.pytorch_attn=True \

:train: Training

Prepare backbone and language models

git lfs install
git clone https://huggingface.co/QuanSun/EVA-CLIP models/QuanSun/EVA-CLIP/
git clone https://huggingface.co/BAAI/EVA models/BAAI/EVA/
git clone https://huggingface.co/Yuxin-CV/EVA-02 models/Yuxin-CV/EVA-02/

Resize patch size:

python3 tools/eva_interpolate_patch_14to16.py --input models/QuanSun/EVA-CLIP/EVA02_CLIP_E_psz14_plus_s9B.pt --output models/QuanSun/EVA-CLIP/EVA02_CLIP_E_psz14to16_plus_s9B.pt --image_size 224
python3 tools/eva_interpolate_patch_14to16.py --input models/QuanSun/EVA-CLIP/EVA01_CLIP_g_14_plus_psz14_s11B.pt --output models/QuanSun/EVA-CLIP/EVA01_CLIP_g_14_plus_psz14to16_s11B.pt --image_size 224
python3 tools/eva_interpolate_patch_14to16.py --input models/QuanSun/EVA-CLIP/EVA02_CLIP_L_336_psz14_s6B.pt --output models/QuanSun/EVA-CLIP/EVA02_CLIP_L_336_psz14to16_s6B.pt --image_size 336
python3 tools/eva_interpolate_patch_14to16.py --input models/Yuxin-CV/EVA-02/eva02/pt/eva02_Ti_pt_in21k_p14.pt --output models/Yuxin-CV/EVA-02/eva02/pt/eva02_Ti_pt_in21k_p14to16.pt --image_size 224

Train APE-L_D

Single node:

python3 tools/train_net.py \
--num-gpus 8 \
--resume \
--config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO_GQA_PhraseCut_Flickr30k/ape_deta/ape_deta_vitl_eva02_clip_vlf_lsj1024_cp_16x4_1080k_mdl.py \
train.output_dir=output/APE/configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO_GQA_PhraseCut_Flickr30k/ape_deta/ape_deta_vitl_eva02_clip_vlf_lsj1024_cp_16x4_1080k_mdl_`date +'%Y%m%d_%H%M%S'`

Multiple nodes:

python3 tools/train_net.py \
--dist-url="tcp://${MASTER_IP}:${MASTER_PORT}" \
--num-gpus ${HOST_GPU_NUM} \
--num-machines ${HOST_NUM} \
--machine-rank ${INDEX} \
--resume \
--config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO_GQA_PhraseCut_Flickr30k/ape_deta/ape_deta_vitl_eva02_clip_vlf_lsj1024_cp_16x4_1080k_mdl.py \
train.output_dir=output/APE/configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO_GQA_PhraseCut_Flickr30k/ape_deta/ape_deta_vitl_eva02_clip_vlf_lsj1024_cp_16x4_1080k_mdl_`date +'%Y%m%d_%H'`0000

Train APE-L_C

Single node:

python3 tools/train_net.py \
--num-gpus 8 \
--resume \
--config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO/ape_deta/ape_deta_vitl_eva02_vlf_lsj1024_cp_1080k.py \
train.output_dir=output/APE/configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO/ape_deta/ape_deta_vitl_eva02_vlf_lsj1024_cp_1080k_`date +'%Y%m%d_%H%M%S'`

Multiple nodes:

python3 tools/train_net.py \
--dist-url="tcp://${MASTER_IP}:${MASTER_PORT}" \
--num-gpus ${HOST_GPU_NUM} \
--num-machines ${HOST_NUM} \
--machine-rank ${INDEX} \
--resume \
--config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO/ape_deta/ape_deta_vitl_eva02_vlf_lsj1024_cp_1080k.py \
train.output_dir=output/APE/configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO/ape_deta/ape_deta_vitl_eva02_vlf_lsj1024_cp_1080k_`date +'%Y%m%d_%H'`0000

Train APE-L_B

Single node:

python3 tools/train_net.py \
--num-gpus 8 \
--resume \
--config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_REFCOCO/ape_deta/ape_deta_vitl_eva02_vlf_lsj1024_cp_1080k.py \
train.output_dir=output/APE/configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_REFCOCO/ape_deta/ape_deta_vitl_eva02_vlf_lsj1024_cp_1080k_`date +'%Y%m%d_%H%M%S'`

Multiple nodes:

python3 tools/train_net.py \
--dist-url="tcp://${MASTER_IP}:${MASTER_PORT}" \
--num-gpus ${HOST_GPU_NUM} \
--num-machines ${HOST_NUM} \
--machine-rank ${INDEX} \
--resume \
--config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_REFCOCO/ape_deta/ape_deta_vitl_eva02_vlf_lsj1024_cp_1080k.py \
train.output_dir=output/APE/configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_REFCOCO/ape_deta/ape_deta_vitl_eva02_vlf_lsj1024_cp_1080k_`date +'%Y%m%d_%H'`0000

Train APE-L_A

Single node:

python3 tools/train_net.py \
--num-gpus 8 \
--resume \
--config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VG/ape_deta/ape_deta_vitl_eva02_lsj1024_cp_720k.py \
train.output_dir=output/APE/configs/LVISCOCOCOCOSTUFF_O365_OID_VG/ape_deta/ape_deta_vitl_eva02_lsj1024_cp_720k_`date +'%Y%m%d_%H%M%S'`

Multiple nodes:

python3 tools/train_net.py \
--dist-url="tcp://${MASTER_IP}:${MASTER_PORT}" \
--num-gpus ${HOST_GPU_NUM} \
--num-machines ${HOST_NUM} \
--machine-rank ${INDEX} \
--resume \
--config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VG/ape_deta/ape_deta_vitl_eva02_lsj1024_cp_720k.py \
train.output_dir=output/APE/configs/LVISCOCOCOCOSTUFF_O365_OID_VG/ape_deta/ape_deta_vitl_eva02_lsj1024_cp_720k_`date +'%Y%m%d_%H'`0000

Train APE-Ti

Single node:

python3 tools/train_net.py \
--num-gpus 8 \
--r

Related Skills

View on GitHub
GitHub Stars609
CategoryDevelopment
Updated2d ago
Forks46

Languages

Python

Security Score

100/100

Audited on Apr 1, 2026

No findings