UniFormer
[ICLR2022] official implementation of UniFormer
<a src="https://img.shields.io/badge/%F0%9F%A4%97-Open%20in%20Spaces-blue" href="https://huggingface.co/spaces/Andy1621/uniformer_light "> <img src="https://img.shields.io/badge/%F0%9F%A4%97-Open%20in%20Spaces-blue" alt="Open in Huggingface"> </a> <a src="https://img.shields.io/badge/cs.CV-2305.06355-b31b1b?logo=arxiv&logoColor=red" href="https://arxiv.org/abs/2201.09450"> <img src="https://img.shields.io/badge/cs.CV-2305.06355-b31b1b?logo=arxiv&logoColor=red"> </a> <a src="https://img.shields.io/badge/cs.CV-2305.06355-b31b1b?logo=arxiv&logoColor=red" href="https://arxiv.org/abs/2201.04676"> <img src="https://img.shields.io/badge/cs.CV-2201.04676-b31b1b?logo=arxiv&logoColor=red"> </a>

💬 This repo is the official implementation of:
- TPAMI2023: UniFormer: Unifying Convolution and Self-attention for Visual Recognition
- ICLR2022: UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
🤖 It currently includes code and models for the following tasks:
- [x] Image Classification
- [x] Video Classification
- [x] Object Detection
- [x] Semantic Segmentation
- [x] Pose Estimation
- [x] Lightweight Model (see `exp_light` in each task)
🌟 Other popular repos:
- UniFormerV2: The first model to achieve 90% top-1 accuracy on Kinetics-400.
- Unmasked Teacher: Using only public sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performances on various video tasks.
- Ask-Anything: Ask anything in video and image!
⚠️ Note!!!!!
For downstream tasks:
- We forgot to freeze BN in the backbone; freezing it will further improve performance.
- We have verified that Token Labeling can largely help the downstream tasks. Give it a try if you use UniFormer in a competition or application.
- The `head_dim` of some models is `32`, which leads to a large memory cost but little improvement for downstream tasks. Models with `head_dim=64` are released in image_classification.
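To illustrate the `head_dim` note above, here is a minimal sketch that counts attention scores in standard multi-head self-attention, where the number of heads is the embedding width divided by `head_dim`. The function name `attention_map_elements`, the 512-channel width, and the 50x50 feature map are hypothetical values chosen for illustration, not settings taken from this repo.

```python
def attention_map_elements(embed_dim: int, head_dim: int, num_tokens: int) -> int:
    """Count attention scores stored by one attention layer for one image.

    Assumes standard multi-head self-attention: each head keeps a
    num_tokens x num_tokens score matrix.
    """
    if embed_dim % head_dim != 0:
        raise ValueError("embed_dim must be divisible by head_dim")
    num_heads = embed_dim // head_dim
    return num_heads * num_tokens * num_tokens


# Hypothetical downstream setting: 512 channels on a 50x50 feature map.
embed_dim = 512
num_tokens = 50 * 50

small_heads = attention_map_elements(embed_dim, head_dim=32, num_tokens=num_tokens)
large_heads = attention_map_elements(embed_dim, head_dim=64, num_tokens=num_tokens)

print(f"head_dim=32: {small_heads:,} scores")
print(f"head_dim=64: {large_heads:,} scores")
print(f"ratio: {small_heads // large_heads}x")
```

Under these assumptions, halving the number of heads halves the size of the attention maps, which dominate memory at the large input resolutions used in detection and segmentation.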
🔥 Updates
05/19/2023
The extended version has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 🎉🎉🎉. In the revision, we explore a simple yet effective lightweight design: Hourglass UniFormer. Based on that, we propose the efficient UniFormer-XS and UniFormer-XXS:
- For image tasks, they surpass MobileViT, PVTv2 and EfficientNet.
- For video tasks, they surpass X3D and MoViNet.
- Try our 🚀fast demo🚀 on CPU!
11/20/2022
We have released UniFormerV2, which aims to arm pre-trained ViTs with efficient UniFormer designs. It can save a lot of training resources and achieves powerful performance on 8 popular benchmarks. Please have a try! 🎉🎉
10/26/2022
We have provided the code for video visualizations; please see video_classification/vis.
05/24/2022
- Some bugs for video recognition have been fixed in Nightcrawler. We successfully adapted UniFormer for extremely dark video classification! 🎉🎉
- More demos for Detection and Segmentation are provided. 👏😄
03/6/2022
Some models with head_dim=64 are released, which can save memory cost for downstream tasks.
02/9/2022
Some popular models and demos are updated on Hugging Face.
02/3/2022
Integrated into Hugging Face Spaces using Gradio. Have fun!
01/21/2022
UniFormer for video is accepted by ICLR2022 (8868, Top 3%)!
01/19/2022
- Pretrained models on ImageNet-1K with Token Labeling.
- Large resolution fine-tuning.
01/18/2022
- The supported code and models for COCO object detection.
- The supported code and models for ADE20K semantic segmentation.
- The supported code and models for COCO pose estimation.
01/13/2022
- Pretrained models on ImageNet-1K, Kinetics-400, Kinetics-600, Something-Something V1&V2.
- The supported code and models for image classification and video classification are provided.
📖 Introduction
UniFormer (Unified transFormer) is introduced in two arXiv papers (linked in the badges above). It seamlessly integrates the merits of convolution and self-attention in a concise transformer format. We adopt local MHRA (Multi-Head Relation Aggregator) in shallow layers to greatly reduce the computation burden, and global MHRA in deep layers to learn global token relations.
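The motivation for using local relation aggregation in shallow layers can be illustrated with a rough count of token-to-token affinities per layer. This is only a back-of-the-envelope sketch: the helper `pairwise_affinities`, the 224x224 input, the stage strides of 4/8/16/32, and the 5x5 local neighbourhood are assumptions made for illustration, not values read from this repo's configs.

```python
from typing import Optional


def pairwise_affinities(input_size: int, stride: int, window: Optional[int] = None) -> int:
    """Count token-to-token affinities for one layer on a square input.

    window=None models a global aggregator (every token attends to every token);
    an integer window models a local aggregator over a window x window neighbourhood.
    """
    side = input_size // stride
    num_tokens = side * side
    neighbours = num_tokens if window is None else window * window
    return num_tokens * neighbours


INPUT_SIZE = 224   # assumed input resolution
LOCAL_WINDOW = 5   # assumed local neighbourhood size

for stride in (4, 8, 16, 32):  # assumed per-stage downsampling rates
    global_cost = pairwise_affinities(INPUT_SIZE, stride)
    local_cost = pairwise_affinities(INPUT_SIZE, stride, window=LOCAL_WINDOW)
    print(f"stride {stride:>2}: global={global_cost:>10,}  local={local_cost:>8,}")
```

Under these assumptions, the global count grows quadratically with the number of tokens, so it is largest in the high-resolution shallow stages. The local count grows only linearly, which is why restricting shallow layers to a small neighbourhood saves the most computation.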
Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it achieves state-of-the-art performance on a broad range of downstream tasks: 82.9/84.8 top-1 accuracy on Kinetics-400/600 and 60.9/71.2 top-1 accuracy on Something-Something V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. Moreover, we build an efficient UniFormer with a concise hourglass design of token shrinking and recovering, which achieves 2-4× higher throughput than recent lightweight models.
<div align=center> <h3> General Framework </h3> </div>
<div align="center"> <img src="figures/framework.png" width="80%"> </div>
<div align=center> <h3> Efficient Framework </h3> </div>
<div align="center"> <img src="figures/efficient_uniformer.png" width="80%"> </div>
<div align=center> <h3> Different Downstream Tasks </h3> </div>
<div align="center"> <img src="figures/dense_adaption.jpg" width="100%"> </div>

Main results on ImageNet-1K
Please see image_classification for more details.
More models with large resolution and token labeling will be released soon.

| Model | Pretrain | Resolution | Top-1 | #Param. | FLOPs |
| --------------- | ----------- | ---------- | ----- | ------- | ----- |
| UniFormer-XXS | ImageNet-1K | 128x128 | 76.8 | 10.2M | 0.43G |
| UniFormer-XXS | ImageNet-1K | 160x160 | 79.1 | 10.2M | 0.67G |
| UniFormer-XXS | ImageNet-1K | 192x192 | 79.9 | 10.2M | 0.96G |
| UniFormer-XXS | ImageNet-1K | 224x224 | 80.6 | 10.2M | 1.3G |
| UniFormer-XS | ImageNet-1K | 192x192 | 81.5 | 16.5M | 1.4G |
| UniFormer-XS | ImageNet-1K | 224x224 | 82.0 | 16.5M | 2.0G |
| UniFormer-S | ImageNet-1K | 224x224 | 82.9 | 22M | 3.6G |
| UniFormer-S† | ImageNet-1K | 224x224 | 83.4 | 24M | 4.2G |
| UniFormer-B | ImageNet-1K | 224x224 | 83.9 | 50M | 8.3G |
| UniFormer-S+TL | ImageNet-1K | 224x224 | 83.4 | 22M | 3.6G |
| UniFormer-S†+TL | ImageNet-1K | 224x224 | 83.9 | 24M | 4.2G |
| UniFormer-B+TL | ImageNet-1K | 224x224 | 85.1 | 50M | 8.3G |
| UniFormer-L+TL | ImageNet-1K | 224x224 | 85.6 | 100M | 12.6G |
| UniFormer-S+TL | ImageNet-1K | 384x384 | 84.6 | 22M | 11.9G |
| UniFormer-S†+TL | ImageNet-1K | 384x384 | 84.9 | 24M | 13.7G |
| UniFormer-B+TL | ImageNet-1K | 384x384 | 86.0 | 50M | 27.2G |
| UniFormer-L+TL | ImageNet-1K | 384x384 | 86.3 | 100M | 39.2G |

Main results on Kinetics video classification
Please see video_classification for more details.

| Model | Pretrain | #Frame | Sampling Stride | FLOPs | K400 Top-1 | K600 Top-1 |
| ----------- | ----------- | ------ | --------------- | ----- | ---------- | ---------- |
| UniFormer-S | ImageNet-1K | 16x1x4 | 4 | 167G | 80.8 | 82.8 |
| UniFormer-S | ImageNet-1K | 16x1x4 | 8 | 167G | 80.8 | 82.7 |
| UniFormer-S | ImageNet-1K | 32x1x4 | 4 | 438G | 82.0 | - |
| UniFormer-B | ImageNet-1K | 16x1x4 | 4 | 387G | 82.0 | 84.0 |
| UniFormer-B | ImageNet-1K | 16x1x4 | 8 | 387G | 81.7 | 83.4 |
| UniFormer-B | ImageNet-1K | 32x1x4 | 4 | 1036G | 82.9 | 84.5* |

| Model | Pretrain | #Frame | Resolution | FLOPs | K400 Top-1 |
| ------------- | ----------- | ------ | ---------- | ----- | ---------- |
| UniFormer-XXS | ImageNet-1K | 4x1x1 | 128 | 1.0G | 63.2 |
| UniFormer-XXS | ImageNet-1K | 4x1x1 | 160 | 1.6G | 65.8 |
| UniFormer-XXS | ImageNet-1K | 8x1x1 | 128 | 2.0G | 68.3 |
| UniFormer-XXS | ImageNet-1K | 8x1x1 | 160 | 3.3G | 71.4 |
| UniFormer-XXS | ImageNet-1K | 16x1x1 | 128 | 4.2G | 73.3 |
| UniFormer-XXS | ImageNet-1K | 16x1x1 | 160 | 6.9G | 75.1 |
| UniFormer-XXS | ImageNet-1K
