InternImage
[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
InternImage: Large-Scale Vision Foundation Model
The official implementation of
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.
[Paper] [Blog in Chinese]
Highlights
- :thumbsup: The strongest open-source general-purpose vision backbone, with up to 3 billion parameters
- 🏆 Achieved 90.1% Top-1 accuracy on ImageNet, the most accurate among open-source models
- 🏆 Achieved 65.5 mAP on the COCO object detection benchmark, the only model to exceed 65.0 mAP
News
- Jan 22, 2024: 🚀 DCNv4 is now supported in InternImage!
- Feb 28, 2023: 🚀 InternImage is accepted to CVPR 2023!
- Nov 18, 2022: 🚀 InternImage-XL merged into BEVFormer v2 achieves state-of-the-art performance of 63.4 NDS on nuScenes camera-only detection.
- Nov 10, 2022: 🚀 InternImage-H achieves a new record of 65.4 mAP on COCO detection test-dev and 62.9 mIoU on ADE20K, outperforming previous models by a large margin.
History
- [x] Models for other downstream tasks
- [x] Support CVPR 2023 Workshop on End-to-End Autonomous Driving, see here
- [x] Support extracting intermediate features, see here
- [x] Low-cost training with DeepSpeed, see here
- [x] Compiling-free
.whlpackage of DCNv3 operator, see here - [x] InternImage-H(1B)/G(3B)
- [x] TensorRT inference for classification/detection/segmentation models
- [x] Classification code of the InternImage series
- [x] InternImage-T/S/B/L/XL ImageNet-1K pretrained model
- [x] InternImage-L/XL ImageNet-22K pretrained model
- [x] InternImage-T/S/B/L/XL detection and instance segmentation model
- [x] InternImage-T/S/B/L/XL semantic segmentation model
Introduction
InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
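The repository implements DCNv3 as a custom CUDA/PyTorch operator, but the core idea behind the "dynamic receptive field" it describes can be illustrated in plain Python: each tap of a 3×3 kernel is displaced by a learned offset and scaled by a learned modulation weight before sampling the input bilinearly. The sketch below is a hypothetical, single-channel simplification invented for this explanation; the names `bilinear` and `deformable_point` are not APIs from the repo, and real DCNv3 additionally uses grouped channels, softmax-normalized weights, and learned offset/weight predictors.

```python
import math

def bilinear(img, y, x):
    """Bilinearly interpolate img (a list of rows of floats) at float
    coordinates (y, x); out-of-bounds neighbors contribute zero."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    out = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < h and 0 <= xx < w:
                # weight decays linearly with distance to the neighbor
                out += img[yy][xx] * (1 - abs(y - yy)) * (1 - abs(x - xx))
    return out

def deformable_point(img, cy, cx, offsets, weights):
    """Aggregate a 3x3 neighborhood around (cy, cx), where tap k is
    displaced by offsets[k] = (dy, dx) and scaled by weights[k]."""
    taps = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    out = 0.0
    for k, (gy, gx) in enumerate(taps):
        dy, dx = offsets[k]
        out += weights[k] * bilinear(img, cy + gy + dy, cx + gx + dx)
    return out
```

With all offsets zero and unit weights this reduces to an ordinary fixed 3×3 aggregation; nonzero, input-dependent offsets let each tap move toward relevant content, which is the adaptive spatial aggregation the paragraph above refers to.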
<div align=center> <img src='./docs/figs/arch.png' width=400> </div>

Some other projects related to InternImage include the pretraining algorithm "M3I-Pretraining," the general-purpose decoder series "Uni-Perceiver," and the autonomous driving perception encoder series "BEVFormer."
<div align=left> <img src='./docs/figs/intern_pipeline_en.png' width=900> </div>

Performance
- Using only publicly available training data, InternImage achieved an impressive Top-1 accuracy of 90.1% on the ImageNet benchmark. Apart from two undisclosed models trained by Google and Microsoft on additional datasets, it is the only open-source model to exceed 90.0% Top-1 accuracy, and it is also the largest open-source model in scale.
- InternImage outperformed all other models on the COCO object detection benchmark with a remarkable 65.5 mAP, the only model to surpass 65 mAP.
- InternImage also delivered the best performance on 16 other important visual benchmarks spanning classification, detection, and segmentation, making it the top-performing model across multiple domains.
Classification
<table border="1" width="90%"> <tr align="center"> <th colspan="1"> Image Classification</th><th colspan="2"> Scene Classification </th><th colspan="1">Long-Tail Classification</th> </tr> <tr align="center"> <th>ImageNet</th><th>Places365</th><th>Places 205</th><th>iNaturalist 2018</th> </tr> <tr align="center"> <th>90.1</th><th>61.2</th><th>71.7</th><th>92.6</th> </tr> </table>

Detection
<table border="1" width="90%"> <tr align="center"> <th colspan="4"> General Object Detection </th><th colspan="3"> Long-Tail Object Detection </th><th colspan="1"> Autonomous Driving Object Detection </th><th colspan="1"> Dense Object Detection </th> </tr> <tr align="center"> <th>COCO</th><th>VOC 2007</th><th>VOC 2012</th><th>OpenImage</th><th>LVIS minival</th><th>LVIS val</th><th>BDD100K</th><th>nuScenes</th><th>CrowdHuman</th> </tr> <tr align="center"> <th>65.5</th><th>94.0</th><th>97.2</th><th>74.1</th><th>65.8</th><th>63.2</th><th>38.8</th><th>64.8</th><th>97.2</th> </tr> </table>

Segmentation
<table border="1" width="90%"> <tr align="center"> <th colspan="3">Semantic Segmentation</th><th colspan="1">Street Segmentation</th><th colspan="1">RGBD Segmentation</th> </tr> <tr align="center"> <th>ADE20K</th><th>COCO Stuff-10K</th><th>Pascal Context</th><th>CityScapes</th><th>NYU Depth V2</th> </tr> </table>
