
InternImage

[CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions


<p> <a href="./README_CN.md">[Chinese version]</a> </p>

InternImage: Large-Scale Vision Foundation Model


The official implementation of

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

[Paper] [Blog in Chinese]

Highlights

  • :thumbsup: The strongest open-source universal vision backbone, scaling up to 3 billion parameters
  • 🏆 Achieved 90.1% Top-1 accuracy on ImageNet, the highest among open-source models
  • 🏆 Achieved 65.5 mAP on the COCO object detection benchmark, the only model to exceed 65.0 mAP

News

  • Jan 22, 2024: 🚀 Support DCNv4 in InternImage!
  • Feb 28, 2023: 🚀 InternImage is accepted to CVPR 2023!
  • Nov 18, 2022: 🚀 InternImage-XL, integrated into BEVFormer v2, achieves state-of-the-art performance of 63.4 NDS on nuScenes camera-only detection.
  • Nov 10, 2022: 🚀 InternImage-H achieves a new record 65.4 mAP on COCO detection test-dev and 62.9 mIoU on ADE20K, outperforming previous models by a large margin.

History

  • [x] Models for other downstream tasks
  • [x] Support CVPR 2023 Workshop on End-to-End Autonomous Driving, see here
  • [x] Support extracting intermediate features, see here
  • [x] Low-cost training with DeepSpeed, see here
  • [x] Compiling-free .whl package of DCNv3 operator, see here
  • [x] InternImage-H(1B)/G(3B)
  • [x] TensorRT inference for classification/detection/segmentation models
  • [x] Classification code of the InternImage series
  • [x] InternImage-T/S/B/L/XL ImageNet-1K pretrained model
  • [x] InternImage-L/XL ImageNet-22K pretrained model
  • [x] InternImage-T/S/B/L/XL detection and instance segmentation model
  • [x] InternImage-T/S/B/L/XL semantic segmentation model

Introduction

InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
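The idea behind DCNv3 — each output location aggregates a small set of sampling points whose offsets and modulation weights are predicted from the input — can be sketched in plain NumPy. This is a simplified single-group, single-channel illustration under stated assumptions: the real operator is a batched multi-group CUDA kernel in this repository, and the names `bilinear_sample` and `dcnv3_point` below are illustrative, not the repo's API.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinearly interpolate 2-D map x at fractional location (py, px);
    out-of-bounds corners contribute zero."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    wy1, wx1 = py - y0, px - x0
    val = 0.0
    for yi, wy in ((y0, 1.0 - wy1), (y0 + 1, wy1)):
        for xi, wx in ((x0, 1.0 - wx1), (x0 + 1, wx1)):
            if 0 <= yi < H and 0 <= xi < W:
                val += wy * wx * x[yi, xi]
    return val

def dcnv3_point(x, p0, offsets, modulation):
    """One output value: sum over a 3x3 base grid of sampling points,
    each shifted by a learned offset and scaled by a modulation weight
    (softmax-normalized across the K points in DCNv3)."""
    base = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    py0, px0 = p0
    out = 0.0
    for k, (dy, dx) in enumerate(base):
        oy, ox = offsets[k]
        out += modulation[k] * bilinear_sample(x, py0 + dy + oy, px0 + dx + ox)
    return out

# With zero offsets and uniform weights (1/9 each), the operator reduces
# to a plain 3x3 average around p0; learned offsets let the effective
# receptive field deform adaptively per location.
x = np.arange(25, dtype=float).reshape(5, 5)
offsets = np.zeros((9, 2))
modulation = np.full(9, 1.0 / 9.0)
avg = dcnv3_point(x, (2, 2), offsets, modulation)  # mean of x[1:4, 1:4] = 12.0
```

Because the offsets are continuous and input-dependent, the sampling grid deforms per location, which is what gives the model the dynamic receptive fields described above.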

<div align=center> <img src='./docs/figs/arch.png' width=400> </div>

Some other projects related to InternImage include the pretraining algorithm "M3I-Pretraining," the general-purpose decoder series "Uni-Perceiver," and the autonomous driving perception encoder series "BEVFormer."

<div align=left> <img src='./docs/figs/intern_pipeline_en.png' width=900> </div>

Performance

  • For image classification, InternImage achieves 90.1% Top-1 accuracy on ImageNet using only publicly available training data. Apart from two undisclosed models trained on additional private datasets by Google and Microsoft, it is the only open-source model to exceed 90.0% Top-1 accuracy, and it is also the largest vision model in scale worldwide.
  • On the COCO object detection benchmark, InternImage reaches 65.5 mAP, making it the only model in the world to surpass 65 mAP.
  • InternImage also delivers world-leading performance on 16 other important visual benchmarks spanning classification, detection, and segmentation, making it the top-performing model across multiple domains.

Classification

<table border="1" width="90%"> <tr align="center"> <th colspan="1"> Image Classification</th><th colspan="2"> Scene Classification </th><th colspan="1">Long-Tail Classification</th> </tr> <tr align="center"> <th>ImageNet</th><th>Places365</th><th>Places 205</th><th>iNaturalist 2018</th> </tr> <tr align="center"> <th>90.1</th><th>61.2</th><th>71.7</th><th>92.6</th> </tr> </table>

Detection

<table border="1" width="90%"> <tr align="center"> <th colspan="4"> General Object Detection </th><th colspan="3"> Long-Tail Object Detection </th><th colspan="1"> Autonomous Driving Object Detection </th><th colspan="1"> Dense Object Detection </th> </tr> <tr align="center"> <th>COCO</th><th>VOC 2007</th><th>VOC 2012</th><th>OpenImage</th><th>LVIS minival</th><th>LVIS val</th><th>BDD100K</th><th>nuScenes</th><th>CrowdHuman</th> </tr> <tr align="center"> <th>65.5</th><th>94.0</th><th>97.2</th><th>74.1</th><th>65.8</th><th>63.2</th><th>38.8</th><th>64.8</th><th>97.2</th> </tr> </table>

Segmentation

<table border="1" width="90%"> <tr align="center"> <th colspan="3">Semantic Segmentation</th><th colspan="1">Street Segmentation</th><th colspan="1">RGBD Segmentation</th> </tr> <tr align="center"> <th>ADE20K</th><th>COCO Stuff-10K</th><th>Pascal Context</th><th>CityScapes</th><th>NYU Depth V2</th> </tr> </table>
