GroundingDINO
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
<div align="center">
<img src="./.asset/grounding_dino_logo.png" width="30%">
</div>
# :sauropod: Grounding DINO
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang<sup>:email:</sup>.
PyTorch implementation and pretrained models for Grounding DINO. For details, see the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
- 🔥 Grounded SAM 2 is released now, which combines Grounding DINO with SAM 2 for any object tracking in open-world scenarios.
- 🔥 Grounding DINO 1.5 is released now, which is IDEA Research's Most Capable Open-World Object Detection Model!
- 🔥 Grounding DINO and Grounded SAM are now supported in Hugging Face. For more convenient use, you can refer to the documentation.
## :sun_with_face: Helpful Tutorial
- :grapes: [Read our arXiv Paper]
- :apple: [Watch our simple introduction video on YouTube]
- :blossom: [Try the Colab Demo]
- :sunflower: [Try our Official Huggingface Demo]
- :maple_leaf: [Watch the Step by Step Tutorial about GroundingDINO by Roboflow AI]
- :mushroom: [GroundingDINO: Automated Dataset Annotation and Evaluation by Roboflow AI]
- :hibiscus: [Accelerate Image Annotation with SAM and GroundingDINO by Roboflow AI]
- :white_flower: [Autodistill: Train YOLOv8 with ZERO Annotations based on Grounding-DINO and Grounded-SAM by Roboflow AI]
## :sparkles: Highlight Projects
- Semantic-SAM: a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity
- DetGPT: Detect What You Need via Reasoning
- Grounded-SAM: Marrying Grounding DINO with Segment Anything
- Grounding DINO with Stable Diffusion
- Grounding DINO with GLIGEN for Controllable Image Editing
- OpenSeeD: A Simple and Strong Openset Segmentation Model
- SEEM: Segment Everything Everywhere All at Once
- X-GPT: Conversational Visual Agent supported by X-Decoder
- GLIGEN: Open-Set Grounded Text-to-Image Generation
- LLaVA: Large Language and Vision Assistant
## :bulb: Highlight
- Open-Set Detection. Detect everything with language!
- High Performance. COCO zero-shot 52.5 AP (training without COCO data!). COCO fine-tune 63.0 AP.
- Flexible. Collaboration with Stable Diffusion for Image Editing.
## :fire: News
- 2023/07/18: We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoint are available!
- 2023/06/17: We provide an example to evaluate Grounding DINO on COCO zero-shot performance.
- 2023/04/15: Refer to CV in the Wild Readings for those who are interested in open-set recognition!
- 2023/04/08: We release demos to combine Grounding DINO with GLIGEN for more controllable image editing.
- 2023/04/08: We release demos to combine Grounding DINO with Stable Diffusion for image editing.
- 2023/04/06: We build a new demo by marrying GroundingDINO with Segment-Anything, named Grounded-Segment-Anything, which aims to support segmentation in GroundingDINO.
- 2023/03/28: A YouTube video about Grounding DINO and basic object detection prompt engineering. [SkalskiP]
- 2023/03/28: Add a demo on Hugging Face Space!
- 2023/03/27: Support CPU-only mode. Now the model can run on machines without GPUs.
- 2023/03/25: A demo for Grounding DINO is available at Colab. [SkalskiP]
- 2023/03/22: Code is available now!
## :star: Explanations/Tips for Grounding DINO Inputs and Outputs
- Grounding DINO accepts an `(image, text)` pair as input.
- It outputs `900` (by default) object boxes. Each box has similarity scores across all input words (as shown in the figures below).
- By default, we select the boxes whose highest similarity is greater than a `box_threshold`.
- We extract the words whose similarities are higher than the `text_threshold` as predicted labels.
- If you want to obtain objects for specific phrases, like `dogs` in the sentence `two dogs with a stick.`, you can select the boxes with the highest text similarities with `dogs` as the final outputs.
- Note that each word can be split into more than one token by different tokenizers, so the number of words in a sentence may not equal the number of text tokens.
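The selection logic above can be sketched in a few lines. This is an illustrative reimplementation with NumPy, not the library's actual code: the shapes and the toy `logits` values are assumptions, but the filtering mirrors the `box_threshold` / `text_threshold` behavior described above.

```python
import numpy as np

def filter_outputs(logits, tokens, box_threshold=0.35, text_threshold=0.25):
    """Sketch of Grounding DINO output filtering.

    logits: (num_boxes, num_tokens) array of box-to-token similarity scores.
    Keeps boxes whose best token similarity exceeds box_threshold, then
    labels each kept box with the tokens scoring above text_threshold.
    """
    keep = logits.max(axis=1) > box_threshold          # per-box filter
    labels = []
    for row in logits[keep]:
        phrase = [tok for tok, score in zip(tokens, row) if score > text_threshold]
        labels.append(" ".join(phrase))                # tokens -> predicted label
    return keep, labels

# Toy example: 3 candidate boxes scored against "two dogs with a stick"
tokens = ["two", "dogs", "with", "a", "stick"]
logits = np.array([
    [0.10, 0.80, 0.05, 0.02, 0.12],   # confident "dogs" box
    [0.05, 0.10, 0.03, 0.01, 0.70],   # confident "stick" box
    [0.02, 0.20, 0.04, 0.03, 0.15],   # low-confidence box, filtered out
])
keep, labels = filter_outputs(logits, tokens)
print(labels)  # → ['dogs', 'stick']
```

To recover objects for a specific phrase such as `dogs`, you would additionally restrict the per-box scores to that phrase's tokens before ranking, as the last bullet above suggests.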
