
DOSOD

A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space


<div align="center"> <br> <img src="./assets/DOSOD_LOGO.png" width=70%> <br> <a href="https://github.com/YonghaoHe">Yonghao He</a><sup><span>1,*,🌟 </span></sup>, <a href="https://people.ucas.edu.cn/~suhu">Hu Su</a><sup><span>2,*,📧</span></sup>, <a href="https://github.com/HarveyYesan">Haiyong Yu</a><sup><span>1,*</span></sup>, <a href="https://cong-yang.github.io/">Cong Yang</a><sup><span>3</span></sup>, <a href="">Wei Sui</a><sup><span>1</span></sup>, <a href="">Cong Wang</a><sup><span>1</span></sup>, <a href="https://www.amnrlab.org">Song Liu</a><sup><span>4,📧</span></sup> <br>

* Equal contribution, 🌟 Project lead, 📧 Corresponding author

<sup>1</sup> D-Robotics, <br> <sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences,<br> <sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <br> <sup>4</sup> School of Information Science and Technology, ShanghaiTech University <br>

<div>

arxiv paper license

</div> </div>

🔥 Updates

[2024-12-27]: Decoupled Open-Set Object Detector (DOSOD), with ultra real-time speed and superior accuracy, is released. We sincerely welcome contributions of all kinds, such as porting DOSOD to more edge-side platforms, as well as feedback and suggestions.

1. Introduction

1.1 Brief Introduction of DOSOD

<div align="center"> <img width=800px src="./assets/dosod_architecture.png"> </div>

Since YOLO-World established a new state of the art in open-vocabulary object detection, open-vocabulary detection has been applied extensively across various scenarios, and real-time open-vocabulary detection has attracted significant attention. In our paper, Decoupled Open-Set Object Detection (DOSOD) is proposed as a practical and highly efficient solution for real-time OSOD tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline, integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor converts text embeddings extracted by the VLM into a joint space, within which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. At test time, DOSOD functions like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection.
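The decoupled alignment described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the dimensions, the two-layer adaptor, and the cosine-similarity scoring are assumptions chosen to show the idea that text embeddings are mapped into the joint space once and then compared with region features by a plain similarity, with no cross-modal attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Hypothetical MLP adaptor: maps frozen VLM text embeddings
    into the joint space shared with region features."""
    def __init__(self, text_dim: int = 512, joint_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, joint_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_emb)

def joint_space_scores(region_feats: torch.Tensor,
                       text_emb: torch.Tensor,
                       adaptor: MLPAdaptor) -> torch.Tensor:
    """Align the two modalities by cosine similarity in the joint
    space -- a single matmul, with no image-text feature interaction."""
    t = F.normalize(adaptor(text_emb), dim=-1)   # (C, D) class prototypes
    r = F.normalize(region_feats, dim=-1)        # (N, D) proposal features
    return r @ t.t()                             # (N, C) class scores

# toy shapes: 100 class-agnostic proposals, 80 category prompts
adaptor = MLPAdaptor()
scores = joint_space_scores(torch.randn(100, 256), torch.randn(80, 512), adaptor)
print(scores.shape)  # torch.Size([100, 80])
```

Because the adaptor only touches the text branch, the vocabulary can be embedded once and reused, which is what makes the detector behave like a closed-set model at inference.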

1.2 Repo Structure

Our implementation is based on YOLO-World; the newly added code can be found in the following scripts:

2. Model Overview

Following YOLO-World, we also pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on the LVIS minival and COCO val2017. All pre-trained models are released.

2.1 Zero-shot Evaluation on LVIS minival

<div><font size=2>

| model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | weights |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| YOLO-Worldv1-S<br>(repo) | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | HF Checkpoints 🤗 |
| YOLO-Worldv1-M<br>(repo) | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | HF Checkpoints 🤗 |
| YOLO-Worldv1-L<br>(repo) | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | HF Checkpoints 🤗 |
| YOLO-Worldv1-S<br>(paper) | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | HF Checkpoints 🤗 |
| YOLO-Worldv1-M<br>(paper) | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | HF Checkpoints 🤗 |
| YOLO-Worldv1-L<br>(paper) | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | HF Checkpoints 🤗 |
| YOLO-Worldv2-S | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | HF Checkpoints 🤗 |
| YOLO-Worldv2-M | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | HF Checkpoints 🤗 |
| YOLO-Worldv2-L | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | HF Checkpoints 🤗 |
| DOSOD-S | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | HF Checkpoints 🤗 |
| DOSOD-M | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | HF Checkpoints 🤗 |
| DOSOD-L | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | HF Checkpoints 🤗 |

NOTE: The results of YOLO-Worldv1 from repo and paper are different.

</font></div>
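Zero-shot evaluation with a fixed vocabulary, such as the LVIS and COCO class lists above, can exploit the decoupled design: the class names are embedded once offline, after which inference costs the same as a closed-set detector. The sketch below illustrates this; the embedding table and linear adaptor are placeholders for the real frozen VLM text encoder and trained MLP adaptor, and all shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholders (hypothetical) for the frozen VLM text encoder
# and the trained MLP adaptor described in the paper.
text_encoder = nn.Embedding(1000, 512)   # stands in for a CLIP-style encoder
adaptor = nn.Linear(512, 256)            # stands in for the MLP adaptor

@torch.no_grad()
def build_classifier(class_ids: torch.Tensor) -> torch.Tensor:
    """Run once offline: vocabulary -> fixed (C, D) classifier weights,
    exactly the shape of a closed-set detection head."""
    w = adaptor(text_encoder(class_ids))
    return F.normalize(w, dim=-1)

@torch.no_grad()
def detect_scores(region_feats: torch.Tensor, classifier: torch.Tensor) -> torch.Tensor:
    """Test time: classification reduces to a single matrix multiply."""
    return F.normalize(region_feats, dim=-1) @ classifier.t()

# e.g. an 80-class COCO-style vocabulary, 300 proposals per image
classifier = build_classifier(torch.arange(80))
scores = detect_scores(torch.randn(300, 256), classifier)
print(scores.shape)  # torch.Size([300, 80])
```

Swapping vocabularies (e.g. LVIS minival vs. COCO val2017) only requires rebuilding the classifier matrix; the detector itself is untouched.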
