DOSOD
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
* Equal contribution, 🌟 Project lead, 📧 Corresponding author
<sup>1</sup> D-Robotics, <br> <sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, <br> <sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <br> <sup>4</sup> School of Information Science and Technology, ShanghaiTech University <br>
🔥 Updates
[2024-12-27]: Decoupled Open-Set Object Detector (DOSOD), with ultra real-time speed and superior accuracy, is released. We sincerely welcome contributions of all kinds, such as porting DOSOD to more edge-side platforms, as well as feedback and suggestions.
1. Introduction
1.1 Brief Introduction of DOSOD
<div align="center"> <img width=800px src="./assets/dosod_architecture.png"> </div>

Since YOLO-World established a new state of the art in open-vocabulary object detection, open-vocabulary detection has been applied extensively across various scenarios, and real-time open-vocabulary detection has attracted significant attention. In our paper, Decoupled Open-Set Object Detection (DOSOD) is proposed as a practical and highly efficient solution for real-time OSOD tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline, integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor converts text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. At test time, DOSOD behaves like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection.
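The joint-space alignment described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's actual code: the class name `MLPAdaptor`, the function `classify_regions`, and all dimensions (`text_dim`, `joint_dim`, `hidden_dim`) are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Maps VLM text embeddings into the joint space (illustrative sketch)."""
    def __init__(self, text_dim: int = 512, joint_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, joint_dim),
        )

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (num_classes, text_dim) from the frozen VLM text encoder
        joint = self.mlp(text_embeds)           # (num_classes, joint_dim)
        return F.normalize(joint, dim=-1)       # unit vectors for cosine scoring

def classify_regions(region_feats: torch.Tensor,
                     class_embeds: torch.Tensor,
                     logit_scale: float = 1.0) -> torch.Tensor:
    """Score class-agnostic region features against adapted text embeddings
    by cosine similarity in the joint space -- no cross-modal fusion layers."""
    region_feats = F.normalize(region_feats, dim=-1)    # (num_regions, joint_dim)
    return logit_scale * region_feats @ class_embeds.T  # (num_regions, num_classes)
```

Because the text branch only runs through the adaptor once per vocabulary, the per-image cost is a single matrix product, which is where the efficiency gain over heavier cross-modal interaction comes from.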
1.2 Repo Structure
Our implementation is based on YOLO-World; the newly added code can be found in the following scripts:
- yolo_world/models/detectors/dosod.py, yolo_world/models/dense_heads/dosod_head.py <br> These two scripts contain the core code of DOSOD.
- configs/dosod <br> This folder contains all DOSOD configs for training, evaluation, and inference.
- tools/generate_text_prompts_dosod.py <br> Generates text embeddings for DOSOD.
- tools/reparameterize_dosod.py <br> Reparameterizes the original weights with the generated text embeddings.
- tools/count_num_parameters.py <br> Simple script for counting the number of parameters.
- tools/evaluate_latency.sh <br> Shell script for latency evaluation on NVIDIA GPUs.
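As a rough illustration of what reparameterization means here: once the text embeddings for a fixed category set have been generated, they can be folded into an ordinary 1x1 convolution, so inference runs exactly like a closed-set detector with no text branch at all. The sketch below is a hypothetical simplification, not the actual logic of tools/reparameterize_dosod.py; the function name, shapes, and `logit_scale` handling are assumptions.

```python
import torch
import torch.nn.functional as F

def reparameterize_head(class_embeds: torch.Tensor,
                        logit_scale: float = 1.0) -> torch.nn.Conv2d:
    """Fold precomputed class (text) embeddings into a fixed 1x1 conv.

    class_embeds: (num_classes, joint_dim) adapted text embeddings.
    After folding, the open-set head scores regions with a plain conv,
    exactly like a closed-set detector's classification head.
    """
    embeds = F.normalize(class_embeds, dim=-1)  # unit vectors for cosine scoring
    num_classes, joint_dim = embeds.shape
    conv = torch.nn.Conv2d(joint_dim, num_classes, kernel_size=1, bias=False)
    with torch.no_grad():
        # Each output channel's 1x1 kernel is one scaled class embedding.
        conv.weight.copy_((logit_scale * embeds)[:, :, None, None])
    return conv
```

Applied to a channel-normalized feature map, this conv produces the same cosine-similarity logits as explicit region-text matching, which is why DOSOD can run at closed-set speed after reparameterization.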
2. Model Overview
Following YOLO-World, we also pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on LVIS minival and COCO val2017.
All pre-trained models are released.
2.1 Zero-shot Evaluation on LVIS minival
| Model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | Weights |
|:------------------------|:---------------|:----:|:-----------------:|:--------------:|:--------------:|:--------------:|:-----------------:|
| YOLO-Worldv1-S (repo)   | O365+GoldG     | 640  | 24.3              | 16.6           | 22.1           | 27.7           | HF Checkpoints 🤗 |
| YOLO-Worldv1-M (repo)   | O365+GoldG     | 640  | 28.6              | 19.7           | 26.6           | 31.9           | HF Checkpoints 🤗 |
| YOLO-Worldv1-L (repo)   | O365+GoldG     | 640  | 32.5              | 22.3           | 30.6           | 36.1           | HF Checkpoints 🤗 |
| YOLO-Worldv1-S (paper)  | O365+GoldG     | 640  | 26.2              | 19.1           | 23.6           | 29.8           | HF Checkpoints 🤗 |
| YOLO-Worldv1-M (paper)  | O365+GoldG     | 640  | 31.0              | 23.8           | 29.2           | 33.9           | HF Checkpoints 🤗 |
| YOLO-Worldv1-L (paper)  | O365+GoldG     | 640  | 35.0              | 27.1           | 32.8           | 38.3           | HF Checkpoints 🤗 |
| YOLO-Worldv2-S          | O365+GoldG     | 640  | 22.7              | 16.3           | 20.8           | 25.5           | HF Checkpoints 🤗 |
| YOLO-Worldv2-M          | O365+GoldG     | 640  | 30.0              | 25.0           | 27.2           | 33.4           | HF Checkpoints 🤗 |
| YOLO-Worldv2-L          | O365+GoldG     | 640  | 33.0              | 22.6           | 32.0           | 35.8           | HF Checkpoints 🤗 |
| DOSOD-S                 | O365+GoldG     | 640  | 26.7              | 19.9           | 25.1           | 29.3           | HF Checkpoints 🤗 |
| DOSOD-M                 | O365+GoldG     | 640  | 31.3              | 25.7           | 29.6           | 33.7           | HF Checkpoints 🤗 |
| DOSOD-L                 | O365+GoldG     | 640  | 34.4              | 29.1           | 32.6           | 36.6           | HF Checkpoints 🤗 |

NOTE: The results of YOLO-Worldv1 from the repo and from the paper are different.