DOSOD
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
* Equal contribution, 🌟 Project lead, 📧 Corresponding author
<sup>1</sup> D-Robotics, <br> <sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, <br> <sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <br> <sup>4</sup> School of Information Science and Technology, ShanghaiTech University <br>
🔥 Updates
[2024-12-27]: Decoupled Open-Set Object Detector (DOSOD), with ultra real-time speed and superior accuracy, is released. We sincerely welcome contributions of all kinds, such as porting DOSOD to more edge-side platforms, as well as feedback and suggestions.
1. Introduction
1.1 Brief Introduction of DOSOD
<div align="center"> <img width=800px src="./assets/dosod_architecture.png"> </div>

Since YOLO-World established a new state of the art in open-vocabulary object detection, open-vocabulary detection has been applied extensively across various scenarios, and real-time open-vocabulary detection has attracted significant attention. In our paper, Decoupled Open-Set Object Detection (DOSOD) is proposed as a practical and highly efficient solution for real-time OSOD tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline, integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor converts text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. At test time, DOSOD behaves like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection.
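The joint-space alignment described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's actual code: the class name `MLPAdaptor`, the function `classify_regions`, and all dimensions (`text_dim`, `joint_dim`, `hidden_dim`) are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Maps VLM text embeddings into the joint space (illustrative sketch)."""
    def __init__(self, text_dim: int = 512, joint_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, joint_dim),
        )

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (num_classes, text_dim) from the frozen VLM text encoder
        joint = self.mlp(text_embeds)           # (num_classes, joint_dim)
        return F.normalize(joint, dim=-1)       # unit vectors for cosine scoring

def classify_regions(region_feats: torch.Tensor,
                     class_embeds: torch.Tensor,
                     logit_scale: float = 1.0) -> torch.Tensor:
    """Score class-agnostic region features against adapted text embeddings
    by cosine similarity in the joint space -- no cross-modal fusion layers."""
    region_feats = F.normalize(region_feats, dim=-1)    # (num_regions, joint_dim)
    return logit_scale * region_feats @ class_embeds.T  # (num_regions, num_classes)
```

Because the text branch only runs through the adaptor once per vocabulary, the per-image cost is a single matrix product, which is where the efficiency gain over heavier cross-modal interaction comes from.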
1.2 Repo Structure
Our implementation is based on YOLO-World; the newly added code can be found in the following scripts:
- yolo_world/models/detectors/dosod.py, yolo_world/models/dense_heads/dosod_head.py <br> These two scripts contain the core code of DOSOD.
- configs/dosod <br> This folder contains all DOSOD configs for training, evaluation, and inference.
- tools/generate_text_prompts_dosod.py <br> Generates text embeddings for DOSOD.
- tools/reparameterize_dosod.py <br> Reparameterizes the original weights with the generated text embeddings.
- tools/count_num_parameters.py <br> Simple script for counting the number of parameters.
- tools/evaluate_latency.sh <br> Shell script for latency evaluation on NVIDIA GPUs.
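As a rough illustration of what reparameterization means here: once the text embeddings for a fixed category set have been generated, they can be folded into an ordinary 1x1 convolution, so inference runs exactly like a closed-set detector with no text branch at all. The sketch below is a hypothetical simplification, not the actual logic of tools/reparameterize_dosod.py; the function name, shapes, and `logit_scale` handling are assumptions.

```python
import torch
import torch.nn.functional as F

def reparameterize_head(class_embeds: torch.Tensor,
                        logit_scale: float = 1.0) -> torch.nn.Conv2d:
    """Fold precomputed class (text) embeddings into a fixed 1x1 conv.

    class_embeds: (num_classes, joint_dim) adapted text embeddings.
    After folding, the open-set head scores regions with a plain conv,
    exactly like a closed-set detector's classification head.
    """
    embeds = F.normalize(class_embeds, dim=-1)  # unit vectors for cosine scoring
    num_classes, joint_dim = embeds.shape
    conv = torch.nn.Conv2d(joint_dim, num_classes, kernel_size=1, bias=False)
    with torch.no_grad():
        # Each output channel's 1x1 kernel is one scaled class embedding.
        conv.weight.copy_((logit_scale * embeds)[:, :, None, None])
    return conv
```

Applied to a channel-normalized feature map, this conv produces the same cosine-similarity logits as explicit region-text matching, which is why DOSOD can run at closed-set speed after reparameterization.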
2. Model Overview
Following YOLO-World, we also pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on LVIS minival and COCO val2017.
All pre-trained models are released.
2.1 Zero-shot Evaluation on LVIS minival
| Model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | Weights |
|:------------------------|:---------------|:----:|:-----------------:|:--------------:|:--------------:|:--------------:|:-----------------:|
| YOLO-Worldv1-S (repo)   | O365+GoldG     | 640  | 24.3              | 16.6           | 22.1           | 27.7           | HF Checkpoints 🤗 |
| YOLO-Worldv1-M (repo)   | O365+GoldG     | 640  | 28.6              | 19.7           | 26.6           | 31.9           | HF Checkpoints 🤗 |
| YOLO-Worldv1-L (repo)   | O365+GoldG     | 640  | 32.5              | 22.3           | 30.6           | 36.1           | HF Checkpoints 🤗 |
| YOLO-Worldv1-S (paper)  | O365+GoldG     | 640  | 26.2              | 19.1           | 23.6           | 29.8           | HF Checkpoints 🤗 |
| YOLO-Worldv1-M (paper)  | O365+GoldG     | 640  | 31.0              | 23.8           | 29.2           | 33.9           | HF Checkpoints 🤗 |
| YOLO-Worldv1-L (paper)  | O365+GoldG     | 640  | 35.0              | 27.1           | 32.8           | 38.3           | HF Checkpoints 🤗 |
| YOLO-Worldv2-S          | O365+GoldG     | 640  | 22.7              | 16.3           | 20.8           | 25.5           | HF Checkpoints 🤗 |
| YOLO-Worldv2-M          | O365+GoldG     | 640  | 30.0              | 25.0           | 27.2           | 33.4           | HF Checkpoints 🤗 |
| YOLO-Worldv2-L          | O365+GoldG     | 640  | 33.0              | 22.6           | 32.0           | 35.8           | HF Checkpoints 🤗 |
| DOSOD-S                 | O365+GoldG     | 640  | 26.7              | 19.9           | 25.1           | 29.3           | HF Checkpoints 🤗 |
| DOSOD-M                 | O365+GoldG     | 640  | 31.3              | 25.7           | 29.6           | 33.7           | HF Checkpoints 🤗 |
| DOSOD-L                 | O365+GoldG     | 640  | 34.4              | 29.1           | 32.6           | 36.6           | HF Checkpoints 🤗 |

NOTE: The results of YOLO-Worldv1 from the repo and from the paper are different.