# YOLOE: Real-Time Seeing Anything [ICCV 2025]
Official PyTorch implementation of YOLOE. ICCV 2025.
<p align="center"> <img src="figures/comparison.svg" width=70%> <br> Comparison of performance, training cost, and inference efficiency between YOLOE (Ours) and YOLO-Worldv2 in terms of open text prompts. </p>
Ao Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding
We introduce YOLOE (ye), a highly efficient, unified, and open object detection and segmentation model that, like the human eye, sees anything under different prompt mechanisms (texts, visual inputs, and a prompt-free paradigm), with zero inference and transferring overhead compared with closed-set YOLOs.
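All three prompt mechanisms reduce open-vocabulary classification to matching region features against prompt embeddings. Below is a minimal numpy sketch of the text-prompt case; the shapes, names, and toy embeddings are illustrative only and are not the repository's API:

```python
import numpy as np

def classify_regions(region_feats, text_embeds):
    """Toy sketch: score each region feature against normalized text-prompt
    embeddings via cosine similarity and pick the best-matching class."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    scores = r @ t.T                       # (num_regions, num_prompts)
    return scores.argmax(axis=1), scores.max(axis=1)

# Three orthogonal toy "text embeddings" for 3 classes, 4-d features.
rng = np.random.default_rng(0)
prompts = np.eye(3, 4)
regions = np.stack([prompts[1] + 0.05 * rng.normal(size=4),   # near class 1
                    prompts[2] + 0.05 * rng.normal(size=4)])  # near class 2
labels, conf = classify_regions(regions, prompts)
print(labels.tolist())  # -> [1, 2]
```

Swapping the text embeddings for visual-prompt embeddings (or the built-in vocabulary in the prompt-free case) leaves this matching step unchanged, which is what makes the unified head possible.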
<!-- <p align="center"> <img src="figures/pipeline.svg" width=96%> <br> </p> --> <p align="center"> <img src="figures/visualization.svg" width=96%> <br> </p> <details> <summary> <font size="+1">Abstract</font> </summary> Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with $3\times$ less training cost and $1.4\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 $AP^b$ and 0.4 $AP^m$ gains over closed-set YOLOv8-L with nearly $4\times$ less training time. 
</details> <p></p> <p align="center"> <img src="figures/pipeline.svg" width=96%> <br> </p>

## Performance
### Zero-shot detection evaluation

- Fixed AP is reported on the LVIS `minival` set with text (T) / visual (V) prompts.
- Training time is for text prompts with detection, based on 8 Nvidia RTX 4090 GPUs.
- FPS is measured on an Nvidia T4 with TensorRT and on an iPhone 12 with CoreML, respectively.
- For training data, OG denotes Objects365v1 and GoldG.
- After re-parameterization, YOLOE becomes the corresponding closed-set YOLO, with zero inference and transferring overhead.
| Model | Size | Prompt | Params | Data | Time | FPS | $AP$ | $AP_r$ | $AP_c$ | $AP_f$ | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOE-v8-S | 640 | T / V | 12M / 13M | OG | 12.0h | 305.8 / 64.3 | 27.9 / 26.2 | 22.3 / 21.3 | 27.8 / 27.7 | 29.0 / 25.7 | T / V |
| YOLOE-v8-M | 640 | T / V | 27M / 30M | OG | 17.0h | 156.7 / 41.7 | 32.6 / 31.0 | 26.9 / 27.0 | 31.9 / 31.7 | 34.4 / 31.1 | T / V |
| YOLOE-v8-L | 640 | T / V | 45M / 50M | OG | 22.5h | 102.5 / 27.2 | 35.9 / 34.2 | 33.2 / 33.2 | 34.8 / 34.6 | 37.3 / 34.1 | T / V |
| YOLOE-11-S | 640 | T / V | 10M / 12M | OG | 13.0h | 301.2 / 73.3 | 27.5 / 26.3 | 21.4 / 22.5 | 26.8 / 27.1 | 29.3 / 26.4 | T / V |
| YOLOE-11-M | 640 | T / V | 21M / 27M | OG | 18.5h | 168.3 / 39.2 | 33.0 / 31.4 | 26.9 / 27.1 | 32.5 / 31.9 | 34.5 / 31.7 | T / V |
| YOLOE-11-L | 640 | T / V | 26M / 32M | OG | 23.5h | 130.5 / 35.1 | 35.2 / 33.7 | 29.1 / 28.1 | 35.0 / 34.6 | 36.5 / 33.8 | T / V |
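The re-parameterization noted above can be made concrete: because the RepRTA auxiliary network only refines the text embeddings, it can be folded into them once, offline, so the deployed head is a plain fixed classifier with no extra runtime cost. A toy numpy sketch, where a single linear layer stands in for the lightweight auxiliary network (all shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_cls, n_reg = 8, 5, 10

text_embed = rng.normal(size=(n_cls, d))   # frozen pretrained text embeddings
aux = rng.normal(size=(d, d)) * 0.1        # toy stand-in for the auxiliary network
feats = rng.normal(size=(n_reg, d))        # region features from the detector

# Training-time path: refine the embeddings with the auxiliary network
# on every forward pass.
logits_train = feats @ (text_embed @ aux).T

# Re-parameterization: fold the auxiliary network into the embeddings once,
# offline. Inference then uses an ordinary fixed classification matrix,
# identical in cost to a closed-set YOLO head.
merged = text_embed @ aux
logits_infer = feats @ merged.T

assert np.allclose(logits_train, logits_infer)   # numerically identical
```

The equivalence holds exactly because the refinement is applied to the embeddings, not the image features, so it is independent of the input at inference time.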
### Zero-shot segmentation evaluation

- The model is the same as in the zero-shot detection evaluation above.
- Standard AP<sup>m</sup> is reported on the LVIS `val` set with text (T) / visual (V) prompts.
| Model | Size | Prompt | $AP^m$ | $AP_r^m$ | $AP_c^m$ | $AP_f^m$ |
|---|---|---|---|---|---|---|
| YOLOE-v8-S | 640 | T / V | 17.7 / 16.8 | 15.5 / 13.5 | 16.3 / 16.7 | 20.3 / 18.2 |
| YOLOE-v8-M | 640 | T / V | 20.8 / 20.3 | 17.2 / 17.0 | 19.2 / 20.1 | 24.2 / 22.0 |
| YOLOE-v8-L | 640 | T / V | 23.5 / 22.0 | 21.9 / 16.5 | 21.6 / 22.1 | 26.4 / 24.3 |
| YOLOE-11-S | 640 | T / V | 17.6 / 17.1 | 16.1 / 14.4 | 15.6 / 16.8 | 20.5 / 18.6 |
| YOLOE-11-M | 640 | T / V | 21.1 / 21.0 | 17.2 / 18.3 | 19.6 / 20.6 | 24.4 / 22.6 |
| YOLOE-11-L | 640 | T / V | 22.6 / 22.5 | 19.3 / 20.5 | 20.9 / 21.7 | 26.0 / 24.1 |
### Prompt-free evaluation

- The model is the same as in the zero-shot detection evaluation above, except for the specialized prompt embedding.
- Fixed AP is reported on the LVIS `minival` set; FPS is measured on an Nvidia T4 GPU with PyTorch.
| Model | Size | Params | $AP$ | $AP_r$ | $AP_c$ | $AP_f$ | FPS | Log |
|---|---|---|---|---|---|---|---|---|
| YOLOE-v8-S | 640 | 13M | 21.0 | 19.1 | 21.3 | 21.0 | 95.8 | PF |
| YOLOE-v8-M | 640 | 29M | 24.7 | 22.2 | 24.5 | 25.3 | 45.9 | PF |
| YOLOE-v8-L | 640 | 47M | 27.2 | 23.5 | 27.0 | 28.0 | 25.3 | PF |
| YOLOE-11-S | 640 | 11M | 20.6 | 18.4 | 20.2 | 21.3 | 93.0 | PF |
| YOLOE-11-M | 640 | 24M | 25.5 | 21.6 | 25.5 | 26.1 | 42.5 | PF |
| YOLOE-11-L | 640 | 29M | 26.3 | 22.7 | 25.8 | 27.5 | 34.9 | PF |
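The LRPC strategy behind these prompt-free numbers avoids scoring every anchor against the full built-in vocabulary: a specialized embedding first flags anchors likely to contain an object, and only those are matched against the vocabulary. A toy numpy sketch of the idea (the vocabulary size, threshold, and shapes here are illustrative, not the repository's values):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = rng.normal(size=(1000, 16))      # built-in large-vocabulary embeddings
vocab /= np.linalg.norm(vocab, axis=1, keepdims=True)

anchors = rng.normal(size=(400, 16))     # per-anchor region embeddings
objectness = rng.uniform(size=400)       # specialized embedding's object score

# Lazy contrast: only anchors that likely contain an object are matched
# against the vocabulary, instead of scoring all anchors x all entries.
keep = objectness > 0.8
kept = anchors[keep] / np.linalg.norm(anchors[keep], axis=1, keepdims=True)
classes = (kept @ vocab.T).argmax(axis=1)   # nearest vocabulary entry

print(keep.sum(), "of", len(anchors), "anchors matched against the vocabulary")
```

Because the expensive anchor-vocabulary matching runs only on the surviving anchors, the cost scales with the number of detected objects rather than with the vocabulary times the anchor count, and no language model is needed at inference.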
### Downstream transfer on COCO

- During transferring, YOLOE-v8 / YOLOE-11 is exactly the same as YOLOv8 / YOLO11.
- For linear probing, only the last conv in the classification head is trainable.
- For full tuning, all parameters are trainable.
| Model | Size | Epochs | $AP^b$ | $AP^b_{50}$ | $AP^b_{75}$ | $AP^m$ | $AP^m_{50}$ | $AP^m_{75}$ | Log |
|---|---|---|---|---|---|---|---|---|---|
| Linear probing | | | | | | | | | |
| YOLOE-v8-S | 640 | 10 | 35.6 | 51.5 | 38.9 | 30.3 | 48.2 | 32.0 | LP |
| YOLOE-v8-M | 640 | 10 | 42.2 | 59.2 | 46.3 | 35.5 | 55.6 | 37.7 | LP |
| YOLOE-v8-L | 640 | 10 | 45.4 | 63.3 | 50.0 | 38.3 | 59.6 | 40.8 | LP |
| YOLOE-11-S | 640 | 10 | 37.0 | 52.9 | 40.4 | 31.5 | 49.7 | 33.5 | LP |
| YOLOE-11-M | 640 | 10 | 43.1 | 60.6 | 47.4 | 36.5 | 56.9 | 39.0 | LP |
| YOLOE-11-L | 640 | 10 | 45.1 | 62.8 | 49.5 | 38.0 | 59.2 | 40.6 | LP |
| Full tuning | | | | | | | | | |
| YOLOE-v8-S | 640 | 160 | 45.0 | 61.6 | 49.1 | 36.7 | 58.3 | 39.1 | FT |
| [YOLOE-v8-M](https:
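The difference between the two transfer settings can be illustrated with a tiny numpy model: in linear probing the "backbone" weights stay frozen and only the final classifier receives gradient updates, whereas full tuning would update everything. This is a conceptual sketch under toy data, not the repository's training code:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(64, 6))
y = (X.sum(axis=1) > 0).astype(float)       # toy binary labels

W_backbone = rng.normal(size=(6, 6))        # "pretrained" weights, frozen
W0 = W_backbone.copy()                      # snapshot to verify freezing
w_head = np.zeros(6)                        # the only trainable part

def forward(X):
    h = np.maximum(X @ W_backbone, 0)       # frozen feature extractor (ReLU)
    return 1 / (1 + np.exp(-(h @ w_head)))  # trainable linear classifier

# Linear probing: gradient steps touch w_head only; W_backbone never changes.
for _ in range(200):
    p = forward(X)
    h = np.maximum(X @ W_backbone, 0)
    w_head -= 0.5 * h.T @ (p - y) / len(y)  # logistic-loss gradient on the head

acc = ((forward(X) > 0.5) == y).mean()
assert np.array_equal(W_backbone, W0)       # backbone untouched
```

Freezing everything but the head keeps transfer cheap (10 epochs in the table above versus 160 for full tuning) at the cost of some accuracy, since the features themselves cannot adapt to the target data.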
