
UFO

[NeurIPS 2025 Spotlight 🔥] Official implementation of 🛸 "UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface"


Unifying fine-grained perception into MLLMs without task decoders: 16 tokens enable precise segmentation.

<div align="center"> <img src="assets/demo2.png" width="800"/> </div>

This repo is the official implementation of the paper 🛸 UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface, as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and dependent on only minimal dependencies.

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng†, Liwei Wang†

  • Primary contact: Hao Tang (tanghao@stu.pku.edu.cn)

📣 News

  • [2025-10-01] We release checkpoints of UFO-InternVL2.5-8B in this repo.
  • [2025-09-19] 🔥 UFO is accepted by NeurIPS 2025 as a Spotlight!
  • [2025-03-12] We release a separate repo for UFO-InternVL2-8B and add REC inference to the InternVL repo.
  • [2025-03-04] 🚀 Training and inference code is released.
  • [2025-03-03] 👀 UFO is released on arXiv.

Overview

👀 Todo

  • [x] Release the arXiv version.
  • [x] Release code and models of multi-task training on UFO-ViT.
  • [x] Release code and models of fine-grained instruction tuning on UFO-InternVL2.5-8B and UFO-LLaVA-1.5-7B.
  • [x] Release full code and models of multi-task training on UFO-InternVL2.5-8B.

🤔 Introduction

Previous efforts to introduce fine-grained perception tasks into MLLMs rely heavily on task-specific decoders or suboptimal formats (e.g., polygons), impeding unified visual modeling. To overcome this, we propose UFO:

  • 😮 We reformulate segmentation as embedding retrieval: the mask token embedding computes dot-product similarity with the image features, and high-similarity positions are retrieved to form the mask.

  • 🚀 We are the first to explore the image representation capabilities of MLLMs for this purpose. We argue that since MLLMs excel at understanding, the mask information is already present in the image features and only needs to be retrieved.

  • 🤗 Fully aligned with the open-ended language interface: UFO unifies detection and segmentation through the open-ended language interface without any additional decoders, enabling seamless integration with MLLMs.

  • 🔥 Competitive performance: UFO surpasses GiT, a text-based generalist model, by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K. It also matches or exceeds decoder-based methods on various grounding tasks, eliminating the need for task-specific decoders.

<div align="center"> <img src="assets/Figure1.png" width="800"/> </div>
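The segmentation-as-retrieval idea above can be sketched in a few lines. This is a minimal NumPy illustration, not the repo's implementation: the function name, tensor shapes, and the zero-threshold rule are assumptions made for clarity.

```python
import numpy as np

def retrieve_mask(mask_token: np.ndarray,
                  image_features: np.ndarray,
                  threshold: float = 0.0) -> np.ndarray:
    """Segmentation as embedding retrieval (illustrative sketch).

    The mask token embedding is compared against every spatial image
    feature by dot product; positions whose similarity exceeds the
    threshold are retrieved as the predicted mask.
    """
    # mask_token: (C,), image_features: (H, W, C)
    sim = np.einsum("hwc,c->hw", image_features, mask_token)
    return sim > threshold  # boolean (H, W) mask

# Toy example: a 4x4 feature map with 8-dim embeddings.
feats = np.random.randn(4, 4, 8).astype(np.float32)
token = np.random.randn(8).astype(np.float32)
mask = retrieve_mask(token, feats)
print(mask.shape)  # (4, 4)
```

Because the mask is read directly off the image features, no pixel decoder is needed; the MLLM only has to emit mask token embeddings through its ordinary language interface.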

🚀 Main Results

Single-Task Benchmark

| Model | Params | Metric | Performance | ckpt | config |
|---------|---------|---------|--------|--------|---------|
| UFO-ViT-B<sub>detection</sub> | 131M | mAP | 47.8 | ckpt | config |
| UFO-ViT-B<sub>insseg</sub> | 131M | mAP | 42.6 | ckpt | config |
| UFO-ViT-B<sub>semseg</sub> | 131M | mIoU | 49.5 | ckpt | config |
| UFO-ViT-B<sub>caption</sub> | 131M | BLEU-4 | 34.2 | ckpt | config |
| UFO-ViT-B<sub>grounding</sub> | 131M | Acc@0.5 | 83.6 | ckpt | config |

Multi-Task Benchmark

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|
| UFO-ViT-B<sub>multi-task</sub> | 131M | 48.3 | 43.5 | 50.2 | 35.3 | 85.8 | ckpt | config |
| UFO-ViT-L<sub>multi-task</sub> | 387M | 52.9 | 47.3 | 54.0 | 35.9 | 88.5 | ckpt | config |
| UFO-ViT-H<sub>multi-task</sub> | 756M | 54.1 | 48.1 | 55.7 | 37.6 | 89.2 | ckpt | config |

Task Synergy in Multi-Tasking Training

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding |
|---------|---------|---------|--------|--------|---------|---------|
| UFO-B<sub>single-task</sub> | 131M | 47.8 | 42.6 | 49.5 | 34.2 | 83.6 |
| UFO-B<sub>multi-task</sub> | 131M | 48.3 | 43.5 | 50.2 | 35.3 | 85.8 |
| Improvement | | +0.5 | +0.9 | +0.7 | +1.1 | +2.2 |

MLLM Performance on Multi-Task Benchmark

UFO-InternVL2.5-8B:

| Resolution | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | config |
|---------|---------|---------|--------|--------|---------|---------|---------|
| 448x448 | 44.0 | 37.4 | 53.9 | 39.6 | 90.4 | ckpt | config |
| 896x896 | 50.9 | 43.6 | 54.6 | - | - | ckpt | config |
| 1344x1344 | 51.9 | 45.2 | - | - | - | ckpt | config |

Visual Grounding

RefCOCO Validation Set

| Model | REC | RES | ckpt | config |
|---------|---------|---------|--------|--------|
| UFO-LLaVA-1.5-7B | 89.9 | 76.2 | ckpt | config |
| UFO-LLaVA-1.5-7B (ft) | 90.8 | 77.2 | ckpt | config |
| UFO-InternVL2.5-8B | 91.8 | 80.0 | ckpt | config |
| UFO-InternVL2.5-8B (ft) | 93.1 | 81.0 | ckpt | config |

Reasoning Segmentation

| Model | Overall | Short Query | Long Query | ckpt | config |
|---------|---------|---------|--------|--------|---------|
| UFO-LLaVA-1.5-7B | 53.8 | 40.1 | 58.2 | ckpt | config |
| UFO-LLaVA-1.5-7B (ft) | 58.0 | 46.3 | 61.7 | ckpt | config |
| UFO-InternVL2.5-8B | 60.0 | 48.7 | 63.6 | | |
