ESAM
[ICLR 2025, Oral] EmbodiedSAM: Online Segment Any 3D Thing in Real Time
EmbodiedSAM: Online Segment Any 3D Thing in Real Time
Paper | Project Page | Video | 中文解读
Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
In this work, we present ESAM, an efficient framework that leverages vision foundation models for <b>online</b>, <b>real-time</b>, <b>fine-grained</b>, <b>generalized</b> and <b>open-vocabulary</b> 3D instance segmentation.
News
- [2025/4/03]: Custom dataset is supported! Users can run EmbodiedSAM on their own data following here.
- [2025/2/11]: EmbodiedSAM is selected as an <b>oral presentation</b> in ICLR 2025!
- [2025/1/23]: EmbodiedSAM is accepted to ICLR 2025 with a top 2% rating!
- [2024/8/22]: Code and demo released.
Demo
Real-world:

Bedroom:
<img src="./assets/demo2.gif" width="450" />

Office:
<img src="./assets/demo1.gif" width="450" />

The demo GIFs are fairly large, so please allow a moment for them to load. Visit the project page for more complete demos and detailed introductions.
Method
Method Pipeline:

Getting Started
For environment setup and dataset preparation, please follow:
For training and evaluation, please follow:
For visualization on the provided datasets or your own data, please follow:
Main Results
We provide checkpoints for quick reproduction of the results reported in the paper. In addition to Tsinghua Cloud, we have also uploaded the checkpoints and processed data to HuggingFace. Click here for more details.
Class-agnostic 3D instance segmentation results on ScanNet200 dataset:
| Method | Type | VFM | AP | AP@50 | AP@25 | Speed(ms) | Downloads |
|:--------:|:-------:|:-----------:|:----:|:-----:|:-----:|:---------:|:---------:|
| SAMPro3D | Offline | SAM | 18.0 | 32.8 | 56.1 | -- | -- |
| SAI3D | Offline | SemanticSAM | 30.8 | 50.5 | 70.6 | -- | -- |
| SAM3D | Online | SAM | 20.6 | 35.7 | 55.5 | 1369+1518 | -- |
| ESAM | Online | SAM | 42.2 | 63.7 | 79.6 | 1369+80 | model |
| ESAM-E | Online | FastSAM | 43.4 | 65.4 | 80.9 | 20+80 | model |
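The Speed(ms) column lists two terms per method. A minimal sketch of how those figures translate into throughput, assuming the first term is the time spent in the vision foundation model and the second is the remaining per-frame time (the README does not spell this out, and `SPEED_MS` and `frames_per_second` are illustrative names, not part of the ESAM codebase):

```python
# Per-frame latency in milliseconds, copied from the Speed(ms) column above.
# Assumption: each entry is (VFM time, remaining per-frame time).
SPEED_MS = {
    "SAM3D": (1369, 1518),
    "ESAM": (1369, 80),
    "ESAM-E": (20, 80),
}


def frames_per_second(vfm_ms: float, rest_ms: float) -> float:
    """Convert a two-part per-frame latency into frames per second."""
    return 1000.0 / (vfm_ms + rest_ms)


for name, (vfm_ms, rest_ms) in SPEED_MS.items():
    total_ms = vfm_ms + rest_ms
    fps = frames_per_second(vfm_ms, rest_ms)
    print(f"{name}: {total_ms} ms/frame, {fps:.2f} FPS")
```

Under this reading, ESAM-E comes to 100 ms per frame, i.e. 10 FPS, which agrees with the FPS value reported for ESAM-E in the ScanNet table.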
Dataset transfer results from ScanNet200 to SceneNN and 3RScan:
<table class="tg"><thead>
<tr> <th class="tg-b2st" rowspan="2">Method</th> <th class="tg-b2st" rowspan="2">Type </th> <th class="tg-b2st" colspan="3">ScanNet200-->SceneNN</th> <th class="tg-b2st" colspan="3">ScanNet200-->3RScan</th> </tr>
<tr> <th class="tg-wa1i">AP</th> <th class="tg-wa1i">AP@50</th> <th class="tg-wa1i">AP@25</th> <th class="tg-wa1i">AP</th> <th class="tg-wa1i">AP@50</th> <th class="tg-wa1i">AP@25</th> </tr></thead>
<tbody>
<tr> <td class="tg-nrix">SAMPro3D</td> <td class="tg-nrix">Offline</td> <td class="tg-nrix">12.6</td> <td class="tg-nrix">25.8</td> <td class="tg-nrix">53.2</td> <td class="tg-nrix">3.9</td> <td class="tg-nrix">8.0</td> <td class="tg-nrix">21.0</td> </tr>
<tr> <td class="tg-nrix">SAI3D</td> <td class="tg-nrix">Offline</td> <td class="tg-nrix">18.6</td> <td class="tg-nrix">34.7</td> <td class="tg-nrix">65.7</td> <td class="tg-nrix">5.4</td> <td class="tg-nrix">11.8</td> <td class="tg-nrix">27.4</td> </tr>
<tr> <td class="tg-nrix">SAM3D</td> <td class="tg-nrix">Online</td> <td class="tg-nrix">15.1</td> <td class="tg-nrix">30.0</td> <td class="tg-nrix">51.8</td> <td class="tg-nrix">6.2</td> <td class="tg-nrix">13.0</td> <td class="tg-nrix">33.9</td> </tr>
<tr> <td class="tg-nrix">ESAM</td> <td class="tg-nrix">Online</td> <td class="tg-nrix"><b>28.8</b></td> <td class="tg-nrix"><b>52.2</b></td> <td class="tg-nrix">69.3</td> <td class="tg-nrix"><b>14.1</b></td> <td class="tg-nrix"><b>31.2</b></td> <td class="tg-nrix"><b>59.6</b></td> </tr>
<tr> <td class="tg-nrix">ESAM-E</td> <td class="tg-nrix">Online</td> <td class="tg-nrix">28.6</td> <td class="tg-nrix">50.4</td> <td class="tg-nrix"><b>71.0</b></td> <td class="tg-nrix">13.9</td> <td class="tg-nrix">29.4</td> <td class="tg-nrix">58.8</td> </tr>
</tbody></table>

3D instance segmentation results on ScanNet dataset:
<table class="tg"><thead>
<tr> <th class="tg-gabo" rowspan="2">Method</th> <th class="tg-gabo" rowspan="2">Type</th> <th class="tg-gabo" colspan="3">ScanNet</th> <th class="tg-gabo" colspan="3">SceneNN</th> <th class="tg-gabo" rowspan="2">FPS</th> <th class="tg-gabo" rowspan="2">Download</th> </tr>
<tr> <th class="tg-uzvj">AP</th> <th class="tg-uzvj">AP@50</th> <th class="tg-uzvj">AP@25</th> <th class="tg-uzvj">AP</th> <th class="tg-uzvj">AP@50</th> <th class="tg-uzvj">AP@25</th> </tr></thead>
<tbody>
<tr> <td class="tg-9wq8"><a href=https://github.com/SamsungLabs/td3d>TD3D</a></td> <td class="tg-9wq8">offline</td> <td class="tg-9wq8">46.2</td> <td class="tg-9wq8">71.1</td> <td class="tg-9wq8">81.3</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> </tr>
<tr> <td class="tg-9wq8"><a href=https://github.com/oneformer3d/oneformer3d>Oneformer3D</a></td> <td class="tg-9wq8">offline</td> <td class="tg-9wq8">59.3</td> <td class="tg-9wq8">78.8</td> <td class="tg-9wq8">86.7</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> </tr>
<tr> <td class="tg-9wq8"><a href=https://github.com/THU-luvision/INS-Conv>INS-Conv</a></td> <td class="tg-9wq8">online</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">57.4</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> <td class="tg-9wq8">--</td> </tr>
<tr> <td class="tg-9wq8"><a href=https://github.com/xuxw98/Online3D>TD3D-MA</a></td> <td class="tg-9wq8">online</td> <td class="tg-9wq8">39.0</td> <td class="tg-9wq8">60.5</td> <td class="tg-9wq8">71.3</td> <td class="tg-9wq8">26.0</td> <td class="tg-9wq8">42.8</td> <td class="tg-9wq8">59.2</td> <td class="tg-9wq8">3.5</td> <td class="tg-9wq8">--</td> </tr>
<tr> <td class="tg-9wq8">ESAM-E</td> <td class="tg-9wq8">online</td> <td class="tg-9wq8">41.6</td> <td class="tg-9wq8">60.1</td> <td class="tg-9wq8">75.6</td> <td class="tg-9wq8">27.5</td> <td class="tg-9wq8">48.7</td> <td class="tg-uzvj"><b>64.6</b></td> <td class="tg-uzvj"><b>10</b></td> <td class="tg-9wq8"><a href=https://cloud.tsinghua.edu.cn/f/1eeff1152a5f4d4989da/?dl=1>model</a></td> </tr>
<tr> <td class="tg-nrix">ESAM-E+FF</td> <td class="tg-nrix">online</td> <td class="tg-wa1i"><b>42.6</b></td> <td class="tg-wa1i"><b>61.9</b></td> <td class="tg-wa1i"><b>77.1</b></td> <td class="tg-wa1i"><b>33.3</b></td> <td class="tg-wa1i"><b>53.6</b></td> <td class="tg-nrix">62.5</td> <td class="tg-nrix">9.8</td> <td class="tg-nrix"><a href=https://cloud.tsinghua.edu.cn/f/4c2dd1559e854f48be76/?dl=1>model</a></td> </tr>
</tbody></table>

Open-Vocabulary 3D instance segmentation results on ScanNet200 dataset:

| Method | AP | AP@50 | AP@25 |
|:------:|:----:|:-----:|:-----:|
| SAI3D | 9.6 | 14.7 | 19.0 |
| ESAM | 13.7 | 19.2 | 23.9 |
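The AP@50 and AP@25 columns in the tables above differ only in the IoU threshold a predicted instance must reach to count as a match. The README does not define the metrics, so the following is a minimal sketch under the usual convention that AP@50 uses an IoU threshold of 0.5 and AP@25 a threshold of 0.25. It shows only the matching criterion, not the full AP computation (which also ranks predictions by confidence). `mask_iou` and the toy point indices are illustrative and not taken from the ESAM codebase or any dataset:

```python
from __future__ import annotations


def mask_iou(pred: set[int], gt: set[int]) -> float:
    """IoU between two instance masks given as sets of point indices."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


# Toy example with hypothetical point indices.
gt = set(range(0, 100))     # ground-truth instance covers points 0..99
pred = set(range(60, 120))  # predicted instance covers points 60..119

iou = mask_iou(pred, gt)    # 40 shared points / 120 points in the union
print(f"IoU = {iou:.3f}")
print("counts as a match at IoU >= 0.25:", iou >= 0.25)
print("counts as a match at IoU >= 0.50:", iou >= 0.50)
```

In this toy case the prediction overlaps one third of the union, so it would count toward AP@25 but not toward AP@50, which is why AP@25 values are never lower than AP@50 values for the same method.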
TODO List
- [x] Release code and checkpoints.
- [x] Release the demo code to directly run ESAM on streaming RGB-D video.
Contributors
Both students below contributed equally and the order is determined by random draw.
- Xiuwei Xu
- Huangxing Chen
Both advised by Jiwen Lu.
Acknowledgement
We are grateful for the flexible codebases of Oneformer3D and Online3D, as well as the valuable datasets provided by ScanNet, SceneNN and 3RScan.
Citation
@article{xu2024esam,
title={EmbodiedSAM: Online Segment Any 3D Thing in Real Time},
author={Xiuwei Xu and Huangxing Chen and Linqing Zhao and Ziwei Wang and Jie Zhou and Jiwen Lu},
}
