EmbodiedScan
[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
Install / Use
/learn @InternRobotics/EmbodiedScanREADME
This repository contains EmbodiedScan-series works for holistic multi-modal 3D perception, currently including EmbodiedScan & MMScan.
<div id="top" align="center"> </div>🤖 Demo
<!-- <div style="text-align: center;"> <img src="assets/demo_fig.png" alt="Dialogue_Teaser" width=100% > </div> --> <!-- contents with emoji -->📋 Contents
🏠 About
<!--  --> <div style="text-align: center;"> <img src="assets/teaser.png" alt="Dialogue_Teaser" width=100% > </div> In the realm of computer vision and robotics, embodied agents are expected to explore their environment and carry out human instructions. This necessitates the ability to fully understand 3D scenes given their first-person observations and contextualize them into language for interaction. However, traditional research focuses more on scene-level input and output setups from a global view. To address the gap, we introduce <b>EmbodiedScan, a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding.</b> It encompasses over <b>5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which partially align with LVIS, and dense semantic occupancy with 80 common categories.</b> Building upon this database, we introduce a baseline framework named <b>Embodied Perceptron</b>. It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities, both within the two series of benchmarks we set up, i.e., fundamental 3D perception tasks and language-grounded tasks, and <b>in the wild</b>.🔥 News
- [2025-01] We are delighted to present the official release of MMScan-devkit, which encompasses a suite of data processing utilities, benchmark evaluation tools, and adaptations of some models for the MMScan benchmarks. We invite you to explore these resources and welcome any feedback or questions you may have!
- [2024-09] We are pleased to announce the release of EmbodiedScan v2 beta, with original annotations on newly added ~5k scans from ARKitScenes and the beta version of MMScan's annotations on the original 5k scans. Fill in the form to apply for downloading. Welcome for any feedback!
- [2024-08] We preliminarily release the sample data of MMScan and the full release will be ready with ARKitScenes' annotations this month, which will be announced via emails to the community. Please stay tuned!
- [2024-06] The report of our follow-up work with the most-ever hierarchical grounded language annotations, MMScan, has been released. Welcome to talk with us about EmbodiedScan and MMScan at Seattle, CVPR 2024!
- [2024-04] We release all the baselines with pretrained models and logs. Welcome to try and play with them on our demo data! Note that we rename some keys in the multi-view 3D detection and visual grounding model. Please re-download the pretrained models if you just use our code for inference.
- [2024-03] The challenge test server is also online here. Looking forward to your strong submissions!
- [2024-03] We first release the data and baselines for the challenge. Please fill in the form to apply for downloading the data and try our baselines. Welcome any feedback!
- [2024-02] We will co-organize Autonomous Grand Challenge in CVPR 2024. Welcome to try the Multi-View 3D Visual Grounding track! We will release more details about the challenge with the baseline after the Chinese New Year.
- [2023-12] We release the paper of EmbodiedScan. Please check the webpage and view our demos!
📚 Getting Started
Installation
We test our codes under the following environment:
- Ubuntu 20.04
- NVIDIA Driver: 525.147.05
- CUDA 12.0
- Python 3.8.18
- PyTorch 1.11.0+cu113
- PyTorch3D 0.7.2
- Clone this repository.
git clone https://github.com/OpenRobotLab/EmbodiedScan.git
cd EmbodiedScan
- Create an environment and install PyTorch.
conda create -n embodiedscan python=3.8 -y # pytorch3d needs python>3.7
conda activate embodiedscan
# Install PyTorch, for example, install PyTorch 1.11.0 for CUDA 11.3
# For more information, please refer to https://pytorch.org/get-started/locally/
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
- Install EmbodiedScan.
# We plan to make EmbodiedScan easier to install by "pip install EmbodiedScan".
# Please stay tuned for the future official release.
# Make sure you are under ./EmbodiedScan/
# This script will install the dependencies and EmbodiedScan package automatically.
# use [python install.py run] to install only the execution dependencies
# use [python install.py visual] to install only the visualization dependencies
python install.py all # install all the dependencies
Note: The automatic installation script make each step a subprocess and the related messages are only printed when the subprocess is finished or killed. Therefore, it is normal to seemingly hang when installing heavier packages, such as Mink Engine and PyTorch3D.
BTW, from our experience, it is easier to encounter problems when installing these two packages. Feel free to post your questions or suggestions during the installation procedure.
Data Preparation
Please refer to the guide for downloading and organization.
Tutorial
We provide a simple tutorial here as a guideline for the basic analysis and visualization of our dataset. Welcome to try and post your suggestions!
Demo Inference
We provide a demo for running EmbodiedScan's model on a sample scan. Please download the raw data from Google Drive or BaiduYun and refer to the notebook for more details.
📦 Model and Benchmark
Model Overview
<p align="center"> <img src="assets/framework.png" align="center" width="100%"> </p> Embodied Perceptron accepts RGB-D sequence with any number of views along with texts as multi-modal input. It uses classical encoders to extract features for each modality and adopts dense and isomorphic sparse fusion with corresponding decoders for different predictions. The 3D features integrated with the text feature can be further used for language-grounded understanding. <!-- #### Pipeline Flow <video src="assets/scannet_long_demo.mp4" controls> </video> #### Multi-objects Interaction <video src="assets/multiobj_multistep_1.mp4" controls> </video> <video src="assets/multiobj_multistep_2.mp4" controls> </video> #### Diverse Interactions with the Same Object <video src="assets/multistep_sit_demo.mp4" controls> </video> <video src="assets/multistep_bed_demo.mp4" controls> </video> #### ''Multi-agent'' Interaction Planned by LLMs <video src="assets/scannet_two_bed_demo.mp4" controls> </video> -->Training and Evaluation
We provide configs for different tasks here and you can run the train and test script in the tools folder for training and inference. For example, to train a multi-view 3D detection model with pytorch, just run:
# Single GPU training
python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet
# Multiple GPU training
python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet --launcher="pytorch"
Or on the cluster with multiple machines, run the script with the slurm launcher following the sample script provided here.
NOTE: To run the multi-view 3D grounding experiments, please first download the 3D detection pretrained model to accelerate its training procedure. After downloading the detection checkpoint, please check the path used in the config, for example, the load_from here, is correct.
To inference and evaluate the model (e.g., the checkpoint work_dirs/mv-3ddet/epoch_12.pth), just run the test script:
# Single GPU testing
python tools/test.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py work_dirs/mv-3ddet/epoch_12.pth
# Multiple GPU testing
python tools/test.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py work_dirs/mv-3ddet/epoch_12.pth --launcher="pytorch"
Using Visualizer during inference
We provide EmbodiedScanBaseVisualizer to visualize the output of models during inference. Please refer to the [gui
Related Skills
YC-Killer
2.7kA library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.
best-practices-researcher
The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app
research_rules
Research & Verification Rules Quote Verification Protocol Primary Task "Make sure that the quote is relevant to the chapter and so you we want to make sure that we want to have it identifie
groundhog
398Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).

