SORT3D
SORT3D, an LLM-based object-centric grounding and indoor navigation system employing a spatial reasoning toolbox and state-of-the-art 2D VLMs for perception.
We propose SORT3D, an LLM-based object-centric grounding and indoor navigation system employing a spatial reasoning toolbox and state-of-the-art 2D VLMs for perception. The toolbox can interpret both direct and indirect statements about spatial relations, using an LLM for high-level reasoning and guiding an autonomous robot through the environment. SORT3D achieves the best zero-shot performance on spatial reasoning benchmarks. To the best of our knowledge, this is the first implementation of a general spatial relation toolbox for autonomous vision-language navigation that is fully integrated into real-robot systems.
<div align="center"><img src="media/diagram.png" alt="SORT3D Diagram" width="99%"></div>
https://github.com/user-attachments/assets/20865dc0-1ffc-4d72-9975-508687dbbe76
This repository is set up to run both grounding evaluation on the ReferIt3D and IRef-VLA benchmarks and online navigation on both real robots and the provided simulated environments. We also provide a dataset of Scannet object crops and captions generated using our pipeline.
Updates
- [2025-06] SORT3D is accepted to IROS 2025!
- [2025-03] We release SORT3D for offline grounding and online object-centric navigation.
Table of Contents
- Repository Structure
- Data
- System Requirements
- SORT3D-Bench: Setup
- SORT3D-Bench: Usage
- SORT3D-Nav: Setup
- 0) Cloning Repo and Recommended Installation Method
- 1) Docker Installation (Recommended)
- 2) Pulling and Preparing Docker Image
- 3a) Building ROS Humble System with Wheelchair Simulator
- 3b) Building ROS Noetic System with Wheelchair Simulator (Ubuntu 22.04)
- 3c) Building ROS Humble System with Mecanum Simulator
- (Optional) Installing ROS Humble System Dependencies Without Docker
- (Optional) Installing ROS Noetic System Dependencies Without Docker
- SORT3D-Nav: Usage
- Troubleshooting
- Citation
Repository Structure
SORT3D has two major versions:
- SORT3D-Bench: The version of SORT3D used to run the ReferIt3D and the IRef-VLA benchmarks.
- SORT3D-Nav: The version of SORT3D used to run navigation on our robot platforms, built on top of our base autonomy stack. SORT3D is deployed on two research platforms:
- Our wheelchair-based robot (wheelchair), for which we have both ROS Noetic and ROS Humble versions.
- Our mecanum-wheeled robot (mecanum), for which we have a ROS Humble version.
| Platform | ROS Version | Branch | Simulation Available | Live Demo Available (Using ROS Bag) | Ground Truth Semantics Available | Semantic Mapping Module Available |
|---|---|---|---|---|---|---|
| Benchmark | - | humble-wheelchair | ☑️ | - | ☑️ | - |
| Wheelchair | Noetic | noetic-wheelchair | ☑️ | ☑️ | ☑️ | ☑️ |
| Wheelchair | Humble | humble-wheelchair | ☑️ | ☑️ | ☑️ | ☑️ |
| Mecanum | Humble | humble-mecanum | ☑️ | ☑️ | ☑️ | ☑️ |
Data
Datasets for SORT3D-Bench
To run SORT3D-Bench, ensure the following three datasets are downloaded and unzipped:
- **Object Captions Dataset**: For our benchmark, we have pregenerated 2D object crops and captions using our captioning system and Qwen2.5-VL. To download, first install boto3 and tqdm:

  ```shell
  pip install boto3 tqdm
  ```

  Then run

  ```shell
  python data/download_crops_dataset.py --download_path data
  ```

  The data will be downloaded as a zip file in `data/`. Unzip the file directly into `data`; the path to the unzipped folder should be `data/captions`.

- **IRef-VLA Scannet**: We use the processed point clouds in IRef-VLA for our benchmark. Follow the instructions in the repo and download only the Scannet subset of the data:

  ```shell
  python download_dataset.py --download_path data/IRef-VLA --subset scannet
  ```

  Afterwards, unzip `Scannet.zip` into `data/IRef-VLA`. The folder structure should be `data/IRef-VLA/Scannet`.

- **ReferIt3D**: We provide the subsets of ReferIt3D used for the benchmark in `data/referit3d`.
Extract the IRef-VLA and the captions data into the same folder. The final folder structure should look like this:

```
data/
  IRef-VLA/
    Scannet/
      scene0000_00/
        instance_crops/
        scene0000_00_free_space_pc_result.ply
        scene0000_00_...
      scene0000_01/
        instance_crops/
        scene0000_01_free_space_pc_result.ply
        scene0000_01_...
      ...
  captions/
  referit3d/
```
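After extraction, a quick sanity check can confirm the layout matches the instructions above. This snippet is illustrative and not part of the repository; the expected sub-paths follow the structure described in this section:

```python
from pathlib import Path

# Expected top-level entries under the data folder for SORT3D-Bench
# (per the extraction instructions above).
EXPECTED = [
    "IRef-VLA/Scannet",
    "captions",
    "referit3d",
]

def missing_entries(data_root):
    """Return the expected sub-paths that are absent under data_root."""
    root = Path(data_root)
    return [p for p in EXPECTED if not (root / p).is_dir()]

if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Data layout looks good.")
```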
ROS Bag Files for SORT3D-Nav
We provide ROS bag files for both the wheelchair and mecanum platforms. To download, install boto3 and tqdm:

```shell
pip install boto3 tqdm
```

Then run

```shell
python data/download_rosbag.py --download_path bagfiles --platform [wheelchair|mecanum]
```

making sure to pick the correct platform. Each ROS bag will be downloaded as a zip file in `bagfiles/`. Unzip the bag files into your directory of choice before replaying them. The wheelchair bag file is currently available; the mecanum bag file will be released together with the mecanum version of SORT3D-Nav.
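Once unzipped, a quick check can confirm the bag files are in place before replaying. This helper is illustrative and not part of the repository; it relies on the fact that a ROS 2 bag is a directory containing a `metadata.yaml`, while a ROS 1 (Noetic) bag is a standalone `.bag` file:

```python
from pathlib import Path

def find_bags(root):
    """Locate ROS bags under `root`.

    ROS 2 bags are directories containing a metadata.yaml;
    ROS 1 bags are standalone .bag files.
    """
    root = Path(root)
    ros2_bags = sorted({p.parent for p in root.rglob("metadata.yaml")})
    ros1_bags = sorted(root.rglob("*.bag"))
    return ros2_bags, ros1_bags

if __name__ == "__main__":
    ros2_bags, ros1_bags = find_bags("bagfiles")
    print(f"Found {len(ros2_bags)} ROS 2 bag(s), {len(ros1_bags)} ROS 1 bag(s)")
```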
System Requirements
Hardware Requirements
SORT3D-Nav has been deployed on an Nvidia RTX 4090 with 24GB of VRAM to run the live captioning model on the wheelchair, and on an Nvidia RTX 4090 with 16GB of VRAM to run the live captioning model on the mecanum-wheeled robot. The system requires a minimum of:
- 10GB of VRAM to run the semantic mapping module along with live captioning.
- 7GB of VRAM to run using ground truth semantics with live captioning.
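The VRAM thresholds above can be summarized in a small helper. This is purely illustrative (the function and mode names are hypothetical, not part of the repository's configuration):

```python
def pick_config(vram_gb: float) -> str:
    """Map available VRAM (GB) to a runnable SORT3D-Nav configuration,
    following the minimum requirements listed above."""
    if vram_gb >= 10:
        # Enough for the semantic mapping module plus live captioning.
        return "semantic_mapping+live_captioning"
    if vram_gb >= 7:
        # Fall back to ground truth semantics with live captioning.
        return "ground_truth_semantics+live_captioning"
    return "insufficient_vram"
```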
If you have more VRAM, you can increase `captioner_batch_size` in the run scripts for faster captioning throughput (or decrease it if you have less).
The language planner additionally requires a WiFi connection on the robot to connect to the LLM API.
