<h1 align="center">SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models</h1> <div align="center" margin-bottom="1em"> <a href="https://nzantout.github.io">Nader Zantout<sup>✶</sup></a>, <a href="https://HaochenZ11.github.io">Haochen Zhang<sup>✶</sup></a>, <a href="https://sites.google.com/view/pujith-kachana/">Pujith Kachana</a>, <a href="https://www.jinkaiq.com/">Jinkai Qiu</a>, <a href="https://gfchen01.cc/">Guofei Chen</a>, <a href="https://frc.ri.cmu.edu/~zhangji/">Ji Zhang</a>, <a href="http://www.wangwenshan.com/">Wenshan Wang</a> <br> <sup>✶</sup>Equal contribution<br> </div> &nbsp; <div align="center" margin-bottom="1em"> <a href="https://arxiv.org/abs/2504.18684" target="_blank"> <img src="https://img.shields.io/badge/Paper-arXiv-deepgreen" alt="Paper arXiv"></a> <a href="https://youtu.be/Jhd_ThwBSGo" target="_blank"> <img src="https://img.shields.io/badge/Video-YouTube-9966ff" alt="Video"></a> </div> &nbsp;

We propose SORT3D, an LLM-based object-centric grounding and indoor navigation system employing a spatial reasoning toolbox and state-of-the-art 2D VLMs for perception. The toolbox is capable of interpreting both direct and indirect statements about spatial relations, using an LLM for high-level reasoning and guiding the autonomous robot to navigate through the environment. It has demonstrated the best zero-shot performance on spatial reasoning benchmarks. To the best of our knowledge, this is the first implementation of a general spatial relation toolbox for autonomous vision-language navigation that is fully integrated into real-robot systems.

 

<div align="center"><img src="media/diagram.png" alt="SORT3D Diagram" width="99%"></div>

 

https://github.com/user-attachments/assets/20865dc0-1ffc-4d72-9975-508687dbbe76

This repository is set up to run both grounding evaluation on the ReferIt3D and VLA-3D benchmarks and online navigation, on both real robots and provided simulated environments. We also provide a dataset of Scannet object crops and captions generated using our pipeline.

Updates

  • [2025-06] SORT3D is accepted to IROS 2025!
  • [2025-03] We release SORT3D for offline grounding and online object-centric navigation.

Repository Structure

SORT3D has two major versions:

  1. SORT3D-Bench: The version of SORT3D used to run the ReferIt3D and the IRef-VLA benchmarks.
  2. SORT3D-Nav: The version of SORT3D used to run navigation on our robot platforms, built on top of our base autonomy stack. SORT3D is deployed on two research platforms:
    1. Our wheelchair-based robot (wheelchair), for which we have both ROS Noetic and ROS Humble versions.
    2. Our mecanum-wheeled robot (mecanum), for which we have a ROS Humble version.
       
<p align="center"> <img src="media/mecanum_wheel.jpg" height="300" /> <img src="media/wheelchair.jpg" height="300" /> </p>

&nbsp;

This repository contains a separate branch for each platform and each ROS version SORT3D-Nav is deployed on. The SORT3D-Bench script is included in the `humble-wheelchair` branch. Each version of SORT3D-Nav is accompanied by a Unity-based simulator and a ROS bag recording of the office areas where the live demonstrations were recorded. Additionally, we provide launch scripts for SORT3D-Nav using both ground-truth semantic segmentations and our live semantic mapping module. The table below summarizes the currently available systems and their respective branches:

| Platform | ROS Version | Branch | Simulation Available | Live Demo Available (Using ROS Bag) | Ground Truth Semantics Available | Semantic Mapping Module Available |
|---|---|---|---|---|---|---|
| Benchmark | - | `humble-wheelchair` | ☑️ | - | ☑️ | - |
| Wheelchair | Noetic | `noetic-wheelchair` | ☑️ | ☑️ | ☑️ | ☑️ |
| Wheelchair | Humble | `humble-wheelchair` | ☑️ | ☑️ | ☑️ | ☑️ |
| Mecanum | Humble | `humble-mecanum` | ☑️ | ☑️ | ☑️ | ☑️ |

Data

Dataset for SORT3D-Bench

To run SORT3D-Bench, ensure the following three datasets are downloaded and unzipped:

  1. Object Captions Dataset: For our benchmark, we have pregenerated 2D object crops and captions using our captioning system and Qwen2.5-VL. To download, first install boto3 and tqdm:

    pip install boto3 tqdm
    

    Then run

    python data/download_crops_dataset.py --download_path data
    

    The data will be downloaded as a zip file in data/. Unzip the file directly into data/; the path to the unzipped folder should be data/captions.

  2. IRef-VLA Scannet: We use the processed pointclouds in IRef-VLA for our benchmark. Follow the instructions in the repo and download only the Scannet subset of the data:

    python download_dataset.py --download_path data/IRef-VLA --subset scannet
    

    Afterwards, unzip Scannet.zip into data/IRef-VLA. The folder structure should be data/IRef-VLA/Scannet.

  3. ReferIt3D: We provide the subsets of ReferIt3D used for the benchmark in data/referit3d.

Extract the IRef-VLA and the captions data into the same folder. The final folder structure should look like so:

```
data/
    IRef-VLA/
        Scannet/
            scene0000_00
                instance_crops
                scene0000_00_free_space_pc_result.ply
                scene0000_00_...
            scene0000_01
                instance_crops
                scene0000_01_free_space_pc_result.ply
                scene0000_01_...
            ...
    referit3d/
```
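Before running the benchmark, it can help to verify that the expected directories are in place. The following is an illustrative sketch, not part of the SORT3D codebase; the function name and the choice of checked paths are ours, based on the layout described above:

```python
# Sanity check for the SORT3D-Bench data layout described in this README.
# Illustrative helper only -- not a script shipped with the repository.
from pathlib import Path


def check_bench_data(root: str = "data") -> list[str]:
    """Return the list of missing paths; an empty list means the layout looks right."""
    root_path = Path(root)
    required = [
        root_path / "captions",              # pregenerated object crops + captions
        root_path / "IRef-VLA" / "Scannet",  # processed ScanNet point clouds
        root_path / "referit3d",             # ReferIt3D benchmark subsets
    ]
    return [str(p) for p in required if not p.is_dir()]


if __name__ == "__main__":
    missing = check_bench_data()
    print("Missing:" if missing else "Data layout OK", ", ".join(missing))
```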

ROS Bag Files for SORT3D-Nav

We provide ROS bag files for both the wheelchair and mecanum platforms. To download, install boto3 and tqdm:

pip install boto3 tqdm

Then run

python data/download_rosbag.py --download_path bagfiles --platform [wheelchair|mecanum]

while making sure to pick the correct platform. Each ROS bag will be downloaded as a zip file in bagfiles/. Unzip the bag files into your directory of choice before replaying them. The wheelchair bag file is currently available; the mecanum bag file will be released alongside the mecanum version of SORT3D-Nav.
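The downloaded zip files can be bulk-extracted with a short script. This is an illustrative sketch using only the Python standard library; the function name and default paths are our assumptions, not part of the repository:

```python
# Illustrative helper (not shipped with SORT3D): extract every downloaded
# ROS bag zip from bagfiles/ into a target directory before replaying.
import zipfile
from pathlib import Path


def extract_bags(zip_dir: str = "bagfiles",
                 out_dir: str = "bagfiles/extracted") -> list[str]:
    """Extract every .zip under zip_dir into out_dir; return the zip names handled."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(out)
        extracted.append(zip_path.name)
    return extracted
```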

System Requirements

Hardware Requirements

SORT3D-Nav has been deployed on an Nvidia RTX 4090 with 24GB of VRAM to run the live captioning model on the wheelchair, and on an Nvidia RTX 4090 with 16GB of VRAM to run the live captioning model on the mecanum-wheeled robot. The system requires a minimum of:

  • 10GB of VRAM to run the semantic mapping module along with live captioning.
  • 7GB of VRAM to run using ground truth semantics with live captioning.

If you have more VRAM, you may increase the captioner_batch_size in the run scripts to get faster captioning throughput (and vice versa).
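The rule of thumb above (scale captioner_batch_size up or down with available VRAM) can be sketched as a small heuristic. The thresholds, the cap, and the function name below are purely illustrative assumptions; only the 7GB/10GB minimums come from the requirements listed above:

```python
# Hypothetical helper: pick a captioner_batch_size from free VRAM, following
# the rule of thumb that more VRAM allows a larger captioning batch.
# The doubling interval and cap are illustrative, not values from SORT3D.
def pick_captioner_batch_size(free_vram_gb: float, base: int = 4) -> int:
    """Scale batch size with VRAM beyond the ~10GB needed for full semantic mapping."""
    if free_vram_gb < 7:
        raise ValueError("SORT3D-Nav requires at least 7GB of VRAM")
    # One doubling for every ~6GB above the 10GB baseline, capped at 32.
    extra_doublings = max(0, int((free_vram_gb - 10) // 6))
    return min(base * (2 ** extra_doublings), 32)
```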

The language planner additionally requires a WiFi connection on the robot to connect to the LLM API.
