# 3DGraphLLM
[ICCV 2025] 3DGraphLLM is a model that uses a 3D scene graph and an LLM to perform 3D vision-language tasks.
In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph, which serves as input for LLMs to perform 3D vision-language tasks.
<p align="center"> <img src="assets/ga.png" width="80%"> </p>

## News
- [2025.6] Our paper has been accepted for poster presentation at ICCV 2025! 🎉
- [2024.12] We release 3DGraphLLM pre-training on GT instance segmentation scene graphs.
- [2024.12] We release the 3DGraphLLM paper code.
## 🔥 Semantic relations boost LLM performance on 3D Referred Object Grounding and Dense Scene Captioning tasks
| | ScanRefer | | Multi3dRefer | | Scan2Cap | | ScanQA | | SQA3D |
|:----:|:---------:|:-------:|:------:|:------:|:---------:|:----------:|:------------:|:------:|:-----:|
| | Acc@0.25 | Acc@0.5 | F1@0.25 | F1@0.5 | CIDEr@0.5 | B-4@0.5 | CIDEr | B-4 | EM |
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.3 | 77.1 | 36.3 | 87.7 | 14.3 | <ins>54.6</ins> |
| <ins>3DGraphLLM Vicuna-1.5</ins> | <ins>58.6</ins> | <ins>53.0</ins> | <ins>61.9</ins> | <ins>57.3</ins> | <ins>79.2</ins> | <ins>34.7</ins> | <ins>91.2</ins> | 13.7 | 55.1 |
| 3DGraphLLM LLAMA3-8B | 62.4 | 56.6 | 64.7 | 59.9 | 81.0 | 36.5 | 88.8 | <ins>15.9</ins> | 55.9 |
## 🔨 Preparation
- Prepare the environment:

  ```bash
  conda create -n 3dgraphllm python=3.9.17
  conda activate 3dgraphllm
  conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
  pip install -r requirements.txt
  ```

  If you don't have root permissions to install Java (needed by the pycocoeval scripts for metrics such as BLEU and CIDEr), install it with conda:

  ```bash
  conda install -c conda-forge openjdk
  ```
- Download the LLM backbone:
  - We use LLAMA3-8B-Instruct in our experiments, which can be downloaded from Hugging Face.
  - Change `llama_model_path` in `config.py` to the path of `LLAMA3-8B-Instruct`.
- Annotations and extracted features:
  - Please follow the instructions in preprocess.
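As a concrete sketch of the backbone step above: the key name `llama_model_path` comes from this README, but the path shown is a placeholder, not the real location of your weights.

```python
# config.py (sketch) -- point llama_model_path at your local copy of the
# LLAMA3-8B-Instruct weights downloaded from Hugging Face.
# The path below is illustrative; replace it with your own download location.
llama_model_path = "/path/to/Meta-Llama-3-8B-Instruct"
```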
## 🤖 Training and Inference
- Pre-training on GT instance segmentation scene graphs.
  - Modify `run_gt_pretrain.sh`:

    ```bash
    train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
    val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    evaluate=False
    ```

    <details>
    <summary>Explanation of "train_tag" and "val_tag"</summary>

    - Use `#` to separate different datasets.
    - Datasets:
    </details>
  - Run:

    ```bash
    bash scripts/run_gt_pretrain.sh
    ```
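The `train_tag` and `val_tag` values above are plain `#`-separated lists of dataset names. A minimal sketch of how such a tag string decomposes (the variable names here are illustrative, not taken from the training code):

```python
# Split a '#'-separated tag string into individual dataset names.
train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
datasets = train_tag.split("#")
print(datasets)
# → ['scanrefer', 'scan2cap', 'scanqa', 'sqa3d', 'multi3dref', 'nr3d_caption', 'obj_align']
```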
- Training
  - Modify `run.sh`:

    ```bash
    train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
    val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    evaluate=False
    pretrained_path="outputs/llama3-8b-gt-pretrain-2/ckpt_00_28927.pth"
    ```
  - Run:

    ```bash
    bash scripts/run.sh
    ```
- Inference
  - Modify `run.sh`:

    ```bash
    val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
    evaluate=True
    pretrained_path="/path/to/pretrained_model.pth"
    ```
  - Run:

    ```bash
    bash scripts/run.sh
    ```
## 🚀 Demo
- Run:

  ```bash
  bash demo/run_demo.sh
  ```

  You will be prompted to ask different queries about Scene 435 of ScanNet.
## 📪 Contact
If you have any questions about the project, please open an issue in this repository or send an email to Tatiana Zemskova.
## 📑 Citation
If you find this work helpful, please consider citing our work as:
```bibtex
@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding},
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450},
}
```
## 😊 Acknowledgement
Thanks to the open source of the following projects:
