
3DGraphLLM

arXiv | Hugging Face

In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph, which serves as input for LLMs to perform 3D vision-language tasks.
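To make the idea above concrete, here is a minimal, purely illustrative Python sketch (not the repository's actual code) of turning a 3D scene graph into a flat token-like sequence an LLM could consume: each object contributes its identifier token followed by a few (relation, neighbor) pairs. All names and the token format here are assumptions for illustration only.

```python
# Hypothetical sketch: flattening a 3D scene graph (objects + semantic
# relations) into a token-like string for an LLM prompt.

def flatten_scene_graph(nodes, edges, k=2):
    """For each object, emit its identifier token followed by up to k
    (relation, neighbor) pairs, mimicking a nearest-neighbor subgraph.

    nodes: dict {object_id: semantic_label}
    edges: list of (subject_id, relation, object_id) triplets
    """
    tokens = []
    for obj_id, label in nodes.items():
        tokens.append(f"<OBJ{obj_id}:{label}>")
        # keep only the first k outgoing relations of this object
        neighbors = [(rel, dst) for (src, rel, dst) in edges if src == obj_id][:k]
        for rel, dst in neighbors:
            tokens.append(f"<{rel}:OBJ{dst}>")
    return " ".join(tokens)

nodes = {0: "chair", 1: "table", 2: "lamp"}
edges = [(0, "next_to", 1), (2, "on", 1)]
print(flatten_scene_graph(nodes, edges))
# → <OBJ0:chair> <next_to:OBJ1> <OBJ1:table> <OBJ2:lamp> <on:OBJ1>
```

In the actual model the object and relation slots are learnable embeddings rather than plain strings, but the sequence structure is the same in spirit.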

<p align="center"> <img src="assets/ga.png" width="80%"> </p>

News

[2025.6] We are pleased to inform you that our paper has been accepted for poster presentation at ICCV 2025! 🎉

[2024.12] We release 3DGraphLLM pre-training on GT instance segmentation scene graphs.

[2024.12] We release the 3DGraphLLM paper code.

🔥 Semantic relations boost LLM performance on 3D Referred Object Grounding and Dense Scene Captioning tasks

| | ScanRefer | | Multi3dRefer | | Scan2Cap | | ScanQA | | SQA3D |
|:----:|:---------:|:-------:|:------:|:------:|:---------:|:----------:|:------:|:-----:|:-----:|
| | Acc@0.25 | Acc@0.5 | F1@0.25 | F1@0.5 | CIDEr@0.5 | B-4@0.5 | CIDEr | B-4 | EM |
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.3 | 77.1 | 36.3 | 87.7 | 14.3 | <ins>54.6</ins> |
| <ins>3DGraphLLM Vicuna-1.5</ins> | <ins>58.6</ins> | <ins>53.0</ins> | <ins>61.9</ins> | <ins>57.3</ins> | <ins>79.2</ins> | <ins>34.7</ins> | <ins>91.2</ins> | 13.7 | 55.1 |
| 3DGraphLLM LLAMA3-8B | 62.4 | 56.6 | 64.7 | 59.9 | 81.0 | 36.5 | 88.8 | <ins>15.9</ins> | 55.9 |

🔨 Preparation

  • Prepare the environment:

    conda create -n 3dgraphllm python=3.9.17
    conda activate 3dgraphllm
    conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
    pip install -r requirements.txt
    
  • If you don't have root permissions to install Java (needed by the pycocoeval scripts for metrics such as BLEU and CIDEr), install it with conda:

    conda install -c conda-forge openjdk
    
  • Download the LLM backbone:

    • We use LLAMA3-8B-Instruct in our experiments, which can be downloaded from Hugging Face.

    • Change the llama_model_path in config.py to the path of LLAMA3-8B-Instruct.

  • Annotations and extracted features:

    Please follow the instructions in preprocess.
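As a concrete example of the config edit mentioned above, the `llama_model_path` entry in config.py should point at your local copy of the model. The path below is a placeholder, not a real location; the surrounding contents of config.py are not shown here.

```python
# In config.py: point llama_model_path at the directory where you
# downloaded LLAMA3-8B-Instruct from Hugging Face (example path only).
llama_model_path = "/data/checkpoints/Meta-Llama-3-8B-Instruct"
```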

🤖 Training and Inference

  • Pre-training on GT instance segmentation scene graphs.

    • Modify run_gt_pretrain.sh:

      train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=False
      
      <details> <summary> Explanation of "train_tag" and "val_tag" </summary>
      • Use # to separate different datasets

      • Datasets:

        • scanrefer: ScanRefer Dataset
        • scan2cap: Scan2Cap Dataset
        • scanqa: ScanQA Dataset
        • sqa3d: SQA3D Dataset
        • multi3dref: Multi3dRefer Dataset
        • nr3d_caption: A captioning dataset derived from Nr3D.
        • obj_align: A dataset derived from ScanRefer to align the object identifiers with object tokens.
      </details>
    • Run: bash scripts/run_gt_pretrain.sh
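The `#`-separated tag strings above resolve to individual dataset names. A small hypothetical helper (the repository's actual loader may differ) shows the parsing and a sanity check against the dataset list documented above:

```python
# Hypothetical helper: split a '#'-separated tag string into dataset
# names and validate them against the datasets listed in this README.
KNOWN_DATASETS = {"scanrefer", "scan2cap", "scanqa", "sqa3d",
                  "multi3dref", "nr3d_caption", "obj_align"}

def parse_tags(tag_string):
    names = [t for t in tag_string.split("#") if t]  # drop empty pieces
    unknown = [n for n in names if n not in KNOWN_DATASETS]
    if unknown:
        raise ValueError(f"unknown dataset tag(s): {unknown}")
    return names

train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
print(parse_tags(train_tag))
# → ['scanrefer', 'scan2cap', 'scanqa', 'sqa3d', 'multi3dref', 'nr3d_caption', 'obj_align']
```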

  • Training

    • Modify run.sh:
      train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=False
      pretrained_path="outputs/llama3-8b-gt-pretrain-2/ckpt_00_28927.pth"
      
    • Run: bash scripts/run.sh
  • Inference

    • Modify run.sh:

      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=True
      pretrained_path="/path/to/pretrained_model.pth"
      
    • Run: bash scripts/run.sh

🚀 Demo

  • Run: bash demo/run_demo.sh. You will be prompted to ask different queries about Scene 435 of ScanNet.

📪 Contact

If you have any questions about the project, please open an issue in this repository or send an email to Tatiana Zemskova.

📑 Citation

If you find this work helpful, please consider citing our work as:

@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding}, 
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450}, 
}

😊 Acknowledgement

Thanks to the following open-source projects:

Chat-Scene
