# LaSP

[EMNLP'25] Code for the paper `Language-to-Space Programming for Training-Free 3D Visual Grounding`.
Boyu Mi, Hanqing Wang, Tai Wang, Yilun Chen, Jiangmiao Pang
Shanghai Artificial Intelligence Laboratory
EMNLP 2025
## Environment Installation
```bash
pip install -r requirements.txt
```
Set your OpenAI API key:

```bash
export OPENAI_API_KEY=your_api_key
```
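If the pipeline later fails with authentication errors, you can quickly confirm the key is visible to Python. This is a minimal sketch; the only assumption is the `OPENAI_API_KEY` variable name from the export above:

```python
import os

# Fail fast if the OpenAI API key was not exported in this shell.
key = os.environ.get("OPENAI_API_KEY")
if not key:
    raise RuntimeError("OPENAI_API_KEY is not set; run `export OPENAI_API_KEY=...` first.")
print(f"OPENAI_API_KEY found (length {len(key)}).")
```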
## Data Preparation
The `data/` directory should be organized as follows:
```
data
├── frames
│   ├── color
│   │   ├── 0.png
│   │   ├── 20.png
│   │   └── ...
├── referit3d
│   ├── annotations
│   ├── scan_data
├── symbolic_exp
│   ├── nr3d.jsonl
│   ├── scanrefer.json
├── test_data
│   ├── above
│   ├── behind
│   ├── ...
├── seg
├── nr3d_masks
├── scanrefer_masks
├── feats_3d.pkl
├── tables.pkl
```
- `frames`: RGB images of the scenes. (download_link)
- `referit3d`: processed ReferIt3D dataset from vil3dref.
- `symbolic_exp`: symbolic expressions.
- `test_data`: test data for code generation.
- `seg`: segmentation results of 3D point clouds for ScanRefer. (download_link)
- `nr3d_masks`: 2D ground-truth object masks. (download_link)
- `scanrefer_masks`: 2D predicted object masks. (download_link)
- `feats_3d.pkl`: predicted object labels for Nr3D, from ZSVG3D.
- `tables.pkl`: tables for code generation. (download_link)

A quick sanity check of this layout is sketched after this list.
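To catch missing downloads early, here is a minimal sketch that checks the layout above. The path list simply mirrors the tree; trim it if you only need one benchmark:

```python
from pathlib import Path

# Expected entries under data/, mirroring the tree above.
EXPECTED = [
    "frames/color",
    "referit3d/annotations",
    "referit3d/scan_data",
    "symbolic_exp/nr3d.jsonl",
    "symbolic_exp/scanrefer.json",
    "test_data",
    "seg",
    "nr3d_masks",
    "scanrefer_masks",
    "feats_3d.pkl",
    "tables.pkl",
]

root = Path("data")
missing = [p for p in EXPECTED if not (root / p).exists()]
if missing:
    print("Missing entries:\n  " + "\n  ".join(missing))
else:
    print("data/ layout looks complete.")
```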
## (Optional) Relation Encoder Generation
Run `src/relation_encoders/run_optim.py` to generate relation encoders for the seven relations: left, right, between, corner, above, below, and behind.
After the optimization finishes, the relation encoders and their accuracy on the test cases are written to `data/test_data/{relation_name}/trajs`.
You can then select the best relation encoder for each relation for evaluation; a hypothetical selection script is sketched below.
Alternatively, you can use the provided relation encoders in `src/relation_encoders`.
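The selection step can be scripted. The sketch below is hypothetical: it assumes each trajectory directory under `trajs` contains a `result.json` with an `accuracy` field, which is not guaranteed by the repo; adapt it to the actual output format of `run_optim.py`.

```python
import json
from pathlib import Path

RELATIONS = ["left", "right", "between", "corner", "above", "below", "behind"]

# Hypothetical: assumes each trajectory directory stores its test accuracy
# in a result.json file; check run_optim.py's actual output format.
for rel in RELATIONS:
    trajs = Path(f"data/test_data/{rel}/trajs")
    results = [
        (json.loads(f.read_text()).get("accuracy", 0.0), f.parent)
        for f in trajs.glob("*/result.json")
    ]
    if results:
        acc, best = max(results)
        print(f"{rel}: best encoder {best.name} (accuracy {acc:.3f})")
```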
## (Optional) Feature Computation
```bash
python -m src.relation_encoders.compute_features \
    --dataset scanrefer \
    --output $OUTPUT_DIR \
    --label pred
```
The `--dataset` option can be `scanrefer` or `nr3d`; the `--label` option can be `gt` or `pred`.
Currently only the `pred` label is supported for ScanRefer, since its standard evaluation protocol provides no ground-truth labels.
After the script finishes, the features are written as `.pth` files to the `$OUTPUT_DIR` directory.
You can also download our prepared features: nr3d (pred label), nr3d (gt label), scanrefer.
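To inspect a downloaded or freshly computed feature file, here is a minimal sketch; the per-scene dictionary structure is an assumption based on the file names, so verify it against the evaluation code:

```python
import torch

# Hypothetical structure: a dict keyed by scene id, inferred from the
# "features_per_scene" naming; verify against src/eval.
features = torch.load("output/nr3d_features_per_scene_pred_label.pth", map_location="cpu")
print(f"{len(features)} scenes")
scene_id, scene_feats = next(iter(features.items()))
print(scene_id, type(scene_feats))
```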
## Evaluation
Nr3D evaluation:

```bash
python -m src.eval.eval_nr3d \
    --features_path output/nr3d_features_per_scene_pred_label.pth \
    --top_k 5 \
    --threshold 0.9 \
    --label_type pred \
    --use_vlm
```
ScanRefer evaluation:

```bash
python -m src.eval.eval_scanrefer \
    --features_path output/scanrefer_features_per_scene.pth \
    --top_k 5 \
    --threshold 0.1 \
    --use_vlm
```
Change `--features_path` and `--label_type` if you'd like to evaluate with ground-truth labels.
Set `--use_vlm`, `--top_k`, and `--threshold` to use the VLM model during evaluation.
Please refer to our paper for the meaning of these parameters.
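If you want to study sensitivity to `--top_k` and `--threshold`, the runs can be scripted. A minimal sketch follows; the value grids are arbitrary examples, while the flags come from the commands above:

```python
import subprocess

# Arbitrary example grids; flag names are taken from the evaluation commands above.
for top_k in (3, 5):
    for threshold in (0.1, 0.5, 0.9):
        subprocess.run(
            [
                "python", "-m", "src.eval.eval_nr3d",
                "--features_path", "output/nr3d_features_per_scene_pred_label.pth",
                "--top_k", str(top_k),
                "--threshold", str(threshold),
                "--label_type", "pred",
                "--use_vlm",
            ],
            check=True,
        )
```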
## Acknowledgements

Thanks to the following repositories for their contributions:
