CityAnchor: City-scale 3D Visual Grounding with Multi-modality LLMs [ICLR'25]
This is the official PyTorch implementation of CityAnchor.
Abstract

We provide a Colab template for quick and easy access to CityAnchor.
Schedule
To facilitate related research, we plan to make CityAnchor open source, including but not limited to the following:
- [x] Create an easy-to-use demo. Everyone can use CityAnchor!
- [x] Provide the weights of pre-trained CityAnchor model (7B).
- [ ] Provide the weights of pre-trained CityAnchor model (13B).
- [x] Release the code, including training and evaluation scripts.
- [x] Release the training data and evaluation data.
- [x] Expand CityAnchor for more city-scale datasets.
- [x] Create easy-to-follow dataloader script for grounding on your own dataset.
- [ ] Achieve more interesting work.
💾 Dataset Download and Processing
Skip the data preparation
We provide all the prepared data on Google Drive. Download the files, place them in the ./data directory, and update the corresponding paths in ./lib/config.py. You will then be ready to run the demo and to train and test the model. Training data and evaluation data are available at meta data.
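If you skip preparation, the downloaded files just need to land in the layout the code expects. A hypothetical shell sketch (directory names taken from the trees in the Prepare data section below; the mv line is a placeholder for your actual download location):

```shell
# Create the directory skeleton the repo expects under ./data
mkdir -p data/cityrefer/meta_data data/cityrefer/box3d
mkdir -p data/sensaturban/scans
mkdir -p data/sensaturban/pointgroup_data/balance_split/random-50_crop-250
# ... then move the downloaded files into place, e.g.:
# mv ~/Downloads/CityRefer_train.json data/cityrefer/meta_data/
ls -R data
```

Remember to point the paths in ./lib/config.py at this data root.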
Prepare data
1) Please download the CityRefer dataset and organize its data as shown below.
CityAnchor
├── data
│ ├── cityrefer
│ │ ├── meta_data
│ │ │ ├── CityRefer_train.json
│ │ │ └── CityRefer_val.json
│ │ ├── box3d
│ │ │ └── [scene_id]_bbox.json
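A quick way to sanity-check the downloaded meta data is to load each JSON file and count its top-level entries. This is a minimal sketch, assuming only that the files are valid JSON; the helper name is our own:

```python
import json
from pathlib import Path

def count_entries(path: str):
    """Return the number of top-level entries in a JSON file, or None if absent."""
    p = Path(path)
    if not p.exists():
        return None
    with p.open() as f:
        data = json.load(f)
    return len(data)

for split in ("CityRefer_train.json", "CityRefer_val.json"):
    n = count_entries(f"data/cityrefer/meta_data/{split}")
    print(split, "entries:", n if n is not None else "file not found")
```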
2) Please download the SensatUrban dataset and its segs data. Then organize the *.ply and *.segs.json files in the scans folder as follows:
CityAnchor
├── data
│ ├── sensaturban
│ │ ├── scans
│ │ │ ├── birmingham_block_1
│ │ │ │ ├── birmingham_block_1.ply
│ │ │ │ └── birmingham_block_1.segs.json
3) Perform data preprocessing and data augmentation (optional).
cd data/sensaturban
sh prepare_data.sh
4) Please use the pre-trained Uni3D-L model to extract 3D attribute features for each candidate object (for convenience, we provide the extracted 3D object attribute features as .json files). Please also download the top-view map rasterized from the RGB point cloud (.tif) and the landmark features (.json). Finally, organize them in the pointgroup_data folder as follows:
CityAnchor
├── data
│ ├── sensaturban
│ │ ├── pointgroup_data
│ │ │ ├── balance_split
│ │ │ │ ├── random-50_crop-250
│ │ │ │ │ ├── birmingham_block_1.json
│ │ │ │ │ ├── birmingham_block_1.pth
│ │ │ │ │ ├── birmingham_block_1.tif
│ │ │ │ │ └── birmingham_block_1_landmark.json
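Each block should contribute four files to the random-50_crop-250 folder: attribute features (.json), point data (.pth), a top-view map (.tif) and landmark features (_landmark.json). A minimal sketch (our own helper, not part of the repo) to spot missing files before training:

```python
from pathlib import Path

# Expected per-block file suffixes in random-50_crop-250.
SUFFIXES = (".json", ".pth", ".tif", "_landmark.json")

def missing_files(folder, block):
    """Return the expected files for `block` that are absent from `folder`."""
    folder = Path(folder)
    return [f"{block}{s}" for s in SUFFIXES
            if not (folder / f"{block}{s}").exists()]

data_dir = "data/sensaturban/pointgroup_data/balance_split/random-50_crop-250"
print(missing_files(data_dir, "birmingham_block_1"))
```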
💻 Requirements
The code has been tested on:
- Ubuntu 20.04
- CUDA 12.2
- Python 3.10
- PyTorch 2.1.0
- NVIDIA A100 GPU (40 GB)
🔧 Installation
- Create and activate the conda environment:
conda create -n CityAnchor python=3.10
conda activate CityAnchor
- Install the necessary packages:
pip install -r requirements.txt
pip install deepspeed==0.15.1
pip install --upgrade gradio
pip install --upgrade "jax[cuda12]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
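Before running anything heavy, it can help to confirm the environment roughly matches the tested configuration. A minimal sketch (version pins taken from the Requirements section and the install commands above; packages are imported only if installed):

```python
import importlib

def version_tuple(v: str):
    """Parse a dotted version string like '2.1.0' (local tags such as '+cu121' are dropped)."""
    return tuple(int(piece) for piece in v.split("+")[0].split(".") if piece.isdigit())

# Versions this repo was tested against / pinned at install time.
TESTED = {"torch": "2.1.0", "deepspeed": "0.15.1"}

for name, wanted in TESTED.items():
    if importlib.util.find_spec(name) is None:
        print(f"{name}: not installed")
        continue
    installed = importlib.import_module(name).__version__
    status = "ok" if version_tuple(installed) >= version_tuple(wanted) else "too old"
    print(f"{name}: {installed} ({status})")
```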
🔦 Demo
You can try CityAnchor with the pre-trained CityAnchor model (and the pre-trained ROI model) via the following command:
python Gradio_Demo_CityAnchor.py --version="PATH_TO_CityAnchor_MODEL" \
--version_stage_1="PATH_TO_ROI_MODEL"
After that, choose the scene and input the object description (for example, "birmingham_block_5" and "The big oval green ground that is surrounded by the oval brown running course that is nearby the Perry Barr Greyhound Stadium."). You should then obtain the object matching your description.

🚅 Train
You can train CityAnchor from a pre-trained LLM (VLM) backbone. Training takes only about 12 hours and achieves 3D visual grounding in urban scenes larger than 500 m × 500 m with more than 400 objects. Note that the RoI segmentation backbone model (SAM) is available at backbone model.
# Choose --dataset from "cityrefer", "cityanchor", "urbanbis-refer" or "whu-refer".
deepspeed --master_port=24999 Train_CityAnchor.py \
--dataset_dir='./dataset' \
--vision_pretrained="./sam_vit_h_4b8939.pth" \
--dataset="cityrefer" \
--sample_rates="1" \
--exp_name="CityAnchor_Train_Model_on_CityRefer_Dataset_v1" \
--epochs=6 \
--steps_per_epoch=200 \
--reason_seg_data='SensatUrban-LISA-EX|train' \
--explanatory=-1 \
--no_eval
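With these flags the run performs a fixed optimization budget rather than iterating over the full dataset each epoch. A quick sanity check of that budget (values taken from the flags above):

```python
# Training budget implied by --epochs and --steps_per_epoch above.
epochs = 6
steps_per_epoch = 200
total_steps = epochs * steps_per_epoch
print(total_steps)  # → 1200
```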
When the training process finishes, convert the DeepSpeed checkpoint into the full model weight:
cd ./runs/CityAnchor_Train_Model_on_CityRefer_Dataset_v1/ckpt_model
python zero_to_fp32.py . ../pytorch_model.bin
Then, merge the LoRA weights in "pytorch_model.bin" into the base model and save the final CityAnchor model in Hugging Face format at your desired path:
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
--version="PATH_TO_BASED_MODEL" \
--weight="PATH_TO_pytorch_model.bin" \
--save_path="PATH_TO_SAVED_MODEL"
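Conceptually, merging LoRA folds the low-rank update into the base weight, W' = W + (alpha / r) · B · A, so no adapter is needed at inference time. A toy pure-Python sketch of that arithmetic (not the repo's actual merge code; names and shapes are illustrative):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA update into the base weight: W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

# Toy example: 2x2 base weight, rank-1 LoRA factors.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]          # (out_dim x r)
A = [[0.5, 0.5]]            # (r x in_dim)
print(merge_lora(W, A, B, alpha=2, r=1))  # → [[2.0, 1.0], [2.0, 3.0]]
```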
Friendly reminder: we typically use xinlai/LISA-7B-v1 (recommended) or LLaVA-Lightning-7B-v1-1 as the base model to merge into.
✏️ Evaluation
You can evaluate grounding performance with the following commands. Please put the pre-trained model in the folder /path/to/checkpoints.
CUDA_VISIBLE_DEVICES=0 python Test_CityAnchor_on_cityrefer_dataset.py \
--version="PATH_TO_CityAnchor_MODEL" \
--version_stage_1="PATH_TO_ROI_MODEL"
CUDA_VISIBLE_DEVICES=0 python Test_CityAnchor_on_cityanchor_dataset.py \
--version="PATH_TO_CityAnchor_MODEL" \
--version_stage_1="PATH_TO_ROI_MODEL"
The checkpoints (.pth) for the cityanchor dataset are available at Grounding Model and ROI Model.
CUDA_VISIBLE_DEVICES=0 python Test_CityAnchor_on_urbanbis_refer_dataset.py \
--version="PATH_TO_CityAnchor_MODEL" \
--version_stage_1="PATH_TO_ROI_MODEL"
The checkpoints (.pth) for the urbanbis-refer dataset are available at Grounding Model and ROI Model.
# Generalization test
CUDA_VISIBLE_DEVICES=0 python Test_CityAnchor_on_whu_refer_dataset.py \
--version="PATH_TO_CityAnchor_MODEL" \
--version_stage_1="PATH_TO_ROI_MODEL"
The checkpoints (.pth, trained on the urbanbis-refer dataset) for the whu-refer dataset are available at Grounding Model and ROI Model.
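The evaluation scripts print their own metrics; as background, 3D visual grounding results are commonly reported as the fraction of predictions whose box overlaps the ground truth above an IoU threshold (e.g. Acc@0.25, Acc@0.5). A minimal sketch of that metric for axis-aligned boxes (our own helpers, not the repo's evaluation code):

```python
def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0          # no overlap along this axis
        inter *= hi - lo
    vol = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    return inter / (vol(a) + vol(b) - inter)

def accuracy_at(preds, gts, thresh):
    """Fraction of predicted boxes with IoU >= thresh against their ground truth."""
    hits = sum(iou_3d(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(preds)

preds = [(0, 0, 0, 2, 2, 2), (5, 5, 5, 6, 6, 6)]
gts   = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
print(accuracy_at(preds, gts, 0.5))  # → 0.5
```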
🤝 Acknowledgement
CityAnchor is built upon the extremely wonderful LISA, Uni3D, CityRefer and DeepSpeed.
Contact us
If you find this repo helpful, please give us a star. For any questions, please contact us via lijp57@whu.edu.cn.
