SpatialEval
[NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning abilities of MLLMs and LLMs
Welcome to the official codebase for Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models.
💥 News 💥
- [2024.09.25] 🎉 SpatialEval has been accepted to NeurIPS 2024!
- [2024.09.16] 🌟 SpatialEval has been included in Eureka from Microsoft Research!
- [2024.06.21] 📢 SpatialEval is now publicly available on arXiv!
🤔 About SpatialEval
SpatialEval is a comprehensive benchmark for evaluating spatial intelligence in LLMs and VLMs across four key dimensions:
- Spatial relationships
- Positional understanding
- Object counting
- Navigation
Benchmark Tasks
- Spatial-Map: Understanding spatial relationships between objects in map-based scenarios
- Maze-Nav: Testing navigation through complex environments
- Spatial-Grid: Evaluating spatial reasoning within structured environments
- Spatial-Real: Assessing real-world spatial understanding
Each task supports three input modalities:
- Text-only (TQA)
- Vision-only (VQA)
- Vision-Text (VTQA)

🚀 Quick Start
📍 Load Dataset
SpatialEval provides three input modalities—TQA (Text-only), VQA (Vision-only), and VTQA (Vision-text)—across four tasks: Spatial-Map, Maze-Nav, Spatial-Grid, and Spatial-Real. Each modality and task is easily accessible via Hugging Face. Ensure you have installed the Hugging Face datasets package:
from datasets import load_dataset
tqa = load_dataset("MilaWang/SpatialEval", "tqa", split="test")
vqa = load_dataset("MilaWang/SpatialEval", "vqa", split="test")
vtqa = load_dataset("MilaWang/SpatialEval", "vtqa", split="test")
📈 Evaluate SpatialEval
SpatialEval supports any evaluation pipeline compatible with language models and vision-language models. For text-based prompts, use the text column with this structure:
{text} First, provide a concise answer in one sentence. Then, elaborate on the reasoning behind your answer in a detailed, step-by-step explanation.
The image input is in the image column, and the correct answers are available in the oracle_answer, oracle_option, and oracle_full_answer columns.
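As a minimal sketch of the prompt structure above (the text column name and the instruction suffix come from the template; the example record here is purely illustrative, not a real SpatialEval item):

```python
# Minimal sketch: append the reasoning instruction to the question text.
# The "text" column name and instruction wording follow the template above;
# the example record below is illustrative only.
INSTRUCTION = (
    "First, provide a concise answer in one sentence. Then, elaborate on "
    "the reasoning behind your answer in a detailed, step-by-step explanation."
)

def build_prompt(example: dict) -> str:
    """Build the full text prompt for one dataset example."""
    return f"{example['text']} {INSTRUCTION}"

# Illustrative (hypothetical) record:
example = {"text": "Which object is northeast of the library?"}
print(build_prompt(example))
```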
Next, we provide full scripts for inference and evaluation.
Install
- Clone this repository
git clone git@github.com:jiayuww/SpatialEval.git
- Install dependencies
To run vision-language models such as LLaVA and Bunny, install LLaVA and Bunny. Install FastChat for language model inference. For Bunny variants, merge the LoRA weights into the base LLMs before running inference.
💬 Running Inference
For language models, for example, to run Llama-3-8B on all four tasks:
# Run on all tasks
python inference_lm.py \
--task "all" \
--mode "tqa" \
--w_reason \
--model-path "meta-llama/Meta-Llama-3-8B-Instruct" \
--output_folder outputs \
--temperature 0.2 \
--top_p 0.9 \
--repetition_penalty 1.0 \
--max_new_tokens 512 \
--device "cuda"
# For specific tasks, replace "all" with:
# - "spatialmap"
# - "mazenav"
# - "spatialgrid"
# - "spatialreal"
For vision-language models, for example, to run LLaVA-1.6-Mistral-7B across all tasks:
# VQA mode
python inference_vlm.py \
--mode "vqa" \
--task "all" \
--model_path "liuhaotian/llava-v1.6-mistral-7b" \
--w_reason \
--temperature 0.2 \
--top_p 0.9 \
--repetition_penalty 1.0 \
--max_new_tokens 512 \
--device "cuda"
# For VTQA mode, use --mode "vtqa"
Example bash scripts are available in the scripts/ folder. For more configurations, see configs/inference_configs.py. VLMs support tqa, vqa, and vtqa modes, while LMs support tqa only. Tasks include all four tasks or individual tasks like spatialmap, mazenav, spatialgrid, and spatialreal.
We can also test on just the first k examples, for example, the first 100 samples for each question type in each task, by specifying --first_k 100.
📊 Evaluation
We use exact match for evaluation. For example, to evaluate the Spatial-Map task on all three input modalities (TQA, VQA, and VTQA):
# For TQA on Spatial-Map
python evals/evaluation.py --mode 'tqa' --task 'spatialmap' --output_folder 'outputs/' --dataset_id 'MilaWang/SpatialEval' --eval_summary_dir 'eval_summary'
# For VQA on Spatial-Map
python evals/evaluation.py --mode 'vqa' --task 'spatialmap' --output_folder 'outputs/' --dataset_id 'MilaWang/SpatialEval' --eval_summary_dir 'eval_summary'
# For VTQA on Spatial-Map
python evals/evaluation.py --mode 'vtqa' --task 'spatialmap' --output_folder 'outputs/' --dataset_id 'MilaWang/SpatialEval' --eval_summary_dir 'eval_summary'
Evaluation can also be configured for the other tasks: mazenav, spatialgrid, and spatialreal. Further details are in evals/evaluation.py.
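The exact-match metric used above can be sketched as follows. This is a simplified illustration of the metric, not the actual logic in evals/evaluation.py; the whitespace/case normalization shown here is an assumption:

```python
# Simplified exact-match scoring: normalize both strings, then compare.
# Illustration only -- the real implementation lives in evals/evaluation.py,
# and its normalization may differ.
def exact_match(prediction: str, reference: str) -> bool:
    normalize = lambda s: s.strip().lower()
    return normalize(prediction) == normalize(reference)

def accuracy(predictions, references):
    """Fraction of predictions that exactly match their references."""
    matches = [exact_match(p, r) for p, r in zip(predictions, references)]
    return sum(matches) / len(matches)

preds = ["C", "north ", "3"]
refs = ["c", "North", "4"]
print(accuracy(preds, refs))  # 2 of 3 match
```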
💡 Dataset Generation Script
Stay tuned! The dataset generation script will be released in February 😉
⭐ Citation
If you find our work helpful, please consider citing our paper 😊
@inproceedings{wang2024spatial,
title={Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models},
author={Wang, Jiayu and Ming, Yifei and Shi, Zhenmei and Vineet, Vibhav and Wang, Xin and Li, Yixuan and Joshi, Neel},
booktitle={The Thirty-Eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}
💬 Questions
Have questions? We're here to help!
- Open an issue in this repository
- Contact us through the channels listed on our project page