# See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Models

<p align="left">
  <img src="https://img.shields.io/badge/NeurIPS-2025-8A2BE2.svg"/>
  <img src="https://img.shields.io/badge/arXiv-2509.16087-b31b1b.svg"/>
  <img src="https://img.shields.io/badge/Status-Available-brightgreen"/>
  <img src="https://img.shields.io/badge/License-MIT-blue"/>
</p>
Official codebase for <u>See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Models</u>
🎉 Our paper has been accepted to NeurIPS 2025.
## 📝 Abstract
See&Trek is a training-free and GPU-free spatial prompting framework designed to fundamentally enhance the spatial reasoning capabilities of Multimodal Large Language Models (MLLMs). It addresses two key bottlenecks in existing MLLMs: (1) visual homogeneity caused by uniform frame sampling, and (2) unknown motion due to missing ego-motion cues.
See&Trek achieves this by (i) extracting semantic-rich keyframes using off-the-shelf perception models and (ii) reconstructing camera trajectories via Visual Odometry to annotate keyframes with explicit motion information. Without any model modification or fine-tuning, See&Trek injects structured spatial–temporal priors into MLLMs through a single forward pass, leading to robust improvements across spatial reasoning tasks.
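A minimal, illustrative sketch of this idea in Python (the helper names and prompt wording below are hypothetical, not the repository's API) shows how keyframe selection and motion annotations can be combined into a single textual prior:

```python
# Illustrative sketch of the See&Trek prompting idea (hypothetical helpers, not the repo API):
# rank frames by semantic richness, attach visual-odometry motion cues, build one text prompt.
from typing import List, Tuple

def select_keyframes(object_counts: List[int], k: int = 8) -> List[int]:
    """Keep indices of the k frames with the most detected objects
    (a simple proxy for semantic richness)."""
    ranked = sorted(range(len(object_counts)), key=lambda i: object_counts[i], reverse=True)
    return sorted(ranked[:k])  # restore temporal order

def build_spatial_prompt(question: str, poses: List[Tuple[float, float, float]]) -> str:
    """Prefix the question with per-keyframe camera positions recovered by visual odometry."""
    motion = "\n".join(
        f"Frame {i}: camera at (x={x:.1f}, y={y:.1f}, z={z:.1f})"
        for i, (x, y, z) in enumerate(poses)
    )
    return f"Camera trajectory across the sampled keyframes:\n{motion}\n\nQuestion: {question}"

# Usage: the selected keyframes plus this prompt are passed to any MLLM
# in a single forward pass, with no fine-tuning or architecture change.
print(build_spatial_prompt("How many chairs are in the room?", [(0.0, 0.0, 0.0), (0.5, 0.1, 0.0)]))
```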
## 🌟 Highlights
- 🚫 Training-free & GPU-free spatial prompting
- 🔌 Plug-and-play for all open-source and commercial MLLMs
- ⚡ Single-forward inference with zero architecture changes
- 🎥 Semantic-rich keyframe selection + reconstructed motion cues
- 📈 Consistent gains on VSI-Bench and STI-Bench
## 📦 Installation
Follow the official VSI-Bench setup from thinking-in-space: https://github.com/vision-x-nyu/thinking-in-space
```bash
conda create --name vsibench python=3.10
conda activate vsibench

git clone git@github.com:vision-x-nyu/thinking-in-space.git
cd thinking-in-space
git submodule update --init --recursive

cd transformers && pip install -e . && cd ..
pip install -e .

pip install s2wrapper@git+https://github.com/bfshi/scaling_on_scales
pip install deepspeed
```
Core dependencies:

```
transformers==4.52.0
lmms_eval==0.2.3
torch==2.6.0
torchvision==0.21.0
```
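To confirm the environment resolved to these pins, a quick check (a small sketch, assuming the packages above are installed in the active environment) is:

```python
# Sanity-check the pinned core dependencies against what is actually installed.
from importlib.metadata import version, PackageNotFoundError

for pkg, expected in {"transformers": "4.52.0",
                      "torch": "2.6.0",
                      "torchvision": "0.21.0"}.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        installed = "not installed"
    flag = "OK" if installed == expected else f"expected {expected}"
    print(f"{pkg}: {installed} ({flag})")
```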
Install Ultralytics YOLO:

```bash
pip install ultralytics
```
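The prior-semantics step relies on a YOLO detector (the default output directory name suggests the yolov8n weights). A minimal standalone check that the install works, using a placeholder image path, might look like:

```python
# Minimal check that the YOLO detector used for prior semantics loads and runs.
# "frame.jpg" is a placeholder path; substitute any local image or video frame.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")    # downloads the nano weights on first use
results = model("frame.jpg")  # run detection on a single frame
for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    print(cls_name, float(box.conf))
```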
## 🚀 Running See&Trek
### 1️⃣ Generate Prior Semantics
Make sure the dataset paths (e.g., `directory`, `save_dir`) in `tools/multi-proc-video-skip.py` are configured correctly, as illustrated below.
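As an illustration, the relevant assignments inside the script could look like the following (placeholder paths; only the variable names come from the script):

```python
# Example path configuration inside tools/multi-proc-video-skip.py
# (placeholder values; point these at your local benchmark videos and an output folder).
directory = "/path/to/vsibench/videos"
save_dir = "./tools/SeeTrek/modified_dataset_yolov8n_2hz_every_4_frames"
```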
Run:

```bash
python ./tools/multi-proc-video-skip.py
```
Outputs are saved to:

```
./tools/SeeTrek/modified_dataset_yolov8n_2hz_every_4_frames
```
We also provide preprocessed prior semantics on Google Drive: 🔗 https://drive.google.com/drive/folders/1X3ZpaO8H59H-OxoEturHi2AwWHpm8FMu?usp=drive_link
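If you prefer to fetch that folder from the command line, one option (assuming the `gdown` package; not part of this repo's requirements) is:

```bash
# Download the shared folder of preprocessed prior semantics into the current directory.
pip install gdown
gdown --folder "https://drive.google.com/drive/folders/1X3ZpaO8H59H-OxoEturHi2AwWHpm8FMu"
```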
### 2️⃣ Run Evaluation
Place model checkpoints (e.g., the InternVL3 series) into `checkpoint-local` and update the corresponding paths in `evaluate.sh`.
Then run:

```bash
bash evaluate.sh --model all --num_processes 2
```
`--num_processes` controls how many GPUs are used.
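For example, a hypothetical single-model run on two GPUs (the accepted `--model` names are defined in `evaluate.sh`; `internvl3-8b` below is only a placeholder) could be:

```bash
# "internvl3-8b" is a placeholder; use a model key defined in evaluate.sh,
# or "all" to evaluate every configured checkpoint.
bash evaluate.sh --model internvl3-8b --num_processes 2
```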
## 📚 Citation
If you find See&Trek useful, please cite:
```bibtex
@article{li2025see,
  title   = {See\&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model},
  author  = {Li, Pengteng and Song, Pinhao and Li, Wuyang and Guo, Weiyu and Yao, Huizai and Xu, Yijie and Liu, Dugang and Xiong, Hui},
  journal = {arXiv preprint arXiv:2509.16087},
  year    = {2025}
}
```
