VisuoThink
[Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics]: VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
<div align="center"> <a href="https://github.com/ekonwang/VisuoThink">💻 Code</a> | <a href="https://arxiv.org/abs/2504.09130">📃 Paper</a> | <a href="https://huggingface.co/papers/2504.09130">🤗 Hugging Face</a> </div>

Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
Quick Start
- Install the dependencies:
```bash
conda create -n visuothink python==3.10 -y
conda activate visuothink
pip install -r requirements.txt
```
- Set up the `config.py` file under `visual-navigation/` as follows:
```python
import os

# set up the agent max reasoning steps
MAX_REPLY = 10

os.environ["AUTOGEN_USE_DOCKER"] = "False"

MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o")
API_KEY = os.environ.get("OPENAI_API_KEY")

llm_config = {"cache_seed": None, "config_list": [{"model": MODEL_NAME, "temperature": 0.0, "api_key": API_KEY}]}
```
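Since `config.py` reads `MODEL_NAME` and `OPENAI_API_KEY` from the environment, a typical session starts by exporting them before launching any task. A minimal sketch (the key below is a placeholder; any OpenAI-compatible model name should work):

```shell
# Export the variables that config.py reads via os.environ.get.
export MODEL_NAME="gpt-4o"          # config.py falls back to gpt-4o if unset
export OPENAI_API_KEY="sk-..."      # placeholder: substitute your real key
echo "Using model: $MODEL_NAME"
```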
- Run the visual navigation tasks:

  - To run visual navigation with VisuoThink, use the following command:

```bash
python visual-navigation/run_task_nav.py --verbose --visual
# alternatively, add the --tree_search flag to enable multimodal tree search:
# python visual-navigation/run_task_nav.py --verbose --visual --tree_search
```

  - To run visual navigation with CoT with Executor, use the following command:

```bash
python visual-navigation/run_task_nav.py --verbose
```

  - To run visual navigation with CoT, use the following command:

```bash
python visual-navigation/run_task_nav.py --verbose --visual --run_tag cot
```

- To run the geometry tasks with VisuoThink (w/o rollout search), use the following command:

```bash
python geometry/solver.py
```
Benchmarks
- (4.27 News) The Geomverse dataset has been released! See here for more details.
Citation
Please consider citing our paper and starring this repo if you find them helpful. Thank you!
```bibtex
@misc{wang2025visuothinkempoweringlvlmreasoning,
      title={VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search},
      author={Yikun Wang and Siyin Wang and Qinyuan Cheng and Zhaoye Fei and Liang Ding and Qipeng Guo and Dacheng Tao and Xipeng Qiu},
      year={2025},
      eprint={2504.09130},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.09130},
}
```