VisuoThink
[Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics]: VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
<div align="center"> <a href="https://github.com/ekonwang/VisuoThink">💻 Code</a> | <a href="https://arxiv.org/abs/2504.09130">📃 Paper</a> | <a href="https://huggingface.co/papers/2504.09130">🤗 Hugging Face</a> </div>

Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
Quick Start
- Install the dependencies:
```bash
conda create -n visuothink python==3.10 -y
conda activate visuothink
pip install -r requirements.txt
```
- Set up the `config.py` file under `visual-navigation/` as follows:
```python
import os

# set up the agent max reasoning steps
MAX_REPLY = 10

os.environ["AUTOGEN_USE_DOCKER"] = "False"

MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o")
API_KEY = os.environ.get("OPENAI_API_KEY")

llm_config = {"cache_seed": None, "config_list": [{"model": MODEL_NAME, "temperature": 0.0, "api_key": API_KEY}]}
```
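Since `config.py` reads `MODEL_NAME` and `OPENAI_API_KEY` from the environment, a typical session starts by exporting them before launching any task. A minimal sketch (the key below is a placeholder; any OpenAI-compatible model name should work):

```shell
# Export the variables that config.py reads via os.environ.get.
export MODEL_NAME="gpt-4o"          # config.py falls back to gpt-4o if unset
export OPENAI_API_KEY="sk-..."      # placeholder: substitute your real key
echo "Using model: $MODEL_NAME"
```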
- Run the visual navigation tasks:

  - To run visual navigation with VisuoThink, use the following command:

```bash
python visual-navigation/run_task_nav.py --verbose --visual
# alternatively, add the --tree_search flag to enable multimodal tree search:
# python visual-navigation/run_task_nav.py --verbose --visual --tree_search
```

  - To run visual navigation with CoT with Executor, use the following command:

```bash
python visual-navigation/run_task_nav.py --verbose
```

  - To run visual navigation with CoT, use the following command:

```bash
python visual-navigation/run_task_nav.py --verbose --visual --run_tag cot
```

- To run the geometry tasks with VisuoThink (w/o rollout search), use the following command:

```bash
python geometry/solver.py
```
Benchmarks
- (4.27 News) The Geomverse dataset has been released! See here for more details.
Citation
Please consider citing our paper and starring this repo if you find them helpful. Thank you!
```bibtex
@misc{wang2025visuothinkempoweringlvlmreasoning,
      title={VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search},
      author={Yikun Wang and Siyin Wang and Qinyuan Cheng and Zhaoye Fei and Liang Ding and Qipeng Guo and Dacheng Tao and Xipeng Qiu},
      year={2025},
      eprint={2504.09130},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.09130},
}
```