# Mulberry

[NeurIPS'25 Spotlight] Mulberry, an o1-like Reasoning and Reflection MLLM Implemented via Collective MCTS
<a href='https://arxiv.org/abs/2412.18319'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/HuanjinYao/Mulberry_llava_8b'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/datasets/HuanjinYao/Mulberry-SFT'><img src='https://img.shields.io/badge/Dataset-Huggingface-yellow'></a>
Huanjin Yao<sup>2,3*</sup>, Jiaxing Huang<sup>1*✉️</sup>, Wenhao Wu<sup>3,5</sup>, Jingyi Zhang<sup>1</sup>, Yibo Wang<sup>2</sup>, Shunyu Liu<sup>1</sup>, Yingjie Wang<sup>1</sup>,
Yuxin Song<sup>3</sup>, Haocheng Feng<sup>3</sup>, Li Shen<sup>4</sup>, Dacheng Tao<sup>1</sup>
<sup>1</sup>Nanyang Technological University, <sup>2</sup>Tsinghua University, <sup>3</sup>Baidu, <sup>4</sup>SYSU, <sup>5</sup>Amazon AGI
<sup>*</sup>Equal Contribution, <sup>✉️</sup>Corresponding Author
## News
- [x] Sep 19, 2025. Mulberry has been accepted at NeurIPS 2025 as a spotlight! 🎉
- [x] Feb 5, 2025. We release the evaluation code for Mulberry_llama_11b and Mulberry_qwen2vl_7b.
- [x] Feb 4, 2025. We release the Mulberry_llama_11b and Mulberry_qwen2vl_7b models and their reasoning inference code.
- [x] Jan 26, 2025. We release the Mulberry-260K step-by-step reasoning SFT data and training code.
- [x] Jan 14, 2025. We release the instructions and code for evaluating Mulberry-LLaVA-8B on different benchmarks through the VLMEvalKit tool.
- [x] Jan 08, 2025. We release the CoMCTS code for searching step-by-step reasoning and reflection data, along with the Mulberry-LLaVA-8B model and its reasoning inference code.
- [x] Dec 24, 2024. We release our paper on arXiv.
## Reasoning Inference
We provide the inference code for running Mulberry models, which can output detailed step-by-step reasoning.
```bash
python infer.py \
    --model 'Mulberry_llava_8b' \
    --model_path 'HuanjinYao/Mulberry_llava_8b' \
    --question 'Question: <Your_Question>' \
    --img_path '<Your_Img_Path>'
```
<details>
<summary>You can also run the following command if you only require the final answer.</summary>

```bash
python infer.py \
    --model 'Mulberry_llava_8b' \
    --model_path 'HuanjinYao/Mulberry_llava_8b' \
    --question 'Question: <Your_Question>' \
    --img_path '<Your_Img_Path>' \
    --only_output_final_answer
```

</details>
## Data Construction with CoMCTS

We release the CoMCTS code for generating reasoning and reflection data. CoMCTS leverages collective knowledge from multiple models to collaboratively conjecture, search, and identify effective reasoning paths toward correct answers via four iterative operations: Expansion, Simulation and Error Positioning, Backpropagation, and Selection.
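For intuition, one round of the four operations can be sketched as below. This is an illustrative toy stand-in for the released CoMCTS code: the "models" are placeholder callables and `simulate` is a dummy scorer, where the real implementation uses several MLLMs and answer-checking.

```python
import math

# Illustrative sketch of one CoMCTS round (not the official implementation).
# Each "model" is a stand-in callable that proposes a next reasoning step;
# in the paper these would be several MLLMs (e.g. GPT-4o, Qwen2-VL, LLaMA-Vision).

class Node:
    def __init__(self, step, parent=None):
        self.step = step          # text of the reasoning path so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    # Selection criterion: balance exploitation (mean value) and exploration.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def comcts_round(root, models, simulate):
    # 1) Selection: descend from the root to a leaf via UCB.
    node = root
    while node.children:
        node = max(node.children, key=ucb)
    # 2) Expansion: each model in the collective proposes a candidate next step.
    for model in models:
        node.children.append(Node(model(node.step), parent=node))
    # 3) Simulation & error positioning: score each candidate; steps judged
    #    erroneous receive a low reward (and would be pruned in practice).
    for child in node.children:
        reward = simulate(child.step)
        # 4) Backpropagation: propagate the reward up to the root.
        n = child
        while n is not None:
            n.visits += 1
            n.value += reward
            n = n.parent
    return root
```

Repeating `comcts_round` grows the tree, and the highest-value root-to-leaf path is kept as a searched reasoning trace.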
Please refer here for more details.
After searching, you can use the code we provide to construct reasoning and reflection data. `reflection_data_percentage` controls the proportion of reflection data.
```bash
python data_construction.py \
    --models gpt-4o qwen2_vl_7b qwen2_vl_72b llama_vision_11b \
    --output_path <Your_output_path>/mulberry_data.json \
    --data_path <CoMCTS_search_data_path> \
    --reflection_data_percentage 0.1
```
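As a rough mental model of the proportion control, reflection-style samples can be subsampled so they make up about the requested fraction of the final SFT set. The helper below is a hypothetical sketch, not the actual `data_construction.py` logic:

```python
import random

def mix_reflection(reasoning_samples, reflection_samples, reflection_pct, seed=0):
    # Hypothetical sketch of how a reflection_data_percentage of 0.1 could be
    # applied: keep all reasoning samples, then draw just enough reflection
    # samples so they form ~reflection_pct of the combined dataset.
    rng = random.Random(seed)
    k = round(reflection_pct / (1.0 - reflection_pct) * len(reasoning_samples))
    k = min(k, len(reflection_samples))
    mixed = reasoning_samples + rng.sample(reflection_samples, k)
    rng.shuffle(mixed)
    return mixed
```

For example, with 90 reasoning samples and `reflection_pct=0.1`, 10 reflection samples are kept, giving a 100-sample set that is 10% reflection data.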
## Training

We use LLaMA-Factory to fine-tune the Mulberry models. We provide the training instructions and configs here.

First, install LLaMA-Factory according to the official instructions.

Then, refer here and add the following customized dataset entry to `dataset_info.json` in LLaMA-Factory.
```json
"mulberry": {
  "file_name": "./mulberry_sft.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant"
  }
},
```
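This entry tells LLaMA-Factory that `mulberry_sft.json` is a JSON list of ShareGPT-style records whose keys match the declared columns and tags. A minimal illustrative record is shown below; the field values are placeholders, not actual Mulberry-260K content:

```python
import json

# Placeholder ShareGPT-style record matching the "mulberry" entry:
# "messages"/"images" mirror the declared columns, and each message uses the
# "role"/"content" tags with the "user"/"assistant" role values.
record = {
    "messages": [
        {"role": "user", "content": "<image>Question: <Your_Question>"},
        {"role": "assistant", "content": "<step-by-step reasoning and final answer>"},
    ],
    "images": ["<Your_Img_Path>"],
}

# mulberry_sft.json holds a list of such records.
print(json.dumps([record], indent=2))
```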
Finally, you can use the following command to train the models.

```bash
llamafactory-cli train examples/train_full/mulberry_llava_8b_full_sft.yaml
```
## Evaluation

We use VLMEvalKit to evaluate the Mulberry models on different benchmarks. We provide the evaluation instructions and key code here.

First, install VLMEvalKit according to the official instructions and replace `image_vqa.py` with ours in here.

Next, replace the `llava.py` file in `VLMEvalKit-main/vlmeval/vlm/llava/` with the `llava.py` file we provide here.
Finally, you can use the following command to perform the evaluation.

```bash
python run.py --data MathVista_MINI --model llava_next_llama3 --verbose
```
## Main Results

We conduct extensive experiments with four powerful baseline models: LLaVA-NeXT-8B, LLaMA-3.2-11B-Vision-Instruct, Qwen2-VL-2B, and Qwen2-VL-7B. The main results comparing the Mulberry models with other state-of-the-art models across several popular benchmarks are shown in the figure below.
<div align=center> <img width="650" alt="image" src="figure/main_results.png"> </div>

## Qualitative Results
Mulberry creates rich, explicit and well-defined reasoning steps with comprehensive understanding, ultimately arriving at the correct answer.
<div align=center> <img width="700" alt="image" src="figure/qualitative_results_reasoning.png"> </div>

## Citation
If you find this repository useful, please star 🌟 this repo and cite 🖇️ our paper.
```bibtex
@article{yao2024mulberry,
  title={Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search},
  author={Yao, Huanjin and Huang, Jiaxing and Wu, Wenhao and Zhang, Jingyi and Wang, Yibo and Liu, Shunyu and Wang, Yingjie and Song, Yuxin and Feng, Haocheng and Shen, Li and others},
  journal={arXiv preprint arXiv:2412.18319},
  year={2024}
}
```
## Acknowledgment

Our work is primarily built on the following codebases. We are sincerely grateful for their work.

- LLaMA-Factory: We use LLaMA-Factory to fine-tune the Mulberry models.
- VLMEvalKit: We use VLMEvalKit for evaluation.
## Limitations

Mulberry is a preliminary exploration of o1-like MLLMs, leveraging Collective Monte Carlo Tree Search (CoMCTS) to enable effective and efficient reasoning-path search and learning. CoMCTS leverages collective knowledge to significantly improve the success rate and efficiency of reasoning-path search. By training on the reasoning data generated through CoMCTS, Mulberry gains step-by-step reasoning capabilities, leading to a substantial improvement in overall performance. Nevertheless, certain limitations must be acknowledged.

**Hallucinations in intermediate steps:** Hallucinations remain prevalent in MLLMs, whether closed- or open-source. For instance, in CoMCTS the models may generate obvious errors in intermediate reasoning steps yet still arrive at the correct final answer. Therefore, although we incorporate multiple models to better detect errors, some errors still persist in the intermediate steps.