# VisPlay: Self-Evolving Vision-Language Models
Reinforcement learning (RL) provides a principled framework for improving vision-language models (VLMs) on complex reasoning tasks. However, existing RL approaches often depend on human-annotated labels or task-specific heuristics to define verifiable rewards, both costly and limited in scalability. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning capabilities from massive unlabeled image data. Starting from a single base VLM, VisPlay casts the model in two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), using diversity and difficulty rewards to balance the difficulty of generated questions against the quality of silver answers. VisPlay scales efficiently across two model families: trained on Qwen2.5-VL and MiMo-VL, it achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, establishing a scalable path toward self-evolving multimodal intelligence.
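The group-relative normalization at the core of GRPO can be sketched as follows. This is a minimal illustration of the general technique, not the repository's implementation; the function name and the example reward values are made up:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward against the mean and
    standard deviation of its own group of sampled responses."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    eps = 1e-8  # avoid division by zero when all rewards in a group are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled answers to one generated question,
# two rewarded and two not.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because each response is scored only relative to its own group, no learned value model is needed; above-average responses in a group get positive advantages and below-average ones get negative advantages.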
<p align="center"> <img src="./assets/Visplay.png" width="80%"> </p>

## Requirements
The codebase is adapted from R-Zero and Vision-SR1.
### Software Requirements

- Python 3.9+
- transformers==4.49.0
## Self-Evolving Setup

```bash
git clone https://github.com/bruno686/VisPlay.git
cd VisPlay
conda create -n VisPlay python=3.11
conda activate VisPlay  # activate the environment before installing
bash setup.sh

# Set an environment variable for your storage path in every main script.
# This is a large directory where checkpoints and generated data will be saved.
export STORAGE_PATH="/path/to/your/storage"
export HUGGINGFACENAME="yourhuggingfacename"

mkdir -p \
  "$STORAGE_PATH/evaluation" \
  "$STORAGE_PATH/models" \
  "$STORAGE_PATH/generated_question" \
  "$STORAGE_PATH/temp_results"
```
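Before launching training, it can help to confirm the storage layout exists. This is an optional sanity-check sketch, not part of the repository's scripts; the function name is made up:

```python
import os
import tempfile

EXPECTED = ["evaluation", "models", "generated_question", "temp_results"]

def missing_storage_dirs(root):
    """Return the expected subdirectories that do not exist under root."""
    return [d for d in EXPECTED if not os.path.isdir(os.path.join(root, d))]

# Demonstration with a throwaway directory standing in for $STORAGE_PATH.
root = tempfile.mkdtemp()
for d in EXPECTED:
    os.makedirs(os.path.join(root, d))
assert missing_storage_dirs(root) == []
```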
## Self-Play Training Scripts

```bash
bash scripts_Qwen-VL-3B/main.sh
bash scripts_Qwen-VL-7B/main.sh
bash scripts_MIMO-VL-7B/main.sh
```
## Evaluation & LLM-as-a-Judge Evaluation

We use ChatGLM-flash as the judge. Different LLM judges will produce different evaluation results. For reference, we also compute rule-based evaluation accuracies, which are lower than LLM-as-a-judge scores on the math datasets.
1. **Prepare the benchmarks.** All benchmarks we use are from zli12321/datasets; you can download them directly. Ensuring consistency in evaluation is crucial.
2. **Generate responses from the trained model.** We provide all historic model generations for quick reference and access to the results.

   ```bash
   bash validation_examples/eval_gen_questions.sh $experiment_name $your_model_path
   ```

   For example:

   ```bash
   bash validation_examples/eval_gen_questions.sh MIMO-VL-7B-solver_v3 /your_path/vr-zero/storage/models/MiMo-VL-7B-SFT_solver_v3/global_step_20/actor/huggingface
   ```
3. **Use the LLM judge to generate results.**

   ```bash
   bash Evaluation/eval.sh
   ```
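As a point of comparison for the judge, rule-based scoring can be approximated by normalized exact match. This is a simplified illustration of the general idea, not the repository's actual scorer; the function names are made up:

```python
import re

def normalize(ans: str) -> str:
    """Lowercase and strip non-alphanumerics so '  A. ' matches 'a'."""
    return re.sub(r"[^a-z0-9]", "", ans.lower())

def rule_based_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference
    after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

acc = rule_based_accuracy(["The answer is 4", "B"], ["4", "b"])  # → 0.5
```

The example also shows why rule-based scores tend to trail an LLM judge on math: a correct answer embedded in a sentence fails exact match even after normalization.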
## Notes

To facilitate further review of our experiments, we have made our WandB logs publicly available. Please note that these logs may be incomplete: server limitations forced us to upload them manually, so some iterations are missing and a few curves may be incorrect (occasionally runs with wrong parameters were uploaded before cleanup). We believe most curves are accurate and provide them for reference. We also recommend increasing the number of training iterations as much as possible, for example to 40 or more, to ensure adequate training; if iteration 1 fails to train effectively, iteration 2 may fall into a local minimum. Thank you again for your attention to our work!
To fully reproduce our results, the same benchmarks should be used.
## Citation

If you find our work helpful, please cite:
```bibtex
@misc{he2025visplay,
  title={VisPlay: Self-Evolving Vision-Language Models from Images},
  author={Yicheng He and Chengsong Huang and Zongxia Li and Jiaxin Huang and Yonghui Yang},
  year={2025},
  eprint={2511.15661},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.15661},
}
```
Our framework builds directly on the great work of Vision-SR1 and R-Zero, so we recommend also citing these source works:
```bibtex
@misc{li2025selfrewardingvisionlanguagemodelreasoning,
  title={Self-Rewarding Vision-Language Model via Reasoning Decomposition},
  author={Zongxia Li and Wenhao Yu and Chengsong Huang and Rui Liu and Zhenwen Liang and Fuxiao Liu and Jingxi Che and Dian Yu and Jordan Boyd-Graber and Haitao Mi and Dong Yu},
  year={2025},
  eprint={2508.19652},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.19652},
}

@misc{huang2025rzeroselfevolvingreasoningllm,
  title={R-Zero: Self-Evolving Reasoning LLM from Zero Data},
  author={Chengsong Huang and Wenhao Yu and Xiaoyang Wang and Hongming Zhang and Zongxia Li and Ruosen Li and Jiaxin Huang and Haitao Mi and Dong Yu},
  year={2025},
  eprint={2508.05004},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.05004},
}
```