# VisPlay: Self-Evolving Vision-Language Models
Reinforcement learning (RL) provides a principled framework for improving vision-language models (VLMs) on complex reasoning tasks. However, existing RL approaches often depend on human-annotated labels or task-specific heuristics to define verifiable rewards, both costly and limited in scalability. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning capabilities from massive unlabeled image data. Starting from a single base VLM, VisPlay casts the model in two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), using diversity and difficulty rewards to balance the difficulty of generated questions against the quality of silver answers. VisPlay scales efficiently across two model families: trained on Qwen2.5-VL and MiMo-VL, it achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, establishing a scalable path toward self-evolving multimodal intelligence.
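The group-relative normalization at the core of GRPO can be sketched as follows. This is a minimal illustration of the general technique, not the repository's implementation; the function name and the example reward values are made up:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward against the mean and
    standard deviation of its own group of sampled responses."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    eps = 1e-8  # avoid division by zero when all rewards in a group are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled answers to one generated question,
# two rewarded and two not.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because each response is scored only relative to its own group, no learned value model is needed; above-average responses in a group get positive advantages and below-average ones get negative advantages.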
<p align="center"> <img src="./assets/Visplay.png" width="80%"> </p>

## Requirements
The codebase is adapted from R-Zero and Vision-SR1.
### Software Requirements

- Python 3.9+
- transformers==4.49.0
## Self-Evolving Setup

```bash
git clone https://github.com/bruno686/VisPlay.git
cd VisPlay
conda create -n VisPlay python=3.11
conda activate VisPlay  # activate the environment before installing
bash setup.sh

# Set an environment variable for your storage path in every main script.
# This is a large directory where checkpoints and generated data will be saved.
export STORAGE_PATH="/path/to/your/storage"
export HUGGINGFACENAME="yourhuggingfacename"

mkdir -p \
  "$STORAGE_PATH/evaluation" \
  "$STORAGE_PATH/models" \
  "$STORAGE_PATH/generated_question" \
  "$STORAGE_PATH/temp_results"
```
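Before launching training, it can help to confirm the storage layout exists. This is an optional sanity-check sketch, not part of the repository's scripts; the function name is made up:

```python
import os
import tempfile

EXPECTED = ["evaluation", "models", "generated_question", "temp_results"]

def missing_storage_dirs(root):
    """Return the expected subdirectories that do not exist under root."""
    return [d for d in EXPECTED if not os.path.isdir(os.path.join(root, d))]

# Demonstration with a throwaway directory standing in for $STORAGE_PATH.
root = tempfile.mkdtemp()
for d in EXPECTED:
    os.makedirs(os.path.join(root, d))
assert missing_storage_dirs(root) == []
```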
## Self-Play Training Scripts

```bash
bash scripts_Qwen-VL-3B/main.sh
bash scripts_Qwen-VL-7B/main.sh
bash scripts_MIMO-VL-7B/main.sh
```
## Evaluation & LLM-as-a-Judge Evaluation

We use ChatGLM-flash as the judge. Different LLM judges will produce different evaluation results. For reference, we also compute rule-based evaluation accuracies, which are lower than LLM-as-a-judge scores on the math datasets.
1. **Prepare the benchmarks.** All benchmarks we use are from zli12321/datasets; you can download them directly. Ensuring consistency in evaluation is crucial.
2. **Generate responses from the trained model.** We provide all historic model generations for quick reference and access to the results.

   ```bash
   bash validation_examples/eval_gen_questions.sh $experiment_name $your_model_path
   ```

   For example:

   ```bash
   bash validation_examples/eval_gen_questions.sh MIMO-VL-7B-solver_v3 /your_path/vr-zero/storage/models/MiMo-VL-7B-SFT_solver_v3/global_step_20/actor/huggingface
   ```
3. **Use the LLM judge to generate results.**

   ```bash
   bash Evaluation/eval.sh
   ```
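As a point of comparison for the judge, rule-based scoring can be approximated by normalized exact match. This is a simplified illustration of the general idea, not the repository's actual scorer; the function names are made up:

```python
import re

def normalize(ans: str) -> str:
    """Lowercase and strip non-alphanumerics so '  A. ' matches 'a'."""
    return re.sub(r"[^a-z0-9]", "", ans.lower())

def rule_based_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference
    after normalization."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

acc = rule_based_accuracy(["The answer is 4", "B"], ["4", "b"])  # → 0.5
```

The example also shows why rule-based scores tend to trail an LLM judge on math: a correct answer embedded in a sentence fails exact match even after normalization.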
## Notes

To facilitate further review of our experiments, we have made our WandB logs publicly available. Please note that these logs may be incomplete: server limitations forced us to upload them manually, so some iterations are missing and a few curves may be incorrect (occasionally runs with wrong parameters were uploaded before cleanup). We believe most curves are accurate and provide them for reference. We also recommend increasing the number of training iterations as much as possible, for example to 40 or more, to ensure adequate training; if iteration 1 fails to train effectively, iteration 2 may fall into a local minimum. Thank you again for your attention to our work!
To fully reproduce our results, the same benchmarks should be used.
## Citation

If you find our work helpful, please cite:
```bibtex
@misc{he2025visplay,
  title={VisPlay: Self-Evolving Vision-Language Models from Images},
  author={Yicheng He and Chengsong Huang and Zongxia Li and Jiaxin Huang and Yonghui Yang},
  year={2025},
  eprint={2511.15661},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.15661},
}
```
Our framework builds directly on the great work of Vision-SR1 and R-Zero, so we recommend also citing these source works:
```bibtex
@misc{li2025selfrewardingvisionlanguagemodelreasoning,
  title={Self-Rewarding Vision-Language Model via Reasoning Decomposition},
  author={Zongxia Li and Wenhao Yu and Chengsong Huang and Rui Liu and Zhenwen Liang and Fuxiao Liu and Jingxi Che and Dian Yu and Jordan Boyd-Graber and Haitao Mi and Dong Yu},
  year={2025},
  eprint={2508.19652},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.19652},
}

@misc{huang2025rzeroselfevolvingreasoningllm,
  title={R-Zero: Self-Evolving Reasoning LLM from Zero Data},
  author={Chengsong Huang and Wenhao Yu and Xiaoyang Wang and Hongming Zhang and Zongxia Li and Ruosen Li and Jiaxin Huang and Haitao Mi and Dong Yu},
  year={2025},
  eprint={2508.05004},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.05004},
}
```