PAPO

Official repo for "PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning"


<div align="center">

PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning (ICLR 2026)

</div> <div align="center">

Project Page · arXiv · GitHub · Hugging Face

</div>

PAPO is a novel policy gradient algorithm that enhances multimodal reasoning through visually grounded optimization. It serves as a direct drop-in replacement for GRPO or DAPO without any additional assumptions.

🔥 News

  • Jan 2026: PAPO was accepted to ICLR 2026
  • August 2025: Released PAPO-D (DAPO) models and code
  • July 2025: Released PAPO-G (GRPO) models and code

🌟 Key Highlights

  • 4.4%-17.5% overall improvement on diverse multimodal benchmarks
  • 8.0%-19.1% improvement on tasks with high vision dependency
  • 30.5% reduction in perception errors
  • No additional data or external reward models required
  • Serves as a direct drop-in replacement for GRPO and DAPO

📖 Methodology

Perception Bottleneck

We identified that 67% of errors in current multimodal reasoning models stem from poor perception rather than logical reasoning failures.

<div align="center"> <img src="./static/images/teaser.png" alt="PAPO Overview" width="800"/> </div>

PAPO Algorithm

PAPO extends GRPO/DAPO by adding an Implicit Perception Loss that maximizes the KL divergence between model outputs on original vs. corrupted (masked) images:

<div align="center"> <img src="./static/images/method.png" alt="PAPO Method" width="940"/> </div>

The core intuition is that a well-behaved multimodal model should produce significantly different outputs when visual information is corrupted, indicating reliance on meaningful visual content. To further enhance training stability, we introduce Double Entropy Loss, an effective regularizer that prevents model collapse while preserving performance.
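To make the two ingredients concrete, here is a minimal, framework-free sketch. All names, the patch size, and the masking ratio are illustrative assumptions, not the repository's implementation: random patch masking produces the corrupted image, and the implicit perception loss is the negative mean per-token KL between the policy's output distributions on the original vs. masked inputs (minimizing the loss maximizes the KL).

```python
import math
import random

def mask_patches(image, patch=4, ratio=0.5, seed=0):
    """Toy corruption: zero out a random subset of patch x patch blocks
    of a 2-D image (given as a list of pixel rows)."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    blocks = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    for i, j in rng.sample(blocks, int(len(blocks) * ratio)):
        for r in range(i, min(i + patch, h)):
            for c in range(j, min(j + patch, w)):
                out[r][c] = 0
    return out

def implicit_perception_loss(probs_orig, probs_masked):
    """Negative mean per-token KL( pi(.|original) || pi(.|masked) ).
    With identical distributions the loss is 0; the more the masked-input
    predictions diverge from the originals, the more negative the loss."""
    kl = 0.0
    for p, q in zip(probs_orig, probs_masked):
        kl += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return -kl / len(probs_orig)
```

In the actual training loop these per-token distributions come from two forward passes of the policy (one on the original image, one on the masked one); the sketch above only shows the arithmetic of the objective.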

<div align="center"> <img src="./static/images/method_objective.png" alt="PAPO Objective" width="940"/> </div>

Main Results

PAPO consistently outperforms GRPO/DAPO across diverse benchmarks, with particularly pronounced improvements on vision-dependent tasks:

<div align="center"> <img src="./static/images/main_results.png" alt="Main Results" width="1200"/> </div>

📊 Data

We adapt multiple multimodal reasoning benchmarks to construct our training and evaluation datasets.

Training Data

Evaluation Data

We adapted 8 different multimodal reasoning benchmarks to evaluate PAPO, which we further divide into two groups: General Multimodal Reasoning and Vision-Dependent Multimodal Reasoning. All evaluation benchmarks can be found at https://huggingface.co/datasets/PAPO-Galaxy/PAPO_eval. For MathVista and MathVerse, we filter out instances with free-form answers to ensure verifiable evaluation and to avoid relying on LLM-as-a-judge.

- **General Reasoning**
  - `hiyouga/geometry3k`: [Hugging Face Dataset](https://huggingface.co/datasets/hiyouga/geometry3k), [Data Source](https://github.com/lupantech/InterGPS)
  - `AI4Math/MathVista`: [Hugging Face Dataset](https://huggingface.co/datasets/AI4Math/MathVista)
  - `We-Math/We-Math`: [Hugging Face Dataset](https://huggingface.co/datasets/We-Math/We-Math)
  - `FanqingM/MMK12`: [Hugging Face Dataset](https://huggingface.co/datasets/FanqingM/MMK12)
  - `AI4Math/MathVerse`: [Hugging Face Dataset](https://huggingface.co/datasets/AI4Math/MathVerse)
- **Vision-Dependent Reasoning**
  - `lscpku/LogicVista`: [Hugging Face Dataset](https://huggingface.co/datasets/lscpku/LogicVista)
  - `BUAADreamer/clevr_count_70k`: [Hugging Face Dataset](https://huggingface.co/datasets/BUAADreamer/clevr_count_70k)
  - `MMMU/MMMU_Pro`: [Hugging Face Dataset](https://huggingface.co/datasets/MMMU/MMMU_Pro)
  - `MathVerse_V` (vision-dependent subset): Adapted from [AI4Math/MathVerse](https://huggingface.co/datasets/AI4Math/MathVerse)

All results in the paper are average accuracy@8 (8 repeated samples per question) with the temperature set to 1.0.
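The accuracy@8 aggregation can be computed as in the following sketch (a hypothetical helper, not part of the released code): score each question by the fraction of its 8 sampled responses that are correct, then average over questions.

```python
def avg_accuracy_at_k(correct):
    """correct[i][j] is 1 if the j-th sampled response to question i is
    correct, else 0 (k samples per question; k = 8 in the paper)."""
    per_question = [sum(row) / len(row) for row in correct]
    return sum(per_question) / len(per_question)

# Two questions, 8 samples each: 4/8 and 8/8 correct -> 0.75
print(avg_accuracy_at_k([[1, 1, 0, 0, 1, 1, 0, 0], [1] * 8]))
```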

🚀 Quick Start (Qwen2.5-VL)

Update Support for Qwen3-VL

Please refer to the main_qwen3 branch for instructions on running PAPO with Qwen3-VL.

Environment Setup

Option 1: All-in-one Installation Script

conda create -n papo python=3.10
conda activate papo

cd PAPO
bash scripts/install.sh

Option 2: Using pip

pip install -e .

Training

The main training pipeline is adapted from EasyR1. We support training with different configurations for both Qwen2.5-VL 3B and 7B models:

  • Qwen2.5-VL 3B: We typically use 2× 80GB H100 GPUs
  • Qwen2.5-VL 7B: We typically use 4× 80GB H100 GPUs

GRPO Baseline

# 3B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_3b_grpo.sh

# 7B model  
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_7b_grpo.sh

DAPO Baseline

# 3B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_3b_dapo.sh

# 7B model  
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_7b_dapo.sh

PAPO-G (Config for Table 1 Results)

# 3B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_3b_grpo_papo.sh

# 7B model  
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_7b_grpo_papo.sh

PAPO-D (Config for Table 1 Results)

# 3B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_3b_dapo_papo.sh

# 7B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_7b_dapo_papo.sh

PAPO-G + No Reference KL (Config for Table 7 Results)

# 3B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_3b_grpo_papo_no_kl_ref.sh

# 7B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_7b_grpo_papo_no_kl_ref.sh

Pretrained Checkpoints

A collection of 3B/7B checkpoints pretrained on ViRL39K can be downloaded from here. The checkpoints follow the Qwen2.5-VL Hugging Face format and can be used for inference as drop-in replacements for https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct. All checkpoints correspond to the last training step.

  • PAPO-GRPO model collection: PAPO-G
    • PAPO-G 3B model in Table 1: https://huggingface.co/PAPOGalaxy/PAPO-G-H-Qwen2.5-VL-3B
    • PAPO-G 7B model in Table 1: https://huggingface.co/PAPOGalaxy/PAPO-G-H-Qwen2.5-VL-7B
  • PAPO-DAPO model collection: PAPO-D
    • PAPO-D 3B model in Table 1: https://huggingface.co/PAPOGalaxy/PAPO-D-Qwen2.5-VL-3B
    • PAPO-D 7B model in Table 1: https://huggingface.co/PAPOGalaxy/PAPO-D-Qwen2.5-VL-7B

Performance Evaluation

To run model inference and evaluation, we provide the evaluation submodule located at PAPO/PAPO-Eval. Detailed instructions for running inference and evaluation can be found in PAPO-Eval.

# Navigate to PAPO evaluation submodule
cd PAPO-Eval

# Data preprocessing
bash papo_eval/preprocess/preprocess.sh

# Run model inference
bash papo_eval/run_infer.sh

# Run model evaluation
bash papo_eval/run_eval.sh

Additional Implementation Notes on Entropy Losses

In theory, when enabling the double entropy loss (adding aug_entropy_loss during update_policy in workers/actor/dp_actor.py), we need an additional forward pass on the masked sequence to recompute aug_log_probs. In practice, we find that skipping this additional forward pass does not significantly affect performance. Thus, the current implementation skips the recomputation by default, which still empirically brings a slight improvement over the single entropy loss. A detailed discussion can be found at https://github.com/MikeWangWZHL/PAPO/issues/20. We also provide a switch, RECOMPUTE_AUG_LOG_PROBS in workers/actor/dp_actor.py, to turn this recomputation on or off if one requires the explicit impact of aug_log_probs on the gradients (note that enabling it will slow down training due to the additional forward pass).
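The arithmetic of the two entropy terms can be sketched as follows (a pure-Python illustration with assumed names; the actual loss lives in workers/actor/dp_actor.py and operates on logits). When recomputation is disabled, the distributions for the masked input are reused from the rollout rather than recomputed with a fresh forward pass; the sign and weighting of the regularizer in the full objective are omitted here.

```python
import math

def token_entropy(p):
    """Shannon entropy of a single token's probability distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def double_entropy_loss(probs_orig, probs_masked):
    """Sketch of the double entropy regularizer: mean token entropy on the
    original input plus mean token entropy on the masked input."""
    h_orig = sum(token_entropy(p) for p in probs_orig) / len(probs_orig)
    h_aug = sum(token_entropy(p) for p in probs_masked) / len(probs_masked)
    return h_orig + h_aug
```

Recomputing probs_masked with a fresh forward pass makes its gradient flow through the current policy; reusing rollout values treats that term as (mostly) constant, which is the cheaper default described above.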

🥰 Acknowledgements

We thank the EasyR1 team for providing the foundational codebase that we adapted to implement PAPO. Our implementation builds upon their efficient RLVR framework and extends it with perception-aware optimization methodologies. We also acknowledge the open-source community for providing the datasets and evaluation benchmarks that made this research possible.

📝 Citation

@article{wang2025perception,
  title={Perception-Aware Policy Optimization for Multimodal Reasoning},
  author={Wa