
DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

This repository holds the code and data of DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning.

  • Update on Jun 4, 2025: released code and paper
  • Update on Jun 9, 2025: DreamPRM (o4-mini) has been added to the top of the MathVista Leaderboard (testmini) with 85.2% accuracy!
  • Update on Jun 10, 2025: updated instructions for extending DreamPRM to o4-mini

DreamPRM tackles the dataset quality imbalance and distribution shift that plague multimodal reasoning by domain-reweighting.
It jointly learns (i) a high-fidelity Process Reward Model (PRM) and (ii) optimal domain weights through a bi-level optimisation (BLO) loop, delivering a consistent +4 pp average gain on five public benchmarks.

Table of Contents

  1. Example
  2. Method Overview
  3. Quick Start
  4. Customize Your Datasets
  5. Extend DreamPRM to o4-mini (new)
  6. Acknowledgement
  7. License
  8. Citation

Example

DreamPRM improves multimodal reasoning by mitigating the dataset quality imbalance problem. Left: on five benchmarks, DreamPRM outperforms the base model (InternVL-2.5-8B-MPO) by an average of +4.0% and consistently surpasses a vanilla PRM trained without data selection. Right: easy AI2D questions (weight 0.55) vs. hard M3CoT questions (weight 1.49) show how DreamPRM prioritizes data that demand deeper reasoning - samples requiring knowledge from both textual and visual modalities for step-by-step logical deduction.

DreamPRM significantly outperforms o4-mini in pass@1 accuracy (with temperature fixed at 1.0, following OpenAI API defaults), achieving a 4.6% absolute improvement. It also surpasses the widely used self-consistency (Consensus) method based on majority voting for reasoning chain selection.
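To make the comparison concrete, here is a hedged sketch (ours, not the repo's API) of the two selection strategies: PRM-based best-of-N ranks candidate chains by their aggregated step scores, while self-consistency takes a majority vote over final answers. prm_step_scores and final_answer are hypothetical callables.

from collections import Counter

def select_with_prm(chains, prm_step_scores):
    # Best-of-N: rank each candidate chain by the mean of its per-step
    # PRM scores and keep the highest-scoring one.
    def chain_score(chain):
        scores = prm_step_scores(chain)
        return sum(scores) / len(scores)
    return max(chains, key=chain_score)

def select_with_consensus(chains, final_answer):
    # Self-consistency: majority vote over the chains' final answers.
    votes = Counter(final_answer(c) for c in chains)
    return votes.most_common(1)[0][0]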

Method Overview

Method flowchart


General flow of training a PRM and using it for inference. Training phase: train the PRM with Monte Carlo signals from intermediate steps of Chains-of-Thought (CoTs). Inference phase: use the trained PRM to verify CoTs step by step and select the best CoT. Conventional PRM training generalizes poorly because of the distribution shift between the training set and the test set.
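As a rough illustration of the Monte Carlo labelling idea (a hedged sketch, not the repo's code): truncate a chain after a step, sample several completions, and record the fraction that reach the correct answer. The score, times, and accuracy fields in the training data format later in this README follow this pattern. sample_completion and extract_answer are hypothetical helpers.

def mc_step_accuracy(prefix, ground_truth, sample_completion, extract_answer, times=11):
    # Roll out `times` completions from the partial chain and count how
    # many reach the correct final answer.
    score = sum(
        extract_answer(sample_completion(prefix)) == ground_truth
        for _ in range(times)
    )
    return score, times, score / times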

DreamPRM Overview

The proposed bi-level optimization based domain-reweighting method. Lower-level optimization: In this stage, PRM’s parameters are updated on multiple datasets with domain weights, allowing the PRM to prioritize domains with better quality. Upper-level optimization: In this stage, the PRM is evaluated on a separate meta dataset to compute an aggregation function loss and optimize the domain weights.
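The sketch below illustrates this loop in plain PyTorch with a one-step unroll. It is an illustrative stand-in, not the actual implementation (the repository builds on the betty library), and the toy prm_loss, data, and learning rates are all assumptions.

import torch

torch.manual_seed(0)
K = 3                                           # number of training domains
phi = torch.randn(8, requires_grad=True)        # stand-in for PRM parameters
log_alpha = torch.zeros(K, requires_grad=True)  # domain weights, log-space

# Toy per-domain batches and one held-out meta batch.
domains = [(torch.randn(16, 8), torch.randn(16)) for _ in range(K)]
meta_x, meta_y = torch.randn(32, 8), torch.randn(32)

def prm_loss(params, x, y):
    # Placeholder loss standing in for the PRM's step-level objective.
    return ((x @ params - y) ** 2).mean()

meta_opt = torch.optim.Adam([log_alpha], lr=1e-2)
inner_lr = 5e-2

for step in range(200):
    # Lower level: one unrolled, domain-weighted gradient step on phi.
    alpha = log_alpha.exp()                     # positivity via log-parameterisation
    train_loss = sum(a * prm_loss(phi, x, y)
                     for a, (x, y) in zip(alpha, domains))
    g = torch.autograd.grad(train_loss, phi, create_graph=True)[0]
    phi_unrolled = phi - inner_lr * g           # still differentiable w.r.t. alpha

    # Upper level: meta loss on held-out data updates the domain weights.
    meta_loss = prm_loss(phi_unrolled, meta_x, meta_y)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()

    # Commit a real lower-level step using the (detached) updated weights.
    alpha = log_alpha.exp().detach()
    train_loss = sum(a * prm_loss(phi, x, y)
                     for a, (x, y) in zip(alpha, domains))
    g = torch.autograd.grad(train_loss, phi)[0]
    with torch.no_grad():
        phi -= inner_lr * g
    phi.grad = None                             # drop grads accumulated by backward()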

Key Components

| Component | Purpose | Highlight |
|-----------|---------|-----------|
| Domain-Reweighted Fine-Tuning | Re-weights K training domains via parameters αₖ | Gives harder, higher-quality datasets greater gradient influence |
| Bi-Level Optimisation (BLO) | Lower level updates PRM weights ϕ; upper level updates α | Learns both model and data weights in one run |
| Aggregation Function Loss | Meta-level loss that mirrors inference-time scoring | Aligns training with real PRM usage |
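As a hedged sketch of what such an aggregation-function loss can look like (the exact form DreamPRM uses may differ): average the PRM's per-step scores into one chain-level score and compare it to the chain's true/false label, mirroring how chains are ranked at inference time.

import torch
import torch.nn.functional as F

def aggregation_loss(step_scores, chain_label):
    # Aggregate per-step PRM scores (assumed in [0, 1]) into one chain
    # score, then compare it to the chain-level correctness label.
    chain_score = step_scores.mean().clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(chain_score, chain_label)

loss = aggregation_loss(torch.tensor([0.9, 0.7, 0.8]), torch.tensor(1.0))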

Learned domain weights

DreamPRM’s learned domain weights span 0.55–1.49, down-weighting noisy sets like AI2D and up-weighting challenging ones like M3CoT. This correlation with dataset difficulty underpins its performance gains.


Quick Start

All commands below are illustrative—rename scripts / paths to match your repo.

1. Code

Clone our repository via the following commands:

git clone https://github.com/coder-qicao/DreamPRM.git
cd DreamPRM

2. Environment

# (a) create conda env
conda create -n dreamprm python=3.10 -y
conda activate dreamprm

# (b) install requirements
pip install -r requirements.txt   # torch, betty, transformers, accelerate, ...

Verify that torch and torchvision installed correctly by running python -c "import torchvision; print(torchvision.__version__)". If it prints the version number without any warnings or errors, you are good to go. Otherwise, uninstall them with conda uninstall pytorch torchvision torchaudio cudatoolkit and reinstall following the official PyTorch installation guide, choosing the command that matches the CUDA version your GPU driver supports (check nvidia-smi).
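An optional one-liner (our addition, not from the repo) that checks torch, torchvision, and CUDA availability together:

python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"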

3. Domain-reweighting

The current version of DreamPRM is built on Qwen2-VL-2B-Instruct. Please download Qwen2-VL weights from https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct.

Domain-reweighting for DreamPRM fine-tuning:

python main.py \
  --train_json_file "data/train.json" \
  --meta_json_file "data/meta.json" \
  --weights_path "weights"

You need at least 80 GB of GPU memory for training.

4. Configuration Parameters

In addition to the data paths, you may want to change the number of epochs and other hyper-parameters, such as iteration_num, unroll_steps, gradiant_accumulation, lr, and scheduler_step_size.

Data & Model Paths

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --train_json_file | str | None | Path to training dataset JSON file |
| --meta_json_file | str | None | Path to meta dataset JSON file |
| --weights_path | str | None | Directory to save/load model weights |
| --reward_model | str | "Qwen/Qwen2-VL-2B-Instruct" | Pretrained reward model identifier |

Training Setup

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --iteration_num | int | 10000 | Total training iterations |
| --batch_size | int | 1 | Training batch size |
| --max_epoch | int | 120 | Maximum training epochs |
| --device | str | "cuda" | Compute device ("cuda" or "cpu") |
| --precision | str | "bf16" | Floating point precision (bf16/fp16/fp32) |
| --strategy | str | "default" | Training strategy (default) |
| --seed | int | 1 | Random seed for reproducibility |
| --local_rank | int | 0 | Local rank for distributed training |

Optimization Parameters

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --lr | float | 5e-7 | Main optimizer learning rate |
| --meta_lr | float | 0.01 | Meta-optimizer learning rate |
| --weight_decay | float | 1e-3 | Weight decay (L2 penalty) |
| --meta_weight_decay | float | 0.0 | Meta-optimizer weight decay |
| --scheduler_step_size | int | 5000 | Steps between LR adjustments |
| --scheduler_gamma | float | 0.5 | LR decay multiplier |

Advanced Training

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --unroll_steps | int | 5 | Unrolled optimization steps |
| --gradiant_accumulation | int | 1 | Gradient accumulation steps |

Checkpoint & Visualization

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --save_every_iterations | int | 1000 | Checkpoint save interval |

Example Usage

python main.py \
  --train_json_file data/train.json \
  --meta_json_file data/meta.json \
  --weights_path models/dreamprm \
  --iteration_num 20000 \
  --lr 1e-6 \
  --meta_lr 0.05 \
  --precision bf16 \
  --reward_model "Qwen/Qwen2-VL-7B-Instruct" \
  --unroll_steps 8 \
  --save_every_iterations 500

Customize Your Datasets

We provide demo datasets with 10 domains (10k training samples) and 500 meta samples in our repository:

data/
├── meta.json
└── train.json

Training Dataset Format (for lower-level optimization)

Each sample in the training dataset should follow this format:

{
    "id": 1128,                   # Unique question identifier
    "sid": 1,                     # Step number identifier
    "input": "Your task is...",    # Full question prompt
    "add": "Step 1: Restate...",   # Model's partial response
    "ground_truth": "1.78947",     # Correct final answer
    "image_path": "dataset/...",   # Path to input image
    "dataset": "chartqa",          # Domain name
    "score": 7,                    # Monte Carlo score
    "times": 11,                   # Monte Carlo iterations
    "accuracy": 0.6363             # Estimated accuracy (0-1)
}

Minimal Custom Training Sample Format:

{
    "input": "...",                # Question prompt (required)
    "add": "Step 1: ...",          # Model's partial response (required)
    "image_path": "xxx.png",       # Input image path (required)
    "dataset": "...",              # Domain name (required)
    "accuracy": 0.6363             # Estimated accuracy (0-1, required)
}
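A small sanity check (our sketch, assuming data/train.json is a JSON list of records in the schema above; verify against your copy of the repo) that counts samples per domain and confirms the stored accuracy matches the Monte Carlo estimate score / times:

import json
from collections import Counter

with open("data/train.json") as f:
    samples = json.load(f)

# Count samples per domain to see how the demo domains are distributed.
print(Counter(s["dataset"] for s in samples))

# The stored accuracy should equal the Monte Carlo estimate score / times.
for s in samples:
    if "score" in s and "times" in s:
        assert abs(s["accuracy"] - s["score"] / s["times"]) < 1e-3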

Meta Dataset Format (for upper-level optimization)

Each sample in the meta dataset should follow this format:

{
    "id": 2,                       # Unique question identifier
    "true_false": True,             # Ground truth label
    "input": "Question: The...",    # Full question + model response
    "image_path": "dataset/..."     # Path to input image
}

Minimal Custom Meta Sample Format:

{
    "true_false": True,             # Ground truth label (required)
    "input": "...",                 # Question + model response (required)
    "image_path": "xxx.png"         # Input image path (required)
}