SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Weiyang Jin*, Yuwei Niu*, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu ✉

contact: xihuiliu@hku.hk

We present SRUM, a post-training reward fine-tuning method built on Unified Multimodal Models (UMMs). It leverages UMMs' inherent understanding capabilities to boost their generative abilities, bridging the performance gaps caused by conflicts during the previous training phase. SRUM demonstrates exceptional generalization across both common positions and world knowledge. The figure below showcases SRUM's qualitative performance compared with SFT and the base model.

📢 News

We sincerely thank all contributors from the open community for their valuable support.

Nov. 15, 2025: We released the official website, model, and report for SRUM. Please upvote our Hugging Face daily paper and try the demo.

📮 Notice

In line with BAGEL's original settings, pay attention to the following:

About Inference Hyperparameters:

cfg_text_scale: Controls how strongly the model follows the text prompt. 1.0 disables text guidance. Typical range: 4.0–8.0.
cfg_image_scale: Controls how much the model preserves input image details. 1.0 disables image guidance. Typical range: 1.0–2.0.
cfg_interval: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: [0.4, 1.0].
timestep_shift: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
num_timesteps: Total denoising steps. Typical: 50.
cfg_renorm_min: Minimum value for CFG-Renorm. 1.0 disables renorm. Typical: 0.
cfg_renorm_type: CFG-Renorm method:
- global: Normalize over all tokens and channels (default for T2I).
- channel: Normalize across channels for each token.
- text_channel: Like channel, but only applies to text condition (good for editing, may cause blur).
If edited images appear blurry, try global CFG-Renorm, decrease cfg_renorm_min or decrease cfg_scale.
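
As a rough illustration of how these settings interact, the sketch below collects the typical values listed above into a plain dictionary and exercises two of them. The helpers `shifted_timesteps` and `cfg_active` are hypothetical and are not part of the BAGEL or SRUM code. The shift formula t' = s*t / (1 + (s - 1)*t) is a common flow-matching convention that is assumed here, and `timestep_shift = 3.0` is only a placeholder because no typical value is stated above.

```python
# Typical values taken from the notes above; timestep_shift has no stated
# typical value, so 3.0 is only an illustrative placeholder.
inference_hyper = {
    "cfg_text_scale": 4.0,
    "cfg_image_scale": 1.0,
    "cfg_interval": [0.4, 1.0],
    "timestep_shift": 3.0,
    "num_timesteps": 50,
    "cfg_renorm_min": 0.0,
    "cfg_renorm_type": "global",
}


def shifted_timesteps(num_timesteps, shift):
    """Timesteps from 1.0 (pure noise) to 0.0, warped by the assumed shift formula."""
    ts = [1.0 - i / (num_timesteps - 1) for i in range(num_timesteps)]
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in ts]


def cfg_active(t, cfg_interval):
    """Assumed rule: apply CFG only while t lies inside cfg_interval."""
    low, high = cfg_interval
    return low <= t <= high


ts = shifted_timesteps(inference_hyper["num_timesteps"], inference_hyper["timestep_shift"])
n_cfg = sum(cfg_active(t, inference_hyper["cfg_interval"]) for t in ts)
print(f"{n_cfg} of {len(ts)} steps use CFG")
```

Under this assumed formula, a larger shift keeps more timesteps close to 1.0, which matches the note that higher values allocate more steps to the start of denoising.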

🔥 Quick Start

1️⃣ Set up environment

git clone https://github.com/WayneJin0918/SRUM
cd SRUM
conda env create -f environment.yaml
conda activate SRUM
pip install -r requirements.txt

If flash-attn is hard to install with pip, install the prebuilt wheel instead:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Alternatively, you can follow the environment settings of BAGEL.
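
The wheel filename above encodes the environment it was built for: `cu12` (CUDA 12), `torch2.5` (PyTorch 2.5), `cp310` (CPython 3.10), and `linux_x86_64`. Below is a minimal sketch of a pre-install check that uses only the Python standard library. `wheel_python_tag` and `wheel_matches_python` are hypothetical helpers, and they check only the interpreter tag, not the CUDA or PyTorch versions.

```python
import re
import sys

WHEEL = "flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"


def wheel_python_tag(wheel_name):
    """Return the (major, minor) CPython version encoded in a wheel filename."""
    match = re.search(r"-cp(\d)(\d+)-", wheel_name)
    if match is None:
        raise ValueError(f"no CPython tag found in {wheel_name!r}")
    return int(match.group(1)), int(match.group(2))


def wheel_matches_python(wheel_name, version=None):
    """True if the wheel targets the given (or the running) Python version."""
    version = tuple(version or sys.version_info[:2])
    return wheel_python_tag(wheel_name) == version


if not wheel_matches_python(WHEEL):
    print(f"Wheel targets Python {wheel_python_tag(WHEEL)}, "
          f"but this interpreter is {tuple(sys.version_info[:2])}")
```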

2️⃣ Download the pretrained BAGEL checkpoint or our SRUM checkpoint

#bagel
from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(cache_dir=cache_dir,
  local_dir=save_dir,
  repo_id=repo_id,
  local_dir_use_symlinks=False,
  resume_download=True,
  allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)

#SRUM
from huggingface_hub import snapshot_download

save_dir = "models/SRUM_BAGEL_7B_MoT"
repo_id = "Wayne-King/SRUM_BAGEL_7B_MoT"
cache_dir = save_dir + "/cache"

snapshot_download(cache_dir=cache_dir,
  local_dir=save_dir,
  repo_id=repo_id,
  local_dir_use_symlinks=False,
  resume_download=True,
  allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
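
Once a download finishes, a quick sanity check can confirm that weight files actually landed in `save_dir`. The sketch below assumes only that a checkpoint contains at least one `*.safetensors` or `*.bin` file at the top level (two of the patterns allowed above). `list_weight_files` is a hypothetical helper, not part of SRUM or huggingface_hub.

```python
from pathlib import Path


def list_weight_files(save_dir):
    """Return weight files (*.safetensors, *.bin) found directly under save_dir."""
    root = Path(save_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.suffix in {".safetensors", ".bin"})


for ckpt in ("models/BAGEL-7B-MoT", "models/SRUM_BAGEL_7B_MoT"):
    weights = list_weight_files(ckpt)
    status = f"{len(weights)} weight file(s)" if weights else "no weights found"
    print(f"{ckpt}: {status}")
```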

🔥 Train & Eval

Train

1️⃣ Data preparation

Use srum_data_infer/compt2i.sh to run multi-GPU image inference. Set the output directory --output_dir to ./your_images_address.

bash srum_data_infer/compt2i.sh

You will then get the image folder ./your_images_address. Next, use srum_data_infer/vlm.sh for scoring. In general, --image_dir in the bash file should be the same as ./your_images_address.

Before running VLM inference, download the SAM weights into the SRUM root directory:

wget https://huggingface.co/HCMUE-Research/SAM-vit-h/resolve/main/sam_vit_h_4b8939.pth

bash srum_data_infer/vlm.sh

You now have the JSONL file your_vlm_output.jsonl and the image folder ./your_images_address. Add both to data/dataset_info.py:

        'comp_data': {
            'jsonl_path': './your_vlm_output.jsonl',
            # Replace 'image_base_dir' with the 'image_dirs' dictionary
            'image_dirs': {
                # Key 'good' for the ground-truth images
                'good': './your_images_address',
                # Key 'bad' for the input images that have rewards same as good one
                'bad': './your_images_address' 
            },
            'num_total_samples': 5911, # total number of samples in dataset
        },
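
The `num_total_samples` value should match the number of records in the JSONL file. Below is a minimal sketch for computing it, assuming the file holds one JSON object per non-empty line; the record schema is not shown here, so the code does not inspect it. `count_jsonl_samples` is a hypothetical helper.

```python
import json
from pathlib import Path


def count_jsonl_samples(jsonl_path):
    """Count non-empty lines that parse as JSON; raise on a malformed line."""
    count = 0
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{jsonl_path}:{line_no}: invalid JSON") from err
            count += 1
    return count


if Path("./your_vlm_output.jsonl").exists():
    print("num_total_samples =", count_jsonl_samples("./your_vlm_output.jsonl"))
```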

Alternatively, you can directly use our training data on Hugging Face (HF) or on ModelScope (MS).

2️⃣ Starting training

Download the base model. Then add the YAML file scripts/data/rft_comp.yaml:

regional_reward:
  dataset_names:
  - comp_data
  image_transform_args:
    image_stride: 256
    max_image_size: 1024
    min_image_size: 512
  num_used_data: # The sum should be larger than NUM_GPUS x NUM_WORKERS
  - 8
  weight: 1

bash scripts/train_reg_comp.sh

Please do not forget to change PYTHONPATH to your SRUM root path, for example /mnt/SRUM. If you are not using 8 GPUs on one node, change --num_shard to your number of GPUs.
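
The comment in the YAML above relates the sum of the `num_used_data` values to NUM_GPUS x NUM_WORKERS. A small sketch for comparing the two quantities follows. It is written against a plain dictionary that mirrors the YAML, so it needs no YAML parser. `num_used_data_budget` is a hypothetical helper, and the GPU and worker counts are placeholders to replace with your own settings.

```python
def num_used_data_budget(config, num_gpus, num_workers):
    """Return (sum of all num_used_data entries, num_gpus * num_workers)."""
    total = sum(sum(group["num_used_data"]) for group in config.values())
    return total, num_gpus * num_workers


# Mirrors scripts/data/rft_comp.yaml from above.
rft_comp = {
    "regional_reward": {
        "dataset_names": ["comp_data"],
        "num_used_data": [8],
        "weight": 1,
    }
}

# Placeholder hardware settings: 4 GPUs with 1 dataloader worker each.
total, required = num_used_data_budget(rft_comp, num_gpus=4, num_workers=1)
print(f"sum(num_used_data) = {total}, NUM_GPUS x NUM_WORKERS = {required}")
```

If the first number does not cover the second, raise the `num_used_data` entries before launching training.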

We also highly recommend that the maximum of --save_every be `--t
