SRUM
Official repo of paper "SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models". A post-training framework that creates a cost-effective, self-iterative optimization loop.
SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
Weiyang Jin*, Yuwei Niu*, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu :email:
contact: xihuiliu@hku.hk
We present SRUM, a post-training reward fine-tuning method for Unified Multimodal Models (UMMs) that leverages UMMs' inherent understanding capabilities to boost their generative abilities, bridging the performance gaps caused by conflicts during the previous training phase. SRUM demonstrates exceptional generalization across both common positions and world knowledge. The figure below showcases SRUM's qualitative performance compared with SFT and the base model.
📢 News
We sincerely thank all contributors from the open community for their valuable support.
- Nov. 15, 2025: We released the official website, model, and report for SRUM. Please upvote our Hugging Face daily paper and try the demo.
📮 Notice
Following BAGEL's original settings, note the following:
About Inference Hyperparameters:
- `cfg_text_scale`: Controls how strongly the model follows the text prompt. `1.0` disables text guidance. Typical range: `4.0–8.0`.
- `cfg_image_scale`: Controls how much the model preserves input image details. `1.0` disables image guidance. Typical range: `1.0–2.0`.
- `cfg_interval`: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: `[0.4, 1.0]`.
- `timestep_shift`: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
- `num_timesteps`: Total denoising steps. Typical: `50`.
- `cfg_renorm_min`: Minimum value for CFG-Renorm. `1.0` disables renorm. Typical: `0`.
- `cfg_renorm_type`: CFG-Renorm method:
  - `global`: Normalize over all tokens and channels (default for T2I).
  - `channel`: Normalize across channels for each token.
  - `text_channel`: Like `channel`, but only applies to text condition (good for editing, may cause blur).
- If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min`, or decrease `cfg_scale`.
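To make the roles of `cfg_text_scale` and `cfg_interval` more concrete, here is a minimal sketch of textbook classifier-free guidance on plain Python lists. It is not taken from the BAGEL or SRUM code: the function name `apply_cfg`, the normalized timestep `t`, and the assumption that guidance is applied only while `t` lies inside `cfg_interval` are all illustrative.

```python
def apply_cfg(cond, uncond, scale, t, cfg_interval=(0.4, 1.0)):
    """Textbook classifier-free guidance (illustrative, not the repo's code).

    cond / uncond: model predictions with and without the text prompt.
    scale:         guidance strength; 1.0 returns the conditional prediction unchanged.
    t:             hypothetical normalized timestep in [0, 1].
    cfg_interval:  guidance is applied only while t is inside this range (assumption).
    """
    lo, hi = cfg_interval
    if not (lo <= t <= hi) or scale == 1.0:
        # Outside the interval (or with scale 1.0) no extra guidance is added,
        # so the unconditional branch would not need to be evaluated at all.
        return list(cond)
    # Push the prediction away from the unconditional one.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

Under these assumptions, a scale of `1.0` reduces to the plain conditional prediction, which matches the note that `1.0` disables text guidance, and steps outside `cfg_interval` skip the guided combination.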
🔥 Quick Start
1️⃣ Set up environment
git clone https://github.com/WayneJin0918/SRUM
cd SRUM
conda env create -f environment.yaml
conda activate SRUM
pip install -r requirements.txt
If flash-attn is hard to install with pip, download and install the prebuilt wheel:
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
Alternatively, follow the environment settings of BAGEL.
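The prebuilt flash-attn wheel above only installs on an interpreter that matches the tags in its filename (`cp310` and `linux_x86_64`). The sketch below checks those two tags against the running interpreter before you download anything. The helper names are hypothetical, it assumes the filename carries no optional build tag, and it does not verify the CUDA or PyTorch versions encoded in the version string.

```python
import platform
import sys

WHEEL = "flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"


def parse_wheel_tags(filename):
    """Return the python, abi and platform tags from a wheel filename.

    Assumes the layout name-version-python-abi-platform.whl (no build tag).
    """
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    return {"python": parts[-3], "abi": parts[-2], "platform": parts[-1]}


def wheel_matches_interpreter(filename):
    """Check the wheel's Python and platform tags against this interpreter."""
    tags = parse_wheel_tags(filename)
    py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    plat_tag = f"{platform.system().lower()}_{platform.machine().lower()}"
    return tags["python"] == py_tag and tags["platform"] == plat_tag


if __name__ == "__main__":
    print(parse_wheel_tags(WHEEL))
    print("compatible:", wheel_matches_interpreter(WHEEL))
```

If the check reports a mismatch, pick the wheel matching your Python version from the same flash-attention release page, or fall back to the BAGEL environment settings.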
2️⃣ Download the pretrained BAGEL checkpoint or our SRUM checkpoint
# BAGEL
from huggingface_hub import snapshot_download
save_dir = "models/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"
snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
# SRUM
from huggingface_hub import snapshot_download
save_dir = "models/SRUM_BAGEL_7B_MoT"
repo_id = "Wayne-King/SRUM_BAGEL_7B_MoT"
cache_dir = save_dir + "/cache"
snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
🔥 Train & Eval
Train
1️⃣ Data preparation
Use srum_data_infer/compt2i.sh to run image inference on multiple GPUs. Set the output directory --output_dir to ./your_images_address.
bash srum_data_infer/compt2i.sh
This produces the image folder ./your_images_address. Next, use srum_data_infer/vlm.sh for scoring. In general, --image_dir in the script should be the same as ./your_images_address.
Before running VLM inference, download the SAM weights into the SRUM root directory:
wget https://huggingface.co/HCMUE-Research/SAM-vit-h/resolve/main/sam_vit_h_4b8939.pth
bash srum_data_infer/vlm.sh
You now have the JSONL file your_vlm_output.jsonl and the image folder ./your_images_address. Add both to data/dataset_info.py:
'comp_data': {
    'jsonl_path': './your_vlm_output.jsonl',
    # Replace 'image_base_dir' with the 'image_dirs' dictionary
    'image_dirs': {
        # Key 'good' for the ground-truth images
        'good': './your_images_address',
        # Key 'bad' for the input images that have rewards same as good one
        'bad': './your_images_address'
    },
    'num_total_samples': 5911, # total number of samples in dataset
},
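When you generate your own data, `num_total_samples` has to be updated to match it. The sketch below counts the records in the scoring output, assuming your_vlm_output.jsonl holds one JSON record per non-empty line and that each record is one training sample. That layout is an assumption, not something the repository states, and `count_jsonl_samples` is a hypothetical helper.

```python
import json


def count_jsonl_samples(jsonl_path):
    """Count non-empty lines that parse as JSON (assumed: one sample per line)."""
    count = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no} is not valid JSON") from exc
            count += 1
    return count
```

Calling `count_jsonl_samples("./your_vlm_output.jsonl")` then gives a candidate value for `num_total_samples`, and a malformed line is reported before training starts.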
Alternatively, you can directly use our training data from Hugging Face (HF) or ModelScope (MS).
2️⃣ Starting training
Download the base model. Then add the YAML file scripts/data/rft_comp.yaml:
regional_reward:
  dataset_names:
  - comp_data
  image_transform_args:
    image_stride: 256
    max_image_size: 1024
    min_image_size: 512
  num_used_data: # The sum should be larger than NUM_GPUS x NUM_WORKERS
  - 8
  weight: 1
bash scripts/train_reg_comp.sh
Please do not forget to change PYTHONPATH to your SRUM root path, for example /mnt/SRUM. If you are not using 8 GPUs on one node, change --num_shard to your number of GPUs.
And we highly recommend max of --save_every is `--t
