# BAGEL
An open-source unified multimodal model.
Unified Model for Multimodal Understanding and Generation
<p align="center"><img src="assets/teaser.webp" width="95%"></p>

<!--
## 🧠 Method
BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model’s capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.

BAGEL scales MoT’s capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.

<p align="center"><img src="assets/arch.png" width="95%"></p>

## 🌱 Emerging Properties
<p align="center"><img src="assets/emerging_curves.png" width="95%"></p>

As we scale up BAGEL’s pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages: multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern, where advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context in enabling complex multimodal reasoning.
-->

Chaorui Deng*, Deyao Zhu*, Kunchang Li*, Chenhui Gou*, Feng Li*, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi :email:, Haoqi Fan* :tophat:
contact: shiguang.sg@bytedance.com
We present BAGEL, an open-source multimodal foundation model with 7B active parameters (14B total) trained on large-scale interleaved multimodal data. BAGEL outperforms the current top-tier open-source VLMs like Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text-to-image quality that is competitive with strong specialist generators such as SD3. Moreover, BAGEL delivers qualitative results in classical image-editing scenarios superior to those of the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models. The figure below showcases BAGEL's qualitative performance.
## 📢 News
We sincerely thank all contributors from the open community for their valuable support.
- June 15, 2025: We have updated and fixed the evaluation results for KRIS-Bench and RISEBench. Our model, BAGEL, demonstrates performance comparable to Gemini 2.0 on these reasoning benchmarks. We have also released the evaluation code for both KRIS-Bench and RISEBench, along with ImgEdit-Bench. For further details, please refer to EVAL.
- Jun 5, 2025: Thanks to @davideuler for contributing the Dockerfile with prebuilt flash_attn.
- May 30, 2025: Many thanks to @prartio for contributing the Windows 11 installation guideline, and to @gluttony-10 for work on quantized inference.
- May 29, 2025: Special thanks to @jnc-nj for contributing the Dockerfile.
- May 26, 2025: Thanks to @neverbiasu for contributing ComfyUI.
- May 25, 2025: Special thanks to @LeanModels for providing the DF11-compressed version, and to @Gapeleon for the INT8-compressed version. We also appreciate @gluttony-10 for contributions to the Windows package.
- May 24, 2025: Together with @wangwei1237, @gluttony-10, and @KingNish24, we built a Gradio app and launched a Hugging Face Space.
- May 23, 2025: We have provided a training guideline in TRAIN.
- May 20, 2025: We released the official website, demo, model, and report for BAGEL.
## 📮 Notice
Call for Bad Cases: If you encounter any cases where the model performs poorly, we would greatly appreciate it if you could share them in issue #11 or on Discord.
About Inference Hyperparameters:
- `cfg_text_scale`: Controls how strongly the model follows the text prompt. `1.0` disables text guidance. Typical range: `4.0–8.0`.
- `cfg_image_scale`: Controls how much the model preserves input image details. `1.0` disables image guidance. Typical range: `1.0–2.0`.
- `cfg_interval`: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: `[0.4, 1.0]`.
- `timestep_shift`: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
- `num_timesteps`: Total denoising steps. Typical: `50`.
- `cfg_renorm_min`: Minimum value for CFG-Renorm. `1.0` disables renorm. Typical: `0`.
- `cfg_renorm_type`: CFG-Renorm method:
  - `global`: Normalize over all tokens and channels (default for T2I).
  - `channel`: Normalize across channels for each token.
  - `text_channel`: Like `channel`, but only applied to the text condition (good for editing, but may cause blur).
- If edited images appear blurry, try `global` CFG-Renorm, or decrease `cfg_renorm_min` or `cfg_scale`.
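The knobs above follow common diffusion-sampler conventions. Below is a rough numpy sketch of what classifier-free guidance, `global` CFG-Renorm, and a timestep shift typically compute; BAGEL's actual sampler may differ, and the shift schedule shown is an assumption borrowed from flow-matching samplers, not taken from this repo.

```python
import numpy as np

def cfg_combine(cond, uncond, scale):
    # Classifier-free guidance: move the unconditional prediction
    # toward the conditional one; scale=1.0 disables guidance.
    return uncond + scale * (cond - uncond)

def cfg_renorm_global(guided, cond, renorm_min=0.0):
    # "global" CFG-Renorm: rescale the guided prediction over all
    # tokens and channels so its norm does not exceed that of the
    # conditional prediction; renorm_min floors the scaling factor.
    factor = np.linalg.norm(cond) / (np.linalg.norm(guided) + 1e-8)
    factor = np.clip(factor, renorm_min, 1.0)
    return guided * factor

def shifted_timesteps(num_timesteps=50, shift=3.0):
    # One common timestep-shift schedule from flow-matching samplers:
    # t' = shift * t / (1 + (shift - 1) * t). Higher shift spends
    # more steps near t=1, the early, layout-forming part of denoising.
    t = np.linspace(1.0, 0.0, num_timesteps)
    return shift * t / (1.0 + (shift - 1.0) * t)

rng = np.random.default_rng(0)
cond, uncond = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))

guided = cfg_combine(cond, uncond, scale=6.0)  # cfg_text_scale in the 4.0–8.0 range
renormed = cfg_renorm_global(guided, cond)
print(np.linalg.norm(renormed) <= np.linalg.norm(cond) + 1e-6)  # prints True
```

The renorm step is why a large guidance scale is less likely to blow out contrast: the guided direction is kept, but its overall magnitude is capped by the conditional prediction's norm.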
## 🔥 Quick Start
### 1️⃣ Set up environment
```shell
git clone https://github.com/bytedance-seed/BAGEL.git
cd BAGEL
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
pip install flash_attn==2.5.8 --no-build-isolation
```
### 2️⃣ Download pretrained checkpoint
```python
from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```
### 3️⃣ Use Gradio WebUI to start playing with BAGEL!
```shell
# For GPUs with 32GB+ VRAM, or multiple GPUs.
python app.py

# For GPUs with 12–32GB VRAM: NF4 quantization recommended (--zh enables the Chinese interface).
python app.py --mode 2 --zh

# For GPUs with 22–32GB VRAM: INT8 quantization recommended.
python app.py --mode 3
```
## 🔥 Train & Eval
### Train
```shell
bash scripts/train.sh
```
You can replace the variables in the script with your own model, data, and hyperparameter settings.
