# Emu3.5

**Native Multimodal Models are World Learners**
Emu3.5 Team, BAAI
Project Page | 🤗HF Models | Paper | App
<div align='center'>
  <img src="./assets/arch.png" class="interpolation-image" alt="arch." height="100%" width="100%" />
</div>
<div align='center'>
  <img src="./assets/co.png" class="interpolation-image" alt="arch." height="90%" width="90%" />
</div>

🔔 Latest: Emu3.5 Web & Mobile Apps and vLLM offline inference are live — see [🔥 News](#news) for details.
| 🔹 | Core Concept | Description |
| :-: | :-- | :-- |
| 🧠 | Unified World Modeling | Predicts the next state jointly across vision and language, enabling coherent world modeling and generation. |
| 🧩 | End-to-End Pretraining | Trained with a unified next-token prediction objective over interleaved vision–language sequences. |
| 📚 | 10T+ Multimodal Tokens | Pre-trained on over 10 trillion interleaved tokens from video frames and transcripts, capturing spatiotemporal structure. |
| 🔄 | Native Multimodal I/O | Processes and generates interleaved visual–text sequences without modality adapters or task-specific heads. |
| 🎯 | RL Post-Training | Large-scale reinforcement learning enhances reasoning, compositionality, and generation quality. |
| ⚡ | Discrete Diffusion Adaptation (DiDA) | Converts sequential decoding into bidirectional parallel prediction, achieving ≈20× faster inference without performance loss. |
| 🖼️ | Versatile Generation | Excels in long-horizon vision–language generation, any-to-image (X2I) synthesis, and text-rich image creation. |
| 🌐 | Generalizable World Modeling | Enables spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios. |
| 🏆 | Performance Benchmark | Matches Gemini 2.5 Flash Image (Nano Banana) on image generation/editing, and outperforms it on interleaved generation tasks. |
<a id="news"></a>
## 🔥 News
- 2025-11-28 · 🌐 Emu3.5 Web & Mobile Apps Live — the official product experience is now available on the web at zh.emu.world (Mainland China) and emu.world (global) 🎉 The new homepage highlights featured cases and a “Get Started” entry, while the workspace and mobile apps bring together creation, an inspiration feed, history, profile, and language switching across web, Android APK, and H5. (See more details below.)
- 2025-11-19 · 🚀 vLLM Offline Inference Released — meet `inference_vllm.py` with a new cond/uncond batch scheduler, delivering 4–5× faster end-to-end generation on vLLM 0.11.0 across Emu3.5 tasks. Jump to #Run Inference with vLLM for setup guidance and see PR #47 for full details.
- 2025-11-17 · 🎛️ Gradio Demo (Transformers Backend) — introduced `gradio_demo_image.py` and `gradio_demo_interleave.py` presets for the standard Transformers runtime, providing turnkey T2I/X2I and interleaved generation experiences with streaming output. Try the commands in #Gradio Demo to launch both UIs locally.
## Table of Contents

1. Model & Weights
2. Quick Start
3. Gradio Demo

## 1. Model & Weights
| Model name | HF Weight |
| :-- | :-- |
| Emu3.5 | 🤗 HF link |
| Emu3.5-Image | 🤗 HF link |
| Emu3.5-VisionTokenizer | 🤗 HF link |
Note:
- Emu3.5 supports general-purpose multimodal predictions, including interleaved image-text generation and single-image generation (T2I/X2I) tasks.
- Emu3.5-Image focuses on T2I/X2I tasks and delivers the best performance in these scenarios.
- Both models are pure next-token predictors without DiDA acceleration (each image may take several minutes to generate).
- ⚡ Stay tuned for DiDA-accelerated weights.
💡 Usage tip:
For interleaved image-text generation, use Emu3.5.
For single-image generation (T2I and X2I), use Emu3.5-Image for the best quality.
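The tip above can be captured in a tiny helper that picks a checkpoint from the task type. `recommended_model` is an illustrative name introduced here, not a repo API:

```python
def recommended_model(task_type: str) -> str:
    """Return the checkpoint this README recommends for a given task type."""
    if task_type in {"t2i", "x2i"}:
        return "BAAI/Emu3.5-Image"  # best quality for single-image generation
    return "BAAI/Emu3.5"            # interleaved image-text tasks (howto, story, ...)
```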
## 2. Quick Start

### Environment Setup

```shell
# Requires Python 3.12 or higher.
git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements/transformers.txt
pip install flash_attn==2.8.3 --no-build-isolation
```
Configuration
Edit configs/config.py to set:
- Paths:
model_path,vq_pathYou can use either a local path (e.g., downloaded HuggingFace weights) or a remote HuggingFace Hub ID for automatic download:vq_path = "BAAI/Emu3.5-VisionTokenizer" # remote, auto-download model_path = "BAAI/Emu3.5" # remote, auto-download # or vq_path = "/path/to/local/Emu3.5-VisionTokenizer" # local path model_path = "/path/to/local/Emu3.5" # local path - Task template:
task_type in {t2i, x2i, howto, story, explore, vla} - Input image:
use_image(True to provide reference images, controls <|IMAGE|> token); setreference_imagein each prompt to specify the image path. For x2i task, we recommand usingreference_imageas a list containing single/multiple image paths to be compatible with multi-image input. - Sampling:
sampling_params(classifier_free_guidance, temperature, top_k/top_p, etc.) - Aspect Ratio (for t2i task):
aspect_ratio("4:3", "21:9", "1:1", "auto" etc..)
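To make the fields above concrete, here is a hypothetical config module in the shape this README describes. The field names come from this README; the file name and all sampling values are illustrative guesses, not defaults shipped with the repo:

```python
# configs/example_t2i_sketch.py (hypothetical file name)
model_path = "BAAI/Emu3.5-Image"  # remote HF Hub ID, or a local checkpoint dir
vq_path = "BAAI/Emu3.5-VisionTokenizer"

task_type = "t2i"      # one of: t2i, x2i, howto, story, explore, vla
use_image = False      # plain T2I needs no reference image
aspect_ratio = "4:3"   # e.g. "4:3", "21:9", "1:1", "auto"

# Illustrative sampling values only; tune against the shipped example configs.
sampling_params = dict(
    classifier_free_guidance=3.0,
    temperature=1.0,
    top_k=2048,
    top_p=0.9,
)
```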
### Run Inference

```shell
python inference.py --cfg configs/config.py
```
### Example Configurations by Task

Below are example commands for different tasks. Set `CUDA_VISIBLE_DEVICES` according to your available GPUs.

```shell
# 🖼️ Text-to-Image (T2I) task
CUDA_VISIBLE_DEVICES=0 python inference.py --cfg configs/example_config_t2i.py

# 🔄 Any-to-Image (X2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_x2i.py

# 🎯 Visual Guidance task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_guidance.py

# 📖 Visual Narrative task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_narrative.py
```

After running inference, the model writes results in protobuf format (`.pb` files) for each input prompt.
Protobuf outputs are written to `outputs/<exp_name>/proto/`. For better throughput, we recommend using at least 2 GPUs.
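Since each prompt yields a `.pb` file under the proto directory, collecting them for later visualization is a one-liner; `find_protos` is an illustrative helper introduced here, not part of the repo:

```python
from pathlib import Path

def find_protos(root):
    """Recursively collect generated .pb files under an outputs directory,
    sorted for a stable processing order."""
    return sorted(Path(root).rglob("*.pb"))
```

You can pass either `outputs/<exp_name>` or its `proto/` subdirectory; `rglob` scans recursively in both cases.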
### Run Inference with vLLM

#### vLLM Environment Setup

- [Optional but recommended] Create a fresh virtual environment for the vLLM backend:

  ```shell
  conda create -n Emu3p5 python=3.12
  ```

- Install vLLM and apply the patch files:

  ```shell
  # Requires Python 3.12 or higher.
  # Recommended: CUDA 12.8.
  cd Emu3.5
  pip install -r requirements/vllm.txt
  pip install flash_attn==2.8.3 --no-build-isolation
  python src/patch/apply.py
  ```
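The Python floor stated in the comments above can be sanity-checked before installing; `meets_python_floor` is an illustrative helper introduced here, not part of the repo:

```python
import sys

def meets_python_floor(version=None, floor=(3, 12)):
    """Return True when `version` (defaults to the running interpreter)
    satisfies the required (major, minor) floor, e.g. Python >= 3.12."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= tuple(floor)
```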
#### Example Configurations by Task

```shell
# 🖼️ Text-to-Image (T2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_t2i.py

# 🔄 Any-to-Image (X2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_x2i.py

# 🎯 Visual Guidance task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_guidance.py

# 📖 Visual Narrative task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_narrative.py
```
### Visualize Protobuf Outputs

To visualize generated protobuf files:

```shell
python src/utils/vis_proto.py --input <input_proto_path> [--output <output_dir>] [--video]
```

- `--input`: a single `.pb` file or a directory; directories are scanned recursively.
- `--output`: optional; defaults to `<input_dir>/results/<file_stem>` for files, or `<parent_dir_of_input>/results` for directories.
- `--video`: generate a video visualization for interleaved output.
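The `--output` defaults above can be expressed as a small pure function. This is an illustrative re-implementation of the rule as described; the repo's actual logic may differ, and here a path ending in `.pb` stands in for a file input:

```python
from pathlib import Path

def default_output(input_path):
    """Mirror the documented --output defaults: a .pb file maps to
    <input_dir>/results/<file_stem>; a directory maps to
    <parent_dir_of_input>/results."""
    p = Path(input_path)
    if p.suffix == ".pb":
        return p.parent / "results" / p.stem
    return p.parent / "results"
```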
Expected output directory layout (example):

```text
results/<pb_name>/
├── 000_question.txt
├── 000_global_cot.txt
├── 001_text.txt
├── 001_00_image.png
├── 001_00_image_cot.txt
├── 002_text.txt
├── 002_00_image.png
├── ...
└── video.mp4   # only when --video is enabled
```
Each `*_text.txt` stores decoded text segments, each `*_image.png` stores a generated frame, and the matching `*_image_cot.txt` keeps image-level chain-of-thought notes when available.
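The file-naming scheme above can be parsed with a small regex helper; the pattern and function name are introduced here for illustration and are not part of the repo:

```python
import re

# <step>_<optional image index>_<kind>.<ext>, e.g. "001_00_image_cot.txt"
SEGMENT_RE = re.compile(
    r"^(\d{3})(?:_(\d{2}))?_(question|global_cot|text|image_cot|image)\.(txt|png)$"
)

def parse_segment(name):
    """Split an output file name into step, optional image index, and kind.
    Returns None for non-matching names such as video.mp4."""
    m = SEGMENT_RE.match(name)
    if not m:
        return None
    step, image_idx, kind, _ext = m.groups()
    return {
        "step": int(step),
        "image_idx": int(image_idx) if image_idx is not None else None,
        "kind": kind,
    }
```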
## 3. Gradio Demo

We provide two Gradio demos for different application scenarios.

**Emu3.5-Image Demo**: an interactive interface optimized for Text-to-Image (T2I) and Any-to-Image (X2I) tasks:

```shell
CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_image.py --host 0.0.0.0 --port 7860
```

**Emu3.5-Interleave Demo**: launches the demo for interleaved tasks (Visual Guidance and Visual Narrative):

```shell
CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_interleave.py --host 0.0.0.0 --port 7860
```
### Features

- **Image Generation**: supports text-to-image generation and multimodal image generation.
- **Interleaved Generation**: supports long-sequence creation.
