
<div align='center'> <h1>Emu3.5: Native Multimodal Models are World Learners</h1>

Emu3.5 Team, BAAI

Project Page | 🤗HF Models | Paper | App

</div>

🔔 Latest: Emu3.5 Web & Mobile Apps and vLLM offline inference are live — see 🔥 News for details.

<div align='center'> <img src="./assets/arch.png" class="interpolation-image" alt="arch." height="100%" width="100%" /> </div> <div align='center'> <img src="./assets/co.png" class="interpolation-image" alt="arch." height="90%" width="90%" /> </div>

| 🔹 | Core Concept | Description |
| :-: | :--- | :--- |
| 🧠 | Unified World Modeling | Predicts the next state jointly across vision and language, enabling coherent world modeling and generation. |
| 🧩 | End-to-End Pretraining | Trained with a unified next-token prediction objective over interleaved vision–language sequences. |
| 📚 | 10T+ Multimodal Tokens | Pre-trained on over 10 trillion interleaved tokens from video frames and transcripts, capturing spatiotemporal structure. |
| 🔄 | Native Multimodal I/O | Processes and generates interleaved visual–text sequences without modality adapters or task-specific heads. |
| 🎯 | RL Post-Training | Large-scale reinforcement learning enhances reasoning, compositionality, and generation quality. |
| ⚡ | Discrete Diffusion Adaptation (DiDA) | Converts sequential decoding into bidirectional parallel prediction, achieving ≈20× faster inference without performance loss. |
| 🖼️ | Versatile Generation | Excels in long-horizon vision–language generation, any-to-image (X2I) synthesis, and text-rich image creation. |
| 🌐 | Generalizable World Modeling | Enables spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios. |
| 🏆 | Performance Benchmark | Matches Gemini 2.5 Flash Image (Nano Banana) on image generation/editing, and outperforms it on interleaved generation tasks. |

<a id="news"></a>

🔥 News

  • 2025-11-28 · 🌐 Emu3.5 Web & Mobile Apps Live — Official product experience is now available on the web at zh.emu.world (Mainland China) and emu.world (global) 🎉 The new homepage highlights featured cases and a “Get Started” entry, while the workspace and mobile apps bring together creation, inspiration feed, history, profile, and language switch across web, Android APK, and H5. (See more details below.)
  • 2025-11-19 · 🚀 vLLM Offline Inference Released — Meet inference_vllm.py with a new cond/uncond batch scheduler, delivering 4–5× faster end-to-end generation on vLLM 0.11.0 across Emu3.5 tasks. Jump to #Run Inference with vLLM for setup guidance and see PR #47 for full details.
  • 2025-11-17 · 🎛️ Gradio Demo (Transformers Backend) — Introduced gradio_demo_image.py and gradio_demo_interleave.py presets for the standard Transformers runtime, providing turnkey T2I/X2I and interleaved generation experiences with streaming output. Try the commands in #Gradio Demo to launch both UIs locally.

Table of Contents

  1. Model & Weights
  2. Quick Start
  3. Gradio Demo
  4. Schedule
  5. Citation

1. Model & Weights

| Model name | HF Weight |
| ---------------------- | --------- |
| Emu3.5 | 🤗 HF link |
| Emu3.5-Image | 🤗 HF link |
| Emu3.5-VisionTokenizer | 🤗 HF link |

Note:

  • Emu3.5 supports general-purpose multimodal predictions, including interleaved image-text generation and single-image generation (T2I/X2I) tasks.
  • Emu3.5-Image is a model focused on T2I/X2I tasks for best performance on these scenarios.
  • Both models are pure next-token predictors without DiDA acceleration (each image may take several minutes to generate).
  • Stay tuned for DiDA-accelerated weights.

💡 Usage tip:
For interleaved image-text generation, use Emu3.5.
For single-image generation (T2I and X2I), use Emu3.5-Image for the best quality.

2. Quick Start

Environment Setup

# Requires Python 3.12 or higher.
git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements/transformers.txt
pip install flash_attn==2.8.3 --no-build-isolation

Configuration

Edit configs/config.py to set:

  • Paths: model_path, vq_path. You can use either a local path (e.g., downloaded HuggingFace weights) or a remote HuggingFace Hub ID for automatic download:
    vq_path = "BAAI/Emu3.5-VisionTokenizer"  # remote, auto-download
    model_path = "BAAI/Emu3.5"               # remote, auto-download
    # or
    vq_path = "/path/to/local/Emu3.5-VisionTokenizer"  # local path
    model_path = "/path/to/local/Emu3.5"               # local path
    
  • Task template: task_type in {t2i, x2i, howto, story, explore, vla}
  • Input image: use_image (set True to provide reference images; controls the <|IMAGE|> token). Set reference_image in each prompt to specify the image path. For the x2i task, we recommend passing reference_image as a list of one or more image paths, so it is compatible with multi-image input.
  • Sampling: sampling_params (classifier_free_guidance, temperature, top_k/top_p, etc.)
  • Aspect ratio (t2i task only): aspect_ratio ("4:3", "21:9", "1:1", "auto", etc.)
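Tying the options above together, a minimal config might look like the sketch below. This is illustrative only: the field names follow the bullets above, but the prompts structure and the sampling values are assumptions; check the shipped example configs (e.g., configs/example_config_x2i.py) for the authoritative schema.

```python
# Hypothetical minimal config sketch for an X2I run.
# Field names follow the README; exact schema may differ in the repo.

model_path = "BAAI/Emu3.5-Image"          # or a local checkpoint path
vq_path = "BAAI/Emu3.5-VisionTokenizer"   # vision tokenizer weights

task_type = "x2i"   # one of: t2i, x2i, howto, story, explore, vla
use_image = True    # enables the <|IMAGE|> token for reference images

# "prompts" is an assumed structure for illustration.
prompts = [
    {
        "prompt": "Restyle the photo as a watercolor painting.",
        "reference_image": ["/path/to/input.png"],  # list supports multi-image input
    },
]

sampling_params = {
    "classifier_free_guidance": 3.0,  # illustrative values, not tuned defaults
    "temperature": 1.0,
    "top_k": 2048,
    "top_p": 1.0,
}

aspect_ratio = "auto"   # t2i only: "4:3", "21:9", "1:1", "auto", ...
```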

Run Inference

python inference.py --cfg configs/config.py

Example Configurations by Task

Below are example commands for different tasks. Make sure to set CUDA_VISIBLE_DEVICES according to your available GPUs.

# 🖼️ Text-to-Image (T2I) task
CUDA_VISIBLE_DEVICES=0 python inference.py --cfg configs/example_config_t2i.py

# 🔄 Any-to-Image (X2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_x2i.py

# 🎯 Visual Guidance task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_guidance.py

# 📖 Visual Narrative task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_narrative.py


After running inference, results are written in protobuf format (one .pb file per input prompt) to outputs/<exp_name>/proto/. For better throughput, we recommend using at least 2 GPUs.

Run Inference with vLLM

vLLM Environment Setup

  1. (Optional but recommended) Create a dedicated virtual environment for the vLLM backend:
conda create -n Emu3p5 python=3.12
  2. Install vLLM and apply the patch files:
# Requires Python 3.12 or higher.
# Recommended: CUDA 12.8.
pip install -r requirements/vllm.txt
pip install flash_attn==2.8.3 --no-build-isolation

cd Emu3.5
python src/patch/apply.py

Example Configurations by Task

# 🖼️ Text-to-Image (T2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_t2i.py

# 🔄 Any-to-Image (X2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_x2i.py

# 🎯 Visual Guidance task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_guidance.py

# 📖 Visual Narrative task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_narrative.py

Visualize Protobuf Outputs

To visualize generated protobuf files (pass --video to additionally render a video for interleaved output):

python src/utils/vis_proto.py --input <input_proto_path> [--output <output_dir>] [--video]
  • --input: supports a single .pb file or a directory; directories are scanned recursively.
  • --output: optional; defaults to <input_dir>/results/<file_stem> for files, or <parent_dir_of_input>/results for directories.
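For example, assuming an experiment whose outputs live under outputs/my_run (a hypothetical path), the two invocation styles look like:

```shell
# Visualize a single protobuf file
python src/utils/vis_proto.py --input outputs/my_run/proto/000.pb

# Recursively visualize a directory, rendering videos for interleaved output
python src/utils/vis_proto.py --input outputs/my_run/proto --video
```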

Expected output directory layout (example):

results/<pb_name>/
├── 000_question.txt
├── 000_global_cot.txt
├── 001_text.txt
├── 001_00_image.png
├── 001_00_image_cot.txt
├── 002_text.txt
├── 002_00_image.png
├── ...
└── video.mp4              # only when --video is enabled

Each *_text.txt stores decoded segments, *_image.png stores generated frames, and matching *_image_cot.txt keeps image-level chain-of-thought notes when available.
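As a minimal sketch of consuming this layout downstream (summarize_results is a hypothetical helper, not part of the repo), you can enumerate the decoded segments in generation order:

```python
import re
from pathlib import Path

def summarize_results(results_dir):
    """List the files of a results/<pb_name>/ directory in segment order.

    Assumes the naming scheme shown above: each decoded file starts with a
    three-digit segment index, e.g. 001_text.txt or 001_00_image.png.
    Returns (segment_index, filename) pairs, skipping non-conforming files
    such as video.mp4.
    """
    entries = []
    for path in sorted(Path(results_dir).iterdir()):
        match = re.match(r"(\d{3})_(.+)", path.name)
        if match:  # keep only files that follow the segment naming scheme
            entries.append((match.group(1), path.name))
    return entries
```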

3. Gradio Demo

We provide two Gradio Demos for different application scenarios:

Emu3.5-Image Demo, an interactive interface optimized for Text-to-Image (T2I) and Any-to-Image (X2I) tasks:

CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_image.py --host 0.0.0.0 --port 7860

Emu3.5-Interleave Demo, covering the interleaved tasks (Visual Guidance and Visual Narrative):

CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_interleave.py --host 0.0.0.0 --port 7860

Features

  • Image Generation: supports text-to-image and multimodal (reference-conditioned) image generation
  • Interleaved Generation: supports long-sequence interleaved image-text creation