gWorld

Generative Visual Code Mobile World Model

gWorld is the first open-weight, single self-contained Vision-Language Model (VLM) specialized for visual mobile GUI world modeling. It predicts the next GUI state as executable web code rather than generating pixels directly. It is available in two sizes: gWorld-8B and gWorld-32B.

[Figure: gWorld overview]

Key Features

Action-conditioned next-state prediction for mobile GUIs
Pixel-perfect text rendering and structurally accurate layouts
Code-based generation overcomes hallucination and legibility issues of pixel-generation models
Fast rendering (~0.3s via Playwright, faster than multi-step diffusion pipelines)
Low render failure rate (<1%)

Model Variants

| Property | gWorld-8B | gWorld-32B |
|----------|-----------|------------|
| Base Model | Qwen/Qwen3-VL-8B-Instruct | Qwen/Qwen3-VL-32B |
| Parameters | 9B | 33B |
| Tensor Type | BF16 | BF16 |
| License | Apache 2.0 | Apache 2.0 |
| HuggingFace | trillionlabs/gWorld-8B | trillionlabs/gWorld-32B |

Performance

Both models are highly parameter-efficient on mobile world modeling, outperforming much larger models:

gWorld-8B

Outperforms frontier models up to 50.25x larger (e.g., Llama 4 402B-A17B)
+45.7% gain in Instruction Accuracy over base Qwen3-VL
High zero-shot performance on AndroidWorld and KApps (Korean) benchmarks

gWorld-32B

Outperforms frontier models up to 12.6x larger (e.g., Llama 4 402B-A17B)
+27.1% gain in Instruction Accuracy over base Qwen3-VL
High zero-shot generalization on out-of-distribution benchmarks
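
The size multipliers quoted above are consistent with dividing the comparison model's total parameter count by each gWorld model's nominal size. That reading is an assumption rather than something stated here; the hypothetical helper `size_ratio` below only reproduces the arithmetic.

```python
# Total parameters (in billions) of the comparison model as quoted above.
COMPARISON_TOTAL_B = 402

def size_ratio(larger_b, smaller_b):
    """How many times larger one model is than another, by parameter count."""
    return larger_b / smaller_b

# Nominal sizes taken from the model names (assumption: 8B and 32B).
for name, nominal_b in {"gWorld-8B": 8, "gWorld-32B": 32}.items():
    print(f"{name}: {size_ratio(COMPARISON_TOTAL_B, nominal_b):.2f}x")
```

Using the parameter counts from the Model Variants table (9B and 33B) instead would give roughly 44.7x and 12.2x.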

[Figure: Pareto frontier]

Installation

pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium

Requirements

Python 3.10+
CUDA-compatible GPU (8 GPUs recommended for tensor parallelism; the inference example below sets tensor_parallel_size=8)
Dependencies: torch, transformers, vllm, Pillow, playwright, tqdm
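
Before loading the model, it can help to confirm that the listed dependencies are importable. The sketch below is one way to do that; the helper name `missing_modules` is hypothetical, and the import names are assumptions based on each package's usual top-level module (Pillow is imported as `PIL`).

```python
import importlib.util

# Import names for the dependencies listed above (Pillow installs as "PIL").
REQUIRED_MODULES = ["torch", "transformers", "vllm", "PIL", "playwright", "tqdm"]

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be located; it does not verify versions or that the Playwright Chromium browser has been installed.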

Usage

Input Format

gWorld takes two inputs:

Current screenshot - Mobile GUI image
User action - JSON formatted action string (action types vary by benchmark)

Coordinate space: action coordinates are normalized to a [0, 1000] scale.
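
Since actions use normalized coordinates, pixel positions from a screenshot need to be converted before they go into the action string. The sketch below assumes x is scaled by image width and y by image height, which is not spelled out here. The helper names are hypothetical, and the action schema mirrors the TAP example used later on this page; other benchmarks may use different action types.

```python
import json

NORM_MAX = 1000  # upper bound of the normalized coordinate scale

def pixel_to_normalized(x_px, y_px, width, height):
    """Map pixel coordinates to the [0, 1000] scale (assumes per-axis scaling)."""
    x = round(x_px / width * NORM_MAX)
    y = round(y_px / height * NORM_MAX)
    # Clamp so points on or past the image border stay inside the valid range.
    return [min(max(x, 0), NORM_MAX), min(max(y, 0), NORM_MAX)]

def normalized_to_pixel(x_norm, y_norm, width, height):
    """Inverse mapping, e.g. for drawing the tap location on the screenshot."""
    return round(x_norm / NORM_MAX * width), round(y_norm / NORM_MAX * height)

def make_tap_action(x_px, y_px, width, height):
    """Build a JSON action string for a tap at the given pixel position."""
    coords = pixel_to_normalized(x_px, y_px, width, height)
    return json.dumps({"action_type": "TAP", "coordinates": coords})

# A tap at the center of a 1080x2400 screenshot.
print(make_tap_action(540, 1200, 1080, 2400))
```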

Output Format

The model outputs:

Next State Reasoning - Logical explanation of expected changes
HTML Code - Renderable HTML/CSS representing the next GUI state

# Next State Reasoning: <your reasoning about what the next state should look like>
# HTML: <valid_html_code>
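
Given the two-marker format above, the raw model text has to be split into reasoning and HTML before rendering. The following is a minimal sketch, not the project's own parser: `parse_world_model_output` is a hypothetical helper, and the tolerance for a Markdown code fence around the HTML is an assumption about how the model might occasionally wrap its output.

```python
import re

REASONING_MARKER = "# Next State Reasoning:"
HTML_MARKER = "# HTML:"
FENCE = "`" * 3  # Markdown code-fence delimiter, built indirectly so this snippet pastes cleanly

# Matches an optional fenced wrapper around the whole HTML payload.
_FENCED = re.compile(r"^" + FENCE + r"(?:html)?\s*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL)

def parse_world_model_output(text):
    """Split raw model text into (reasoning, html).

    html is None when the HTML marker is absent, so the caller can count
    the sample as a failure instead of rendering unusable content.
    """
    before, marker, after = text.partition(HTML_MARKER)
    if not marker:
        return text.strip(), None
    reasoning = before.replace(REASONING_MARKER, "", 1).strip()
    html = after.strip()
    fenced = _FENCED.match(html)  # assumption: the model may wrap HTML in a fence
    if fenced:
        html = fenced.group(1).strip()
    return reasoning, html
```

The returned `html` string can be passed to a renderer directly; a `None` value signals that the expected marker was missing.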

Inference with vLLM

from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

# Model configuration (choose one)
# For gWorld-8B:
MODEL_PATH = "trillionlabs/gWorld-8B"
BASE_MODEL = "Qwen/Qwen3-VL-8B-Instruct"

# For gWorld-32B:
# MODEL_PATH = "trillionlabs/gWorld-32B"
# BASE_MODEL = "Qwen/Qwen3-VL-32B"

# Image processing settings
MM_PROCESSOR_KWARGS = {
    "max_pixels": 4233600,
    "min_pixels": 3136,
}

# Load model
llm = LLM(
    model=MODEL_PATH,
    tokenizer=BASE_MODEL,
    tensor_parallel_size=8,
    gpu_memory_utilization=0.9,
    max_model_len=19384,
    trust_remote_code=True,
    mm_processor_kwargs=MM_PROCESSOR_KWARGS,
    enable_chunked_prefill=True,
    max_num_batched_tokens=16384,
)

# Load processor for chat template
processor = AutoProcessor.from_pretrained(BASE_MODEL, trust_remote_code=True)

# Prepare input
image = Image.open("screenshot.png")
if image.mode != 'RGB':
    image = image.convert('RGB')

action = '{"action_type": "TAP", "coordinates": [512, 890]}'

# World model prompt template
user_content = f"""You are an expert mobile UI World Model that can accurately predict the next state given an action.
Given a screenshot of a mobile interface and an action, you must generate clean, responsive HTML code that represents the state of the interface AFTER the action is performed.
First generate reasoning about what the next state should look like based on the action.
Afterwards, generate the HTML code representing the next state that logically follows the action.
You will render this HTML in a mobile viewport to see how similar it looks and acts like the mobile screenshot.

Requirements:
1. Provide reasoning about what the next state should look like based on the action
2. Generate complete, valid HTML5 code
3. Choose between using inline CSS and utility classes from Bootstrap, Tailwind CSS, or MUI for styling, depending on which option generates the closest code to the screenshot.
4. Use mobile-first design principles matching screenshot dimensions.
5. For images, use inline SVG placeholders with explicit width and height attributes that match the approximate dimensions from the screenshot. Matching the approximate color is also good.
6. Use modern web standards and best practices
7. Return ONLY the HTML code, no explanations or markdown formatting
8. The generated HTML should render properly in a mobile viewport.
9. Generated HTML should look like the screen that logically follows the current screen and the action.

Action:
{action}

Output format:
# Next State Reasoning: <your reasoning about what the next state should look like>
# HTML: <valid_html_code>

Generate the next state reasoning and the next state in html:"""

# Build messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_content},
        ],
    }
]

# Apply chat template
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generation parameters
sampling_params = SamplingParams(
    max_tokens=15000,
    temperature=0,
    seed=42,
    top_p=1.0,
)

# Generate
outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params=sampling_params
)

print(outputs[0].outputs[0].text)

Rendering HTML Output

Use Playwright to render the generated HTML with proper viewport scaling:

from playwright.sync_api import sync_playwright
from PIL import Image

def get_scale_factor_for_size(ref_width: int, ref_height: int) -> float:
    """Get appropriate scale factor for given image dimensions."""
    size_to_scale = {
        (1080, 2400): 3.0,
        (1440, 3120): 4.0,
        (1440, 3040): 4.0,
        (720, 1280): 2.0,
        (1344, 2992): 3.0,
        (1440, 2960): 4.0,
        (1080, 2280): 3.0,
        (1080, 2160): 3.0,
        (2560, 1600): 2.0,
        (1600, 2560): 2.0,
    }

    if (ref_width, ref_height) in size_to_scale:
        return size_to_scale[(ref_width, ref_height)]

    # Default heuristic for portrait/landscape
    is_portrait = ref_height > ref_width
    if is_portrait:
        for scale in [4.0, 3.0, 2.5, 2.0, 1.5]:
            logical_w = int(ref_width / scale)
            logical_h = int(ref_height / scale)
            if 300 <= logical_w <= 500 and 500 <= logical_h <= 1200:
                return scale
    else:
        for scale in [2.0, 1.5, 3.0]:
            logical_w = int(ref_width / scale)
            logical_h = int(ref_height / scale)
            if 600 <= logical_w <= 1200 and 400 <= logical_h <= 800:
                return scale

    return 2.0

def render_html(html_code: str, reference_image_path: str, output_path: str):
    """Render HTML to image matching reference image dimensions."""
    # Get reference image dimensions (the context manager closes the file handle)
    with Image.open(reference_image_path) as ref_img:
        ref_width, ref_height = ref_img.size

    # Calculate viewport size
    scale_factor = get_scale_factor_for_size(ref_width, ref_height)
    viewport_width = int(ref_width / scale_factor)
    viewport_height = int(ref_height / scale_factor)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            viewport={'width': viewport_width, 'height': viewport_height},
            device_scale_factor=scale_factor
        )
        page = context.new_page()
        page.set_content(html_code)
        page.wait_for_load_state('networkidle')
        page.screenshot(path=output_path, full_page=False)
        browser.close()
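
Because the viewport size is truncated with `int()`, the rendered screenshot can come out slightly smaller than the reference image when its dimensions are not divisible by the scale factor. The sketch below reproduces that arithmetic. The helper names are hypothetical, and it assumes the screenshot size equals the viewport multiplied by `device_scale_factor`.

```python
def viewport_for(ref_width, ref_height, scale_factor):
    """Logical viewport as computed in render_html: reference size / scale, truncated."""
    return int(ref_width / scale_factor), int(ref_height / scale_factor)

def expected_render_size(ref_width, ref_height, scale_factor):
    """Assumed screenshot size: viewport multiplied back by the device scale factor."""
    vw, vh = viewport_for(ref_width, ref_height, scale_factor)
    return round(vw * scale_factor), round(vh * scale_factor)

# 1080x2400 at scale 3.0 divides evenly, so the sizes match.
print(viewport_for(1080, 2400, 3.0), expected_render_size(1080, 2400, 3.0))

# 1344x2992 at scale 3.0 does not: the height is truncated to 997 logical pixels.
print(viewport_for(1344, 2992, 3.0), expected_render_size(1344, 2992, 3.0))
```

If a later comparison needs identical dimensions, one option is to resize the rendered image to the reference size with Pillow's `Image.resize` before comparing.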

Configuration Reference

| Parameter | gWorld-8B | gWorld-32B |
|-----------|-----------|------------|
| tensor_parallel_size | 8 | 8 |
| gpu_memory_utilization | 0.9 | 0.9 |
| max_model_len | 19384 | 19384 |
| max_tokens | 15000 | 15000 |
| temperature | 0 | 0 |
| top_p | 1.0 | 1.0 |
| max_pixels | 4233600 | 4233600 |
| min_pixels | 3136 | 3136 |
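
The context length and generation limit in the table together imply a budget for the prompt (image tokens plus text). The sketch below works that out, on the assumption that `max_model_len` covers both prompt and generated tokens; the helper names are hypothetical.

```python
MAX_MODEL_LEN = 19384  # max_model_len from the table above
MAX_TOKENS = 15000     # max_tokens from the table above

def prompt_token_budget(max_model_len=MAX_MODEL_LEN, max_tokens=MAX_TOKENS):
    """Prompt tokens available if the full generation budget should stay usable."""
    return max_model_len - max_tokens

def leaves_full_generation_budget(prompt_tokens):
    """True when a prompt of this length still leaves room for max_tokens of output."""
    return prompt_tokens <= prompt_token_budget()

print(prompt_token_budget())
```

With these settings, a prompt longer than 4384 tokens would leave less than the full 15000-token generation budget, which could truncate long HTML outputs.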

Evaluation

Evaluation scripts are provided in src/:

# Evaluate on AndroidWorld benchmark
python src/eval_androidworld.py

# Evaluate on KApps (Korean) benchmark
python src/eval_kapps.py

# Evaluate on test splits
python src/eval_test_splits.py

Examples

[Figure: example predictions]

Citation

@misc{koh2026generativevisualcodemobile,
      title={Generative Visual Code Mobile World Models},
      author={Woosung Koh and Sungjun Han and Segyu Lee and Se-Young Yun and Jamin Shin},
      year={2026},
      eprint={2602.01576},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.01576},
}

License

Apache License 2.0
