
<h1 style=" font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Helvetica,Arial,sans-serif; font-size:48px; font-weight:700; line-height:1.25; text-align:center; margin:0 0 24px;"> OpenCUA: Open Foundations for Computer-Use Agents </h1> <p align="center"> &nbsp;&nbsp;🌐 <a href="https://opencua.xlang.ai/">Website</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2508.09123">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤗 <a href="https://huggingface.co/datasets/xlangai/AgentNet">Dataset</a>&nbsp;&nbsp; | &nbsp;&nbsp;🔎 <a href="https://agentnet_data_viewer.xlang.ai/">Data Viewer</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://huggingface.co/collections/xlangai/opencua-open-foundations-for-computer-use-agents-6882014ebecdbbe46074a68d">Model</a>&nbsp;&nbsp; | &nbsp;&nbsp;🔧 <a href="https://agentnet-tool.xlang.ai/">Tool</a>&nbsp;&nbsp; | &nbsp;&nbsp;🎮 <a href="https://huggingface.co/spaces/xlangai/OpenCUA-demo">Model Demo</a>&nbsp;&nbsp; </p> <div align="center"> <img src="assets/images/main_fig.png" width="600" alt="OpenCUA-7B Performance Scaling"> </div> <div style="max-width:900px;margin:0 auto;">

📢 Updates

  • 2026-01-17: 🎉 vLLM now fully supports OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B! Thanks to the Meituan EvoCUA Team for their contributions to vLLM integration. See vLLM Serve for usage instructions.

  • 2025-12-17: You can now view AgentNet dataset trajectories online via AgentNet Data Viewer, or use the code in data/vis/ to visualize your own trajectory data. See vis/README.md for usage instructions. We have also summarized the AgentNet metadata in Metadata json.

  • 2025-11-28: vLLM support for OpenCUA is available at [Model] Add OpenCUA-7B support #29068. Many thanks to lim4349!

  • 2025-10-12: <span style="font-weight:bold">OpenCUA-7B-exl2 is now live!</span> ⚡️
    Thanks to Sujit Vasanth for producing a quantized exllamav2 version of OpenCUA-7B — enabling much faster inference with lower VRAM usage.

  • 2025-10-03: <span style="color:red; font-weight:bold">New OpenCUA model!</span>🔥 OpenCUA-72B now ranks #1 on the OSWorld-Verified leaderboard. It also has strong grounding ability, 37.3% (SOTA) on UI-Vision and 60.8% on ScreenSpot-Pro.

  • 2025-08-13: We released our paper and project page. Check it out!

Introduction

<div style=" max-width: 880px; /* adjust the overall width as needed */ margin: 0 auto; /* center the container */ text-align: justify; /* key: justify both edges */ text-justify: inter-word; /* improves justification for English text */ line-height: 1.6;">

<b>OpenCUA</b> is a comprehensive open-source framework for scaling CUA data and foundation models, consisting of:

  • <b>AgentNet</b>: the first large-scale computer-use task dataset, spanning 3 operating systems and 200+ applications and websites;
  • <b>AgentNetTool</b>: an annotation infrastructure that seamlessly captures human computer-use demonstrations;
  • <b>AgentNetBench</b>: an offline evaluator that benchmarks model-predicted low-level actions against ground-truth trajectories;
  • <b>OpenCUA Models</b>: end-to-end computer-use foundation models that can produce executable actions in computer environments, with strong planning and grounding capabilities.

With the help of the OpenCUA framework, our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, <b>OpenCUA-72B</b> achieves an average success rate of 45.0% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models.

</div>

🚀 Quick Start of OpenCUA Models

<div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;"> <strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B, OpenCUA-72B):</strong>

To align with our training infrastructure, we have modified the model in two places:

<ul style="margin-top: 8px;"> <li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with <strong>1D RoPE</strong>.</li> <li>2. The model uses the same tokenizer and chat template as Kimi-VL.</li> </ul> <p>Do not use the default transformers or vLLM classes to load the model. The tokenizer and chat template must also be kept aligned when training the models.</p> </div>

Installation & Download

First, create an environment and install the required dependencies:

conda create -n opencua python=3.10
conda activate opencua
pip install -r requirement.txt

Download the model weights from Hugging Face:

from huggingface_hub import snapshot_download

# Download the full checkpoint to a local directory
snapshot_download(
    repo_id="xlangai/OpenCUA-7B",
    local_dir="OpenCUA-7B",
    local_dir_use_symlinks=False,
)

🚀 vLLM Serve

We recommend using vLLM for production deployment. This requires vllm>=0.12.0, and the server must be launched with --trust-remote-code.

# OpenCUA-7B (single GPU)
vllm serve xlangai/OpenCUA-7B \
  --trust-remote-code \
  --served-model-name opencua-7b \
  --host 0.0.0.0 \
  --port 8000

# OpenCUA-32B (4 GPUs, tensor parallel)
vllm serve xlangai/OpenCUA-32B \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --served-model-name opencua-32b \
  --host 0.0.0.0 \
  --port 8000

# OpenCUA-72B with data parallelism (tp=2, dp=4 for 4 instances on 8 GPUs)
vllm serve xlangai/OpenCUA-72B \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 \
  --port 8000

Adjust --tensor-parallel-size, --data-parallel-size, and --gpu-memory-utilization based on your hardware configuration.

For more examples and inference code, see model/inference/vllm_inference.py.
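Once a server is up, it speaks vLLM's OpenAI-compatible chat API. As a minimal sketch (not the repo's official client; the served model name `opencua-7b`, the endpoint URL, and the instruction text are assumptions matching the serve command above), a request with an inline screenshot can be built with only the standard library:

```python
import base64
import json
from urllib import request

def build_chat_payload(model: str, instruction: str, png_bytes: bytes) -> dict:
    """Build an OpenAI-compatible chat payload with an inline base64 screenshot."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text", "text": instruction},
                ],
            }
        ],
        "max_tokens": 512,
    }

def query_server(payload: dict,
                 url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload; requires a running `vllm serve` instance."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Build a payload locally (no server needed for this step).
payload = build_chat_payload("opencua-7b", "Click the Submit button.", b"\x89PNG...")
```

The same payload works with the `openai` Python client if you prefer it over raw HTTP.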

🎯 GUI Grounding

First, start the vLLM server (using OpenCUA-7B as an example):

vllm serve xlangai/OpenCUA-7B \
  --trust-remote-code \
  --served-model-name opencua-7b \
  --host 0.0.0.0 \
  --port 8000

Then run the grounding examples:

cd ./model/inference/
python vllm_inference.py

Or run with Hugging Face Transformers (no server required):

python huggingface_inference.py
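Grounding responses typically end in an executable action string in pyautogui style (as used by OSWorld-style harnesses). A hedged sketch of extracting click coordinates from such a string; `parse_click` is a hypothetical helper, and the official parsing lives in the repo's inference code:

```python
import re

def parse_click(action: str):
    """Extract (x, y) from a pyautogui-style click action string.

    Returns None for non-click actions. Hypothetical helper; the
    official parser is part of the OpenCUA/OSWorld inference code.
    """
    m = re.search(r"pyautogui\.click\(\s*x=(\d+)\s*,\s*y=(\d+)\s*\)", action)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

coords = parse_click("pyautogui.click(x=512, y=384)")  # -> (512, 384)
```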

🖥️ Computer Use Agent

OpenCUAAgent is developed in the OSWorld environment on top of the OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long chain-of-thought (CoT) reasoning as an inner monologue, and predicts the next action to execute. By default, OpenCUAAgent keeps 3 images in context and uses the L2 CoT format.
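The bounded screenshot context described above can be sketched as a fixed-size history; this is an illustrative data structure (the class name and interface are invented here, not the agent's actual code), with the default of 3 images:

```python
from collections import deque

class ScreenshotHistory:
    """Keep only the most recent N screenshots for the model context.

    Illustrative sketch of the bounded-history idea; N=3 matches the
    default number of images noted above.
    """
    def __init__(self, max_images: int = 3):
        self.images = deque(maxlen=max_images)

    def add(self, screenshot: bytes) -> None:
        # Oldest screenshot is dropped automatically once full.
        self.images.append(screenshot)

    def context(self) -> list:
        return list(self.images)

history = ScreenshotHistory()
for step in range(5):
    history.add(f"screenshot_{step}".encode())
# After 5 steps, only the last three screenshots remain in context.
```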

Command for running OpenCUA-7B and OpenCUA-32B in OSWorld:

    python run_multienv_opencua.py \
        --headless \
        --observation_type screenshot \
        --model OpenCUA-32B \
        --result_dir ./results --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
        --max_steps 100 \
        --num_envs 30  \
        --coordinate_type qwen25
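The `--coordinate_type qwen25` flag suggests coordinates follow the Qwen2.5-VL convention, i.e. absolute pixels on the (possibly resized) input image. A minimal sketch of mapping such a prediction back to the native screen resolution, assuming the resized and native dimensions are known; the exact resize rule used by the models lives in the repo:

```python
def rescale_point(x: int, y: int, resized_wh: tuple, native_wh: tuple) -> tuple:
    """Map a point predicted on a resized screenshot back to native pixels.

    Sketch only: assumes a simple proportional resize; the repo's
    preprocessing (e.g. Qwen2.5-VL-style smart resize) defines the
    actual resized dimensions.
    """
    rw, rh = resized_wh
    nw, nh = native_wh
    return round(x * nw / rw), round(y * nh / rh)

# A 1280x720 model input mapped back to a 1920x1080 screen:
pt = rescale_point(640, 360, (1280, 720), (1920, 1080))  # -> (960, 540)
```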

Performance

Online Agent Evaluation

OpenCUA models achieve strong performance on OSWorld-Verified. OpenCUA-72B achieves the best performance among all open-source models with an average success rate of 45.0%, outperforming prior baselines by large margins. It also closes the gap to proprietary Claude models.

<div align="center">

| Model | 15 Steps | 50 Steps | 100 Steps |
|-------------------------------|:--------:|:--------:|:---------:|
| *Proprietary* | | | |
| OpenAI CUA | 26.0 | 31.3 | 31.4 |
| Seed 1.5-VL | 27.9 | — | 34.1 |
| Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
| Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
| *Open-Source* | | | |
| Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
| Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
| Kimi-VL-A3B | 9.7 | — | 10.3 |
| UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
| UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
| OpenCUA-7B (Ours) | 24.3 | 27.9 | 26.6 |
| OpenCUA-32B (Ours) | 29.7 | 34.1 | 34.8 |
| **OpenCUA-72B (Ours)** | 39.0 | 44.9 | 45.0 |

</div>

OpenCUA scores are the mean of 3 independent runs.

GUI Grounding Performance

<div align="center">

| Model | OSWorld-G | ScreenSpot-V2 | ScreenSpot-Pro | UI-Vision |
|-------|:---------:|:-------------:|:--------------:|:---------:|
| Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 | 0.85 |
| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 | - |
| UI-TARS-72B | 57.1 | 90.3 | 38.1 | 25.5 |
| OpenCUA-7B | 55.3 | 92.3 | 50.0 | 29.7 |
| OpenCUA-32B | 59.6 | 93.4 | 55.3 | 33.3 |
| OpenCUA-72B | 59.2 | 92.9 | 60.8 | 37.3 |

</div>

AgentNetBench (Offline Evaluation)

<div align="center">

| Model | Coordinate Actions | Content Actions | Function Actions | Average |
|-------|:------------------:|:---------------:|:----------------:|:-------:|
| Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
