
<h1 style=" font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Helvetica,Arial,sans-serif; font-size:48px; font-weight:700; line-height:1.25; text-align:center; margin:0 0 24px;"> OpenCUA: Open Foundations for Computer-Use Agents </h1> <p align="center"> &nbsp;&nbsp;🌐 <a href="https://opencua.xlang.ai/">Website</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2508.09123">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤗 <a href="https://huggingface.co/datasets/xlangai/AgentNet">Dataset</a>&nbsp;&nbsp; | &nbsp;&nbsp;🔎 <a href="https://agentnet_data_viewer.xlang.ai/">Data Viewer</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://huggingface.co/collections/xlangai/opencua-open-foundations-for-computer-use-agents-6882014ebecdbbe46074a68d">Model</a>&nbsp;&nbsp; | &nbsp;&nbsp;🔧 <a href="https://agentnet-tool.xlang.ai/">Tool</a>&nbsp;&nbsp; | &nbsp;&nbsp;🎮 <a href="https://huggingface.co/spaces/xlangai/OpenCUA-demo">Model Demo</a>&nbsp;&nbsp; </p> <div align="center"> <img src="assets/images/main_fig.png" width="600" alt="OpenCUA-7B Performance Scaling"> </div> <div style="max-width:900px;margin:0 auto;">

📢 Updates

  • 2026-01-17: 🎉 vLLM now fully supports OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B! Thanks to the Meituan EvoCUA Team for their contributions to vLLM integration. See vLLM Serve for usage instructions.

  • 2025-12-17: You can now view AgentNet dataset trajectories online via AgentNet Data Viewer, or use the code in data/vis/ to visualize your own trajectory data. See vis/README.md for usage instructions. We have also summarized the AgentNet metadata in Metadata json.

  • 2025-11-28: vLLM support for OpenCUA is available at [Model] Add OpenCUA-7B support #29068. Many thanks to lim4349!

  • 2025-10-12: <span style="font-weight:bold">OpenCUA-7B-exl2 is now live!</span> ⚡️
    Thanks to Sujit Vasanth for producing a quantized exllamav2 version of OpenCUA-7B — enabling much faster inference with lower VRAM usage.

  • 2025-10-03: <span style="color:red; font-weight:bold">New OpenCUA model!</span>🔥 OpenCUA-72B now ranks #1 on the OSWorld-Verified leaderboard. It also has strong grounding ability, 37.3% (SOTA) on UI-Vision and 60.8% on ScreenSpot-Pro.

  • 2025-08-13: We released our paper and project page. Check it out!

Introduction

<div style=" max-width: 880px; /* adjust the overall width as needed */ margin: 0 auto; /* center the container */ text-align: justify; /* key: justify both edges */ text-justify: inter-word; /* improves justification for English text */ line-height: 1.6;">

<b>OpenCUA</b> is a comprehensive open-source framework for scaling CUA data and foundation models, consisting of:

  • <b>AgentNet</b>: the first large-scale computer-use task dataset, spanning 3 operating systems and 200+ applications and websites;
  • <b>AgentNetTool</b>: an annotation infrastructure that seamlessly captures human computer-use demonstrations;
  • <b>AgentNetBench</b>: an offline evaluator that benchmarks model-predicted low-level actions against ground-truth trajectories;
  • <b>OpenCUA Models</b>: end-to-end computer-use foundation models that can produce executable actions in computer environments, with strong planning and grounding capabilities.

With the help of the OpenCUA framework, our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, <b>OpenCUA-72B</b> achieves an average success rate of 45.0% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models.

</div>

🚀 Quick Start of OpenCUA Models

<div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;"> <strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B, OpenCUA-72B):</strong>

To align with our training infrastructure, we have modified the model in two places:

<ul style="margin-top: 8px;"> <li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with <strong>1D RoPE</strong>.</li> <li>2. The model uses the same tokenizer and chat template as Kimi-VL.</li> </ul> <p>Do not use the default transformers or vLLM classes to load the model. The tokenizer and chat template must also be kept aligned when training the models.</p> </div>

Installation & Download

First, create an environment and install the required dependencies:

conda create -n opencua python=3.10
conda activate opencua
pip install -r requirement.txt

Download the model weights from Hugging Face:

from huggingface_hub import snapshot_download

# Download the full checkpoint to a local directory
snapshot_download(
    repo_id="xlangai/OpenCUA-7B",
    local_dir="OpenCUA-7B",
    local_dir_use_symlinks=False,
)

🚀 vLLM Serve

We recommend using vLLM for production deployment. This requires vllm>=0.12.0, and the server must be launched with --trust-remote-code.

# OpenCUA-7B (single GPU)
vllm serve xlangai/OpenCUA-7B \
  --trust-remote-code \
  --served-model-name opencua-7b \
  --host 0.0.0.0 \
  --port 8000

# OpenCUA-32B (4 GPUs, tensor parallel)
vllm serve xlangai/OpenCUA-32B \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --served-model-name opencua-32b \
  --host 0.0.0.0 \
  --port 8000

# OpenCUA-72B with data parallelism (tp=2, dp=4 for 4 instances on 8 GPUs)
vllm serve xlangai/OpenCUA-72B \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 \
  --port 8000

Adjust --tensor-parallel-size, --data-parallel-size, and --gpu-memory-utilization based on your hardware configuration.

For more examples and inference code, see model/inference/vllm_inference.py.
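Once a server is up, it speaks vLLM's OpenAI-compatible chat API. As a minimal sketch (not the repo's official client; the served model name `opencua-7b`, the endpoint URL, and the instruction text are assumptions matching the serve command above), a request with an inline screenshot can be built with only the standard library:

```python
import base64
import json
from urllib import request

def build_chat_payload(model: str, instruction: str, png_bytes: bytes) -> dict:
    """Build an OpenAI-compatible chat payload with an inline base64 screenshot."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text", "text": instruction},
                ],
            }
        ],
        "max_tokens": 512,
    }

def query_server(payload: dict,
                 url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload; requires a running `vllm serve` instance."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Build a payload locally (no server needed for this step).
payload = build_chat_payload("opencua-7b", "Click the Submit button.", b"\x89PNG...")
```

The same payload works with the `openai` Python client if you prefer it over raw HTTP.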

🎯 GUI Grounding

First, start the vLLM server (using OpenCUA-7B as an example):

vllm serve xlangai/OpenCUA-7B \
  --trust-remote-code \
  --served-model-name opencua-7b \
  --host 0.0.0.0 \
  --port 8000

Then run the grounding examples:

cd ./model/inference/
python vllm_inference.py

Or run with Hugging Face Transformers (no server required):

python huggingface_inference.py
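Grounding responses typically end in an executable action string in pyautogui style (as used by OSWorld-style harnesses). A hedged sketch of extracting click coordinates from such a string; `parse_click` is a hypothetical helper, and the official parsing lives in the repo's inference code:

```python
import re

def parse_click(action: str):
    """Extract (x, y) from a pyautogui-style click action string.

    Returns None for non-click actions. Hypothetical helper; the
    official parser is part of the OpenCUA/OSWorld inference code.
    """
    m = re.search(r"pyautogui\.click\(\s*x=(\d+)\s*,\s*y=(\d+)\s*\)", action)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

coords = parse_click("pyautogui.click(x=512, y=384)")  # -> (512, 384)
```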

🖥️ Computer Use Agent

OpenCUAAgent is developed in the OSWorld environment on top of the OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long chain-of-thought (CoT) reasoning as an inner monologue, and predicts the next action to execute. By default, OpenCUAAgent keeps 3 images in context and uses the L2 CoT format.
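The bounded screenshot context described above can be sketched as a fixed-size history; this is an illustrative data structure (the class name and interface are invented here, not the agent's actual code), with the default of 3 images:

```python
from collections import deque

class ScreenshotHistory:
    """Keep only the most recent N screenshots for the model context.

    Illustrative sketch of the bounded-history idea; N=3 matches the
    default number of images noted above.
    """
    def __init__(self, max_images: int = 3):
        self.images = deque(maxlen=max_images)

    def add(self, screenshot: bytes) -> None:
        # Oldest screenshot is dropped automatically once full.
        self.images.append(screenshot)

    def context(self) -> list:
        return list(self.images)

history = ScreenshotHistory()
for step in range(5):
    history.add(f"screenshot_{step}".encode())
# After 5 steps, only the last three screenshots remain in context.
```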

Command for running OpenCUA-7B and OpenCUA-32B in OSWorld:

    python run_multienv_opencua.py \
        --headless \
        --observation_type screenshot \
        --model OpenCUA-32B \
        --result_dir ./results --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
        --max_steps 100 \
        --num_envs 30  \
        --coordinate_type qwen25
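The `--coordinate_type qwen25` flag suggests coordinates follow the Qwen2.5-VL convention, i.e. absolute pixels on the (possibly resized) input image. A minimal sketch of mapping such a prediction back to the native screen resolution, assuming the resized and native dimensions are known; the exact resize rule used by the models lives in the repo:

```python
def rescale_point(x: int, y: int, resized_wh: tuple, native_wh: tuple) -> tuple:
    """Map a point predicted on a resized screenshot back to native pixels.

    Sketch only: assumes a simple proportional resize; the repo's
    preprocessing (e.g. Qwen2.5-VL-style smart resize) defines the
    actual resized dimensions.
    """
    rw, rh = resized_wh
    nw, nh = native_wh
    return round(x * nw / rw), round(y * nh / rh)

# A 1280x720 model input mapped back to a 1920x1080 screen:
pt = rescale_point(640, 360, (1280, 720), (1920, 1080))  # -> (960, 540)
```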

Performance

Online Agent Evaluation

OpenCUA models achieve strong performance on OSWorld-Verified. OpenCUA-72B achieves the best performance among all open-source models with an average success rate of 45.0%, outperforming prior baselines by large margins. It also closes the gap to proprietary Claude models.

<div align="center">

| Model | 15 Steps | 50 Steps | 100 Steps |
|-------------------------------|:--------:|:--------:|:---------:|
| *Proprietary* | | | |
| OpenAI CUA | 26.0 | 31.3 | 31.4 |
| Seed 1.5-VL | 27.9 | — | 34.1 |
| Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
| Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
| *Open-Source* | | | |
| Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
| Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
| Kimi-VL-A3B | 9.7 | — | 10.3 |
| UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
| UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
| OpenCUA-7B (Ours) | 24.3 | 27.9 | 26.6 |
| OpenCUA-32B (Ours) | 29.7 | 34.1 | 34.8 |
| **OpenCUA-72B (Ours)** | 39.0 | 44.9 | 45.0 |

</div>

OpenCUA scores are the mean of 3 independent runs.

GUI Grounding Performance

<div align="center">

| Model | OSWorld-G | ScreenSpot-V2 | ScreenSpot-Pro | UI-Vision |
|-------|:---------:|:-------------:|:--------------:|:---------:|
| Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 | 0.85 |
| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 | - |
| UI-TARS-72B | 57.1 | 90.3 | 38.1 | 25.5 |
| OpenCUA-7B | 55.3 | 92.3 | 50.0 | 29.7 |
| OpenCUA-32B | 59.6 | 93.4 | 55.3 | 33.3 |
| OpenCUA-72B | 59.2 | 92.9 | 60.8 | 37.3 |

</div>

AgentNetBench (Offline Evaluation)

<div align="center">

| Model | Coordinate Actions | Content Actions | Function Actions | Average |
|-------|:------------------:|:---------------:|:----------------:|:-------:|
| Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
