WindowsAgentArena



Windows Agent Arena (WAA) 🪟 is a scalable Windows AI agent platform for testing and benchmarking multi-modal, desktop AI agents. WAA provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, where agentic AI workflows can be tested across a diverse range of tasks.

WAA supports the deployment of agents at scale using the Azure ML cloud infrastructure, allowing for the parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes, not days.

<div align="center"> <video src="https://github.com/user-attachments/assets/e0a8d88d-d28a-493d-b74f-2455f36c21f1" alt="waa_intro"> </div>

📢 Updates

  • 2024-11-10: We added a new difficulty mode for Windows Agent Arena! You can try the new harder difficulty mode by changing the default diff_lvl="normal" to diff_lvl="hard" in src/win-arena-container/start_client.sh. Under the harder difficulty, in many tasks, agents must also learn to initialize/set up the task themselves (e.g., finding and opening the right program/application for the task) rather than have the task "set up" for them by the task config.
  • 2024-10-30: We released the code for our Navi agent with Omniparser! For the top performing mode in the paper, run ./run-local.sh --som-origin mixed-omni --gpu-enabled true
  • 2024-10-23: Microsoft open-sourced Omniparser, the current top performing screen understanding model in our benchmark.
  • 2024-09-13: We released our paper, code, project page, and blog post. Check it out!

📚 Citation

Our technical report paper can be found here. If you find this environment useful, please consider citing our work:

@article{bonatti2024windows,
author = {Bonatti, Rogerio and Zhao, Dan and Bonacci, Francesco and Dupont, Dillon and Abdali, Sara and Li, Yinheng and Wagle, Justin and Koishida, Kazuhito and Bucker, Arthur and Jang, Lawrence and Hui, Zack},
title = {Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale},
institution = {Microsoft},
year = {2024},
month = {September}, 
}

☝️ Pre-requisites:

<div align="center"> <img src="img/main.png" alt="main" height="200"/> </div>
  • Docker daemon installed and running. On Windows, we recommend using Docker with WSL 2.
  • An OpenAI or Azure OpenAI API Key.
  • Python 3.9. We recommend using Conda to create a dedicated Python environment for running the scripts. To create one, run conda create -n winarena python=3.9.

Clone the repository and install dependencies:

git clone https://github.com/microsoft/WindowsAgentArena.git
cd WindowsAgentArena
# Install the required dependencies in your python environment
# conda activate winarena
pip install -r requirements.txt
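
The prerequisites above can be checked before going further. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the Docker check only confirms that a `docker` executable is on the PATH, not that the daemon is running.

```python
import shutil
import sys


def python_version_ok(version_info=sys.version_info, required=(3, 9)):
    """Return True when the interpreter's major.minor matches the required version."""
    return tuple(version_info[:2]) == tuple(required)


def docker_cli_found(which=shutil.which):
    """Return True when a `docker` executable is on PATH.

    This only shows that the CLI is installed; it does not prove the daemon is up.
    """
    return which("docker") is not None


def preflight_report():
    """Collect both checks in a dict so the caller can print or inspect them."""
    return {
        "python_3_9": python_version_ok(),
        "docker_cli": docker_cli_found(),
    }
```

Calling `print(preflight_report())` inside the activated environment shows which of the two checks pass.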

💻 Local deployment (WSL or Linux)

1. Configuration file

Create a new config.json at the root of the project with the necessary keys. Set OPENAI_API_KEY if you are using the OpenAI endpoint, or AZURE_API_KEY and AZURE_ENDPOINT if you are using an Azure endpoint:

{
    "OPENAI_API_KEY": "<OPENAI_API_KEY>",
    "AZURE_API_KEY": "<AZURE_API_KEY>",
    "AZURE_ENDPOINT": "https://yourendpoint.openai.azure.com/"
}
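
A quick sanity check of config.json can catch a missing key before a long run starts. This is a sketch under stated assumptions: the helper names are hypothetical, and it assumes that either OPENAI_API_KEY alone or AZURE_API_KEY together with AZURE_ENDPOINT is sufficient. Note that Python's standard json module rejects `//` comments and trailing commas, so a file checked this way has to be plain JSON.

```python
import json


def is_set(config, key):
    """Treat empty values and untouched "<...>" placeholders as unset."""
    value = str(config.get(key, "")).strip()
    return bool(value) and not (value.startswith("<") and value.endswith(">"))


def has_usable_endpoint(config):
    """Assumption: one complete set of credentials is enough to run the agent."""
    openai_ready = is_set(config, "OPENAI_API_KEY")
    azure_ready = is_set(config, "AZURE_API_KEY") and is_set(config, "AZURE_ENDPOINT")
    return openai_ready or azure_ready


def load_config(path="config.json"):
    """Parse the file as strict JSON; raises json.JSONDecodeError on comments."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `has_usable_endpoint(load_config())` run from the project root reports whether at least one endpoint is fully configured.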

2. Prepare the Windows Arena Docker Image

2.1 Pull the WinArena-Base Image from Docker Hub

To get started, pull the base image from Docker Hub:

docker pull windowsarena/winarena-base:latest

This image includes all the necessary dependencies (such as packages and models) required to run the code in the src directory.

2.2 Build the WinArena Image Locally

Next, build the WinArena image locally:

cd scripts
./build-container-image.sh

# If 'Dockerfile-WinArena-Base' has changed, use the --build-base-image flag to also build the base image locally
# ./build-container-image.sh --build-base-image true

# For other build options:
# ./build-container-image.sh --help

This will create the windowsarena/winarena:latest image with the latest code from the src directory.

3. Prepare the Windows 11 VM

<div align="center"> <video src="https://github.com/user-attachments/assets/6d55b9b5-3242-49af-be20-64f2086108b9" height="500" alt="local_prepare_golden_image"> </div>

3.1 Download Windows 11 Evaluation .iso file:

  1. Visit Microsoft Evaluation Center, accept the Terms of Service, and download a Windows 11 Enterprise Evaluation (90-day trial, English, United States) ISO file [~6GB]
  2. After downloading, rename the file to setup.iso and copy it to the directory WindowsAgentArena/src/win-arena-container/vm/image
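
The rename-and-copy step can also be scripted. The sketch below is illustrative only: `place_iso` is a hypothetical helper, and the only details taken from the instructions above are the target file name setup.iso and the src/win-arena-container/vm/image directory.

```python
import shutil
from pathlib import Path


def place_iso(downloaded_iso, repo_root):
    """Copy the downloaded ISO to <repo_root>/src/win-arena-container/vm/image/setup.iso."""
    source = Path(downloaded_iso)
    if not source.is_file():
        raise FileNotFoundError(f"ISO not found: {source}")

    image_dir = Path(repo_root) / "src" / "win-arena-container" / "vm" / "image"
    image_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

    target = image_dir / "setup.iso"
    shutil.copy2(source, target)  # copy rather than move, so the download is kept
    return target
```

A call such as `place_iso("/path/to/downloaded.iso", "WindowsAgentArena")` returns the path of the copied file; both arguments here are placeholders.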

3.2 Automatic Setup of the Windows 11 golden image:

Before running the arena, you need to prepare a new WAA snapshot (also referred to as the WAA golden image). This 30GB snapshot represents a fully functional Windows 11 VM with all the programs needed to run the benchmark. The VM also hosts a Python server that receives and executes agent commands. To learn more about the components at play, see our local and cloud components diagrams.

To prepare the golden image, run the following once:

cd ./scripts
./run-local.sh --prepare-image true

You can monitor progress at http://localhost:8006. The preparation process is fully automated and will take ~20 minutes.

Please do not interfere with the VM while it is being prepared. It will automatically shut down when the provisioning process is complete.
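
To watch the monitoring page without touching the VM, one option is to poll the port from a separate terminal. This is a rough sketch: the function names, retry count, and delay are arbitrary assumptions, and an open port only means something is listening on 8006, not that provisioning has finished.

```python
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host="localhost", port=8006, attempts=30, delay=10.0,
                  probe=port_is_open, sleep=time.sleep):
    """Probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(host, port):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

`wait_for_port()` returns True once the page at http://localhost:8006 becomes reachable, or False after the last attempt.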

<div align="center"> <img src="img/local_prepare_screen_unattend.png" alt="local_prepare_screen_unattend" height="500"/> </div> <div align="center"> <img src="img/local_prepare_screen_setup.png" alt="local_prepare_screen_setup" height="500"/> </div>

At the end, the Docker container named winarena should terminate gracefully, as shown in the logs below.

<div align="center"> <img src="img/local_prepare_logs_successful.png" alt="local_prepare_logs_successful" height="200"/> </div> <br/>

You will find the 30GB WAA golden image in WindowsAgentArena/src/win-arena-container/vm/storage, consisting of the following files:

<div align="center"> <img src="img/local_prepare_storage_successful.png" alt="run_local_prepare_storage_successful" height="200"/> </div> <br/>
Additional Notes
  • During development, if you want changes made in the src/win-arena-container directory to be included in the WAA golden image, pass the flag --skip-build false to the run-local.sh script (it defaults to true). This ensures that a new container image is built instead of using the prebuilt windowsarena/winarena:latest image.
  • If you have previously run the installation process and want to start again from scratch, make sure to delete the contents of the storage directory.
  • We recommend copying the storage folder to a safe location outside of the repository, so that you can avoid a fresh setup if you or the agent accidentally corrupt the VM at some point.
  • Depending on your Docker settings, you might have to run the above command with sudo.
  • Running on WSL2? If you encounter the error /bin/bash: bad interpreter: No such file or directory, we recommend converting the bash scripts from DOS/Windows format to Unix format:
cd ./scripts
find . -maxdepth 1 -type f -exec dos2unix {} +
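
The backup recommended in the notes above could look like the sketch below. The helper names and the backup location are hypothetical; the only detail taken from this guide is that the golden image lives in src/win-arena-container/vm/storage. It refuses to overwrite an existing backup, which is a deliberate choice to avoid clobbering a known-good copy.

```python
import shutil
from pathlib import Path


def directory_size_bytes(directory):
    """Sum the sizes of all regular files below `directory`."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())


def backup_storage(storage_dir, backup_dir):
    """Copy the golden image folder to `backup_dir` and return the copied size in bytes."""
    storage = Path(storage_dir)
    backup = Path(backup_dir)
    if not storage.is_dir() or not any(storage.iterdir()):
        raise FileNotFoundError(f"No golden image found in {storage}")
    if backup.exists():
        raise FileExistsError(f"Backup target already exists: {backup}")
    shutil.copytree(storage, backup)  # creates backup_dir and any missing parents
    return directory_size_bytes(backup)
```

The returned size can be compared with the roughly 30GB mentioned above as a coarse check that the copy is complete.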

4. Deploying the agent in the arena

4.1 Running the base benchmark

You're now ready to launch the evaluation. To run the baseline agent on all benchmark tasks, do:

cd scripts
./run-local.sh
# For client/agent options:
# ./run-local.sh --help

Open http://localhost:8006 to see the Windows VM with the agent running. If you have a beefy PC, you can instead run the strongest agent configuration in our paper by doing:

./run-local.sh --gpu-enabled true --som-origin mixed-omni --a11y-backend uia
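
When launching runs from another script, the command line can be assembled programmatically. This is a minimal sketch with hypothetical helper names; it only uses flags that appear in this guide (--gpu-enabled, --som-origin, --a11y-backend, --prepare-image, --skip-build) and leaves out any flag that is not set, so the script's own defaults apply.

```python
import subprocess


def build_run_local_command(gpu_enabled=None, som_origin=None, a11y_backend=None,
                            prepare_image=None, skip_build=None):
    """Build the argument list for ./run-local.sh, skipping options left as None."""
    options = [
        ("--gpu-enabled", gpu_enabled),
        ("--som-origin", som_origin),
        ("--a11y-backend", a11y_backend),
        ("--prepare-image", prepare_image),
        ("--skip-build", skip_build),
    ]
    command = ["./run-local.sh"]
    for flag, value in options:
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"  # the script takes true/false strings
        command += [flag, str(value)]
    return command


def run_benchmark(scripts_dir, **options):
    """Run the script from the scripts directory; raises if it exits non-zero."""
    return subprocess.run(build_run_local_command(**options), cwd=scripts_dir, check=True)
```

For instance, `build_run_local_command(gpu_enabled=True, som_origin="mixed-omni", a11y_backend="uia")` produces the same arguments as the command shown above.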

At the end of the run you can display the results using the command:

cd src/win-arena-container/client
python show_results.py