WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable Windows AI agent platform for testing and benchmarking multi-modal, desktop AI agents. WAA provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, where agentic AI workflows can be tested across a diverse range of tasks.
WAA supports the deployment of agents at scale using the Azure ML cloud infrastructure, allowing for the parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes, not days.
<div align="center"> <video src="https://github.com/user-attachments/assets/e0a8d88d-d28a-493d-b74f-2455f36c21f1" alt="waa_intro"> </div>
📢 Updates
- 2024-11-10: We added a new difficulty mode for Windows Agent Arena! You can try the harder mode by changing the default diff_lvl="normal" to diff_lvl="hard" in src/win-arena-container/start_client.sh. Under the harder difficulty, in many tasks, agents must also learn to initialize/set up the task themselves (e.g., finding and opening the right program/application for the task) rather than have the task "set up" for them by the task config.
- 2024-10-30: We released the code for our Navi agent with Omniparser! For the top performing mode in the paper, run ./run-local.sh --som-origin mixed-omni --gpu-enabled true
- 2024-10-23: Microsoft open-sourced Omniparser, the current top performing screen understanding model in our benchmark.
- 2024-09-13: We released our paper, code, project page, and blog post. Check it out!
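The difficulty switch described in the 2024-11-10 update can also be scripted. The following is a minimal sketch, assuming the script contains a literal diff_lvl="..." assignment as the update describes; set_difficulty is a hypothetical helper name, not part of the repository.

```python
import re


def set_difficulty(script_text: str, level: str = "hard") -> str:
    """Return script_text with the first diff_lvl="..." assignment set to level.

    Only "normal" and "hard" are accepted because those are the two values
    the update note mentions; other values may exist but are not assumed here.
    """
    if level not in {"normal", "hard"}:
        raise ValueError(f"unsupported difficulty: {level!r}")
    new_text, count = re.subn(
        r'diff_lvl="[^"]*"', f'diff_lvl="{level}"', script_text, count=1
    )
    if count == 0:
        raise ValueError('no diff_lvl="..." assignment found')
    return new_text


# Example usage (run from the repository root):
#   from pathlib import Path
#   script = Path("src/win-arena-container/start_client.sh")
#   script.write_text(set_difficulty(script.read_text(), "hard"))
```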
📚 Citation
Our technical report paper can be found here. If you find this environment useful, please consider citing our work:
@article{bonatti2024windows,
author = {Bonatti, Rogerio and Zhao, Dan and Bonacci, Francesco and Dupont, Dillon and Abdali, Sara and Li, Yinheng and Wagle, Justin and Koishida, Kazuhito and Bucker, Arthur and Jang, Lawrence and Hui, Zack},
title = {Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale},
institution = {Microsoft},
year = {2024},
month = {September},
}
☝️ Pre-requisites:
<div align="center"> <img src="img/main.png" alt="main" height="200"/> </div>
- Docker daemon installed and running. On Windows, we recommend using Docker with WSL 2.
- An OpenAI or Azure OpenAI API Key.
- Python 3.9. We recommend using Conda to create a dedicated Python environment for running the scripts. To create a new environment, run:
conda create -n winarena python=3.9
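Before cloning, the prerequisites above can be sanity-checked from Python. This is only a sketch: check_prerequisites is a hypothetical helper, and it verifies that a docker executable is on PATH, not that the Docker daemon is actually running or that an API key is configured.

```python
import shutil
import sys


def check_prerequisites(version=None, which=shutil.which):
    """Return a list of problems found; an empty list means the checks passed.

    version: (major, minor) tuple, defaults to the running interpreter.
    which:   lookup function for executables, injectable for testing.
    """
    major, minor = tuple(version or sys.version_info[:2])[:2]
    problems = []
    if (major, minor) != (3, 9):
        problems.append(f"Python 3.9 expected, found {major}.{minor}")
    if which("docker") is None:
        problems.append("docker executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```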
Clone the repository and install dependencies:
git clone https://github.com/microsoft/WindowsAgentArena.git
cd WindowsAgentArena
# Install the required dependencies in your python environment
# conda activate winarena
pip install -r requirements.txt
💻 Local deployment (WSL or Linux)
1. Configuration file
Create a new config.json at the root of the project with the keys for the endpoint you use: OPENAI_API_KEY for an OpenAI endpoint, or AZURE_API_KEY and AZURE_ENDPOINT for an Azure endpoint:
{
  "OPENAI_API_KEY": "<OPENAI_API_KEY>",
  "AZURE_API_KEY": "<AZURE_API_KEY>",
  "AZURE_ENDPOINT": "https://yourendpoint.openai.azure.com/"
}
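If you prefer to generate the file rather than edit it by hand, the sketch below builds the same structure and serializes it as strict JSON. build_config and render_config are hypothetical helper names, and the sketch assumes that a file holding only the keys for one provider is acceptable; the key names themselves come from the example above.

```python
import json


def build_config(openai_key=None, azure_key=None, azure_endpoint=None):
    """Collect the keys for whichever endpoint is in use into a dict."""
    config = {}
    if openai_key:
        config["OPENAI_API_KEY"] = openai_key
    if azure_key or azure_endpoint:
        # An Azure endpoint needs both the key and the endpoint URL.
        if not (azure_key and azure_endpoint):
            raise ValueError("Azure needs both AZURE_API_KEY and AZURE_ENDPOINT")
        config["AZURE_API_KEY"] = azure_key
        config["AZURE_ENDPOINT"] = azure_endpoint
    if not config:
        raise ValueError("provide an OpenAI key, or an Azure key and endpoint")
    return config


def render_config(config):
    """Serialize the config as strict JSON (no comments, no trailing comma)."""
    return json.dumps(config, indent=2) + "\n"


# Example usage (run from the repository root):
#   from pathlib import Path
#   Path("config.json").write_text(render_config(build_config(openai_key="...")))
```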
2. Prepare the Windows Arena Docker Image
2.1 Pull the WinArena-Base Image from Docker Hub
To get started, pull the base image from Docker Hub:
docker pull windowsarena/winarena-base:latest
This image includes all the necessary dependencies (such as packages and models) required to run the code in the src directory.
2.2 Build the WinArena Image Locally
Next, build the WinArena image locally:
cd scripts
./build-container-image.sh
# If there are any changes in 'Dockerfile-WinArena-Base', use the --build-base-image flag to build also the base image locally
# ./build-container-image.sh --build-base-image true
# For other build options:
# ./build-container-image.sh --help
This will create the windowsarena/winarena:latest image with the latest code from the src directory.
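To confirm that the build produced the image, you can ask Docker whether it exists locally. The sketch below relies on docker image inspect exiting with a non-zero status when the image is missing; image_exists is a hypothetical helper name, and the call raises FileNotFoundError if the docker CLI itself is not installed.

```python
import subprocess


def image_exists(image="windowsarena/winarena:latest", run=subprocess.run):
    """Return True if the given image is present in the local Docker store.

    run is injectable so the check can be exercised without Docker.
    """
    result = run(
        ["docker", "image", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("image present:", image_exists())
```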
3. Prepare the Windows 11 VM
<div align="center"> <video src="https://github.com/user-attachments/assets/6d55b9b5-3242-49af-be20-64f2086108b9" height="500" alt="local_prepare_golden_image"> </div>
3.1 Download Windows 11 Evaluation .iso file:
- Visit Microsoft Evaluation Center, accept the Terms of Service, and download a Windows 11 Enterprise Evaluation (90-day trial, English, United States) ISO file [~6GB]
- After downloading, rename the file to setup.iso and copy it to the directory WindowsAgentArena/src/win-arena-container/vm/image
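The rename-and-copy step can be done in one call. This is a sketch under the assumption that the repository was cloned into a directory you pass as repo_root; place_iso is a hypothetical helper, and the target path is the one given above.

```python
import shutil
from pathlib import Path

# Location of the ISO relative to the repository root, as stated above.
IMAGE_DIR = Path("src") / "win-arena-container" / "vm" / "image"


def place_iso(downloaded_iso, repo_root):
    """Copy the downloaded ISO into the repository as setup.iso.

    Returns the path of the copied file. The original download is kept.
    """
    source = Path(downloaded_iso)
    if not source.is_file():
        raise FileNotFoundError(source)
    target_dir = Path(repo_root) / IMAGE_DIR
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "setup.iso"
    shutil.copy2(source, target)
    return target


# Example usage (the download file name is a placeholder):
#   place_iso("/path/to/downloaded.iso", "WindowsAgentArena")
```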
3.2 Automatic Setup of the Windows 11 golden image:
Before running the arena, you need to prepare a new WAA snapshot (also referred to as the WAA golden image). This 30GB snapshot represents a fully functional Windows 11 VM with all the programs needed to run the benchmark. This VM additionally hosts a Python server which receives and executes agent commands. To learn more about the components at play, see our local and cloud components diagrams.
To prepare the golden image, run once:
cd ./scripts
./run-local.sh --prepare-image true
You can monitor progress at http://localhost:8006. The preparation process is fully automated and will take ~20 minutes.
Please do not interfere with the VM while it is being prepared. It will automatically shut down when the provisioning process is complete.
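If you would rather not watch the browser view, a script can wait for the container to stop on its own. The sketch below polls docker ps for a running container named winarena; the helper names, the 30-second polling interval, and the one-hour timeout are assumptions chosen for illustration, not values from the project.

```python
import subprocess
import time


def container_running(name="winarena", run=subprocess.run):
    """Return True if a running container has exactly this name."""
    result = run(
        ["docker", "ps", "--filter", f"name={name}", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    # The name filter matches substrings, so compare the names exactly.
    return name in result.stdout.split()


def wait_for_shutdown(name="winarena", poll_seconds=30.0, timeout_seconds=3600.0,
                      is_running=container_running, sleep=time.sleep):
    """Block until the container stops; return False if the timeout expires."""
    waited = 0.0
    while is_running(name):
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True
```

Calling wait_for_shutdown() after starting the preparation returns True once the container is no longer listed as running.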
<div align="center"> <img src="img/local_prepare_screen_unattend.png" alt="local_prepare_screen_unattend" height="500"/> </div>
<div align="center"> <img src="img/local_prepare_screen_setup.png" alt="local_prepare_screen_setup" height="500"/> </div>
At the end, you should expect the Docker container named winarena to gracefully terminate, as shown in the logs below.
You will find the 30GB WAA golden image in WindowsAgentArena/src/win-arena-container/vm/storage, consisting of the following files:
Additional Notes
- During development, if you want changes made in the src/win-arena-container directory to be included in the WAA golden image, pass the flag --skip-build false to the run-local.sh script (it defaults to true). This ensures that a new container image is built instead of using the prebuilt windowsarena/winarena:latest image.
- If you have previously run an installation process and want to start again from scratch, make sure to delete the contents of storage.
- We recommend copying this storage folder to a safe location outside of the repository, so that you can avoid a fresh setup if you or the agent accidentally corrupt the VM at some point.
- Depending on your Docker settings, you might have to run the above command with sudo.
- Running on WSL2? If you encounter the error /bin/bash: bad interpreter: No such file or directory, we recommend converting the bash scripts from DOS/Windows format to Unix format:
cd ./scripts
find . -maxdepth 1 -type f -exec dos2unix {} +
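If dos2unix is not installed, the same line-ending conversion can be approximated in Python. This is a rough sketch and not a full replacement: convert_to_unix is a hypothetical helper, and unlike dos2unix it does not skip binary files, so point it only at a directory of text scripts.

```python
from pathlib import Path


def convert_to_unix(directory):
    """Rewrite CRLF line endings as LF for files directly inside directory.

    Subdirectories are not visited, mirroring the -maxdepth 1 option above.
    Returns the names of the files that were changed.
    """
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        fixed = data.replace(b"\r\n", b"\n")
        if fixed != data:
            path.write_bytes(fixed)
            changed.append(path.name)
    return changed


# Example usage (run from the repository root):
#   convert_to_unix("scripts")
```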
4. Deploying the agent in the arena
4.1 Running the base benchmark
You're now ready to launch the evaluation. To run the baseline agent on all benchmark tasks, run:
cd scripts
./run-local.sh
# For client/agent options:
# ./run-local.sh --help
Open http://localhost:8006 to see the Windows VM with the agent running. If you have a beefy PC, you can instead run the strongest agent configuration in our paper by doing:
./run-local.sh --gpu-enabled true --som-origin mixed-omni --a11y-backend uia
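When launching runs from a script, it can help to assemble the command line from named options. The sketch below covers only the flags that appear in this README; build_run_command is a hypothetical helper, and ./run-local.sh --help remains the authoritative list of options and accepted values.

```python
def build_run_command(prepare_image=None, skip_build=None, gpu_enabled=None,
                      som_origin=None, a11y_backend=None):
    """Build the argument list for run-local.sh from the options given.

    Options left as None are omitted so the script's defaults apply.
    Booleans are rendered as the strings "true" and "false".
    """
    command = ["./run-local.sh"]
    options = [
        ("--prepare-image", prepare_image),
        ("--skip-build", skip_build),
        ("--gpu-enabled", gpu_enabled),
        ("--som-origin", som_origin),
        ("--a11y-backend", a11y_backend),
    ]
    for flag, value in options:
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        command += [flag, str(value)]
    return command


# Example usage (run from the repository root):
#   import subprocess
#   cmd = build_run_command(gpu_enabled=True, som_origin="mixed-omni", a11y_backend="uia")
#   subprocess.run(cmd, cwd="scripts", check=True)
```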
At the end of the run you can display the results using the command:
cd src/win-arena-container/client
python show_results.py
