AgentLab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
🛠️ Setup | 🤖 Assistant | 🚀 Launch Experiments | 🔍 Analyse Results | <br> 🏆 Leaderboard | 🤖 Build Your Agent | ↻ Reproducibility | 💪 BrowserGym
<img src="https://github.com/user-attachments/assets/47a7c425-9763-46e5-be54-adac363be850" alt="agentlab-diagram" width="700"/>

> [!WARNING]
> AgentLab is meant to provide an open, easy-to-use, and extensible framework to accelerate the field of web agent research. It is not meant to be a consumer product. Use with caution!
AgentLab is a framework for developing and evaluating agents on a variety of benchmarks supported by BrowserGym. It is presented in more detail in our BrowserGym ecosystem paper.
AgentLab Features:
- Easy large-scale parallel agent experiments using Ray
- Building blocks for making agents on top of BrowserGym
- Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted models using TGI
- Preferred way to run benchmarks like WebArena
- Various reproducibility features
- Unified leaderboard
🎯 Supported Benchmarks
| Benchmark | Setup <br> Link | # Task <br> Templates | Seed <br> Diversity | Max <br> Steps | Multi-tab | Hosting Method | BrowserGym <br> Leaderboard |
|-----------|-----------------|-----------------------|---------------------|----------------|-----------|----------------|-----------------------------|
| WebArena | setup | 812 | None | 30 | yes | self hosted (docker) | soon |
| WebArena-Verified | setup | 812 | None | 30 | yes | self hosted | soon |
| WorkArena L1 | setup | 33 | High | 30 | no | demo instance | soon |
| WorkArena L2 | setup | 341 | High | 50 | no | demo instance | soon |
| WorkArena L3 | setup | 341 | High | 50 | no | demo instance | soon |
| WebLinx | - | 31586 | None | 1 | no | self hosted (dataset) | soon |
| VisualWebArena | setup | 910 | None | 30 | yes | self hosted (docker) | soon |
| AssistantBench | setup | 214 | None | 30 | yes | live web | soon |
| GAIA (soon) | - | - | None | - | - | live web | soon |
| Mind2Web-live (soon) | - | - | None | - | - | live web | soon |
| MiniWoB | setup | 125 | Medium | 10 | no | self hosted (static files) | soon |
| OSWorld | setup | 369 | None | - | - | self hosted | soon |
| TimeWarp | setup | 1386 | None | 30 | yes | self hosted | soon |
🛠️ Setup AgentLab
AgentLab requires Python 3.11 or 3.12.

```bash
pip install agentlab
```
If not done already, install Playwright:

```bash
playwright install
```
Make sure to prepare the required benchmark according to the instructions provided in the setup column.

```bash
export AGENTLAB_EXP_ROOT=<root directory of experiment results>  # defaults to $HOME/agentlab_results
export OPENAI_API_KEY=<your openai api key>  # if OpenAI models are used
```
<details>
<summary>Setup OpenRouter API</summary>
```bash
export OPENROUTER_API_KEY=<your openrouter api key>  # if OpenRouter models are used
```
</details>
<details>
<summary>Setup Azure API</summary>
```bash
export AZURE_OPENAI_API_KEY=<your azure api key>  # if Azure models are used
export AZURE_OPENAI_ENDPOINT=<your endpoint>      # if Azure models are used
```
</details>
🤖 UI-Assistant
Use an assistant to work for you (at your own cost and risk).
```bash
agentlab-assistant --start_url https://www.google.com
```
Try your own agent:
```bash
agentlab-assistant --agent_config="module.path.to.your.AgentArgs"
```
🚀 Launch experiments
```python
# Import your agent configuration, which extends the bgym.AgentArgs class.
# Make sure this object is imported from a module accessible in PYTHONPATH
# so it can be properly unpickled.
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(
    benchmark="miniwob",  # or "webarena", "workarena_l1", ...
    agent_args=[AGENT_4o_MINI],
    comment="My first study",
)

study.run(n_jobs=5)
```
Relaunching incomplete or errored tasks
```python
from agentlab.experiments.study import Study

study = Study.load("/path/to/your/study/dir")
study.find_incomplete(include_errors=True)
study.run()
```
See main.py to launch experiments with a variety of options. It works like a lazy CLI that is actually more convenient: comment or uncomment the lines you need, or modify them at will (but don't push your changes to the repo).
Job Timeouts
The complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang, which ties up a worker until the study is terminated and relaunched. If you are running jobs sequentially or with a small number of workers, a single hung job can stall your entire study until you manually kill and relaunch it. With the Ray parallel backend, we've implemented a mechanism that automatically terminates jobs exceeding a specified timeout. This feature is particularly useful when hanging tasks would otherwise limit your experiments.
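The idea can be illustrated with a stdlib-only sketch (this is not AgentLab's actual Ray implementation; `job` and `run_with_timeout` are hypothetical names): a job that does not complete within the deadline is flagged as timed out instead of blocking the study indefinitely.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def job(seconds):
    """Stand-in for an agent episode that may hang."""
    time.sleep(seconds)
    return "done"

def run_with_timeout(fn, arg, timeout):
    # Run the job in a worker and give up waiting after `timeout` seconds.
    # (A real implementation, like the Ray backend, would also kill the
    # worker process; a thread can only be abandoned, not terminated.)
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, arg)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return "timeout"
    finally:
        executor.shutdown(wait=False)

print(run_with_timeout(job, 0.05, 1.0))  # fast job finishes normally
print(run_with_timeout(job, 1.0, 0.2))   # slow job is flagged as timed out
```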
Debugging
For debugging, run experiments with `n_jobs=1` and use VSCode's debug mode. This allows you to pause execution at breakpoints.
About Parallel Jobs
Running one agent on one task corresponds to a single job. Conducting ablation studies or random searches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. Efficient parallel execution is therefore critical. Agents typically wait for responses from the LLM server or updates from the web server. As a result, you can run 10–50 jobs in parallel on a single computer, depending on available RAM.
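As a back-of-envelope sketch (illustrative numbers, not from a real study), the job count multiplies quickly:

```python
# One job = one agent configuration on one task with one seed,
# so the total is the product of the three counts.
n_agents = 4     # e.g. ablation variants of one agent (hypothetical)
n_tasks = 812    # e.g. WebArena task templates
n_seeds = 3
n_jobs = n_agents * n_tasks * n_seeds
print(n_jobs)  # 9744 -- already near the 10,000-job scale
```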
⚠️ Note for (Visual)WebArena: These benchmarks have task dependencies designed to minimize "corrupting" the instance between tasks. For example, an agent on task 323 could alter the instance state, making task 201 impossible. To address this, the Ray backend accounts for task dependencies, enabling some degree of parallelism. On WebArena, you can disable dependencies to increase parallelism, but this might reduce performance by 1–2%.
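Dependency-aware scheduling of this kind can be sketched with Python's stdlib `graphlib` (the task IDs and edges below are hypothetical, not the real WebArena dependency graph): tasks whose prerequisites have finished form a batch that can safely run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: node -> set of prerequisite tasks.
# e.g. task 323 must not start before task 201 has finished.
deps = {201: set(), 202: set(), 323: {201}, 400: {202}}

ts = TopologicalSorter(deps)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose prerequisites are all done
    batches.append(ready)           # each batch can run in parallel
    ts.done(*ready)
print(batches)  # [[201, 202], [323, 400]]
```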
⚠️ Instance Reset for (Visual)WebArena: Before evaluating an agent, the instance is automatically reset, a process that takes about 5 minutes. When evaluating multiple agents, the `make_study` function returns a `SequentialStudies` object to ensure proper sequential evaluation of each agent. AgentLab currently does not support evaluations across multiple instances, but you could either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel experience, consider using benchmarks like WorkArena instead.
🔍 Analyse Results
Loading Results
The `ExpResult` class provides a lazy loader for all the information of a specific experiment. You can use `yield_all_exp_results` to recursively find all results in a directory. Finally, `load_result_df` gathers all the summary information in a single dataframe. See [`inspect_resu