AgentLab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
🛠️ Setup | 🤖 Assistant | 🚀 Launch Experiments | 🔍 Analyse Results | <br> 🏆 Leaderboard | 🤖 Build Your Agent | ↻ Reproducibility | 💪 BrowserGym
<img src="https://github.com/user-attachments/assets/47a7c425-9763-46e5-be54-adac363be850" alt="agentlab-diagram" width="700"/>

> [!WARNING]
> AgentLab is meant to provide an open, easy-to-use, and extensible framework to accelerate the field of web agent research. It is not meant to be a consumer product. Use with caution!
AgentLab is a framework for developing and evaluating agents on a variety of benchmarks supported by BrowserGym. It is presented in more detail in our BrowserGym ecosystem paper.
AgentLab Features:
- Easy large-scale parallel agent experiments using Ray
- Building blocks for making agents on top of BrowserGym
- Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted models using TGI
- Preferred way to run benchmarks like WebArena
- Various reproducibility features
- Unified leaderboard
🎯 Supported Benchmarks
| Benchmark | Setup <br> Link | # Task <br> Templates | Seed <br> Diversity | Max <br> Steps | Multi-tab | Hosting Method | BrowserGym <br> Leaderboard |
|-----------|-----------------|-----------------------|---------------------|----------------|-----------|----------------|-----------------------------|
| WebArena | setup | 812 | None | 30 | yes | self hosted (docker) | soon |
| WebArena-Verified | setup | 812 | None | 30 | yes | self hosted | soon |
| WorkArena L1 | setup | 33 | High | 30 | no | demo instance | soon |
| WorkArena L2 | setup | 341 | High | 50 | no | demo instance | soon |
| WorkArena L3 | setup | 341 | High | 50 | no | demo instance | soon |
| WebLinx | - | 31586 | None | 1 | no | self hosted (dataset) | soon |
| VisualWebArena | setup | 910 | None | 30 | yes | self hosted (docker) | soon |
| AssistantBench | setup | 214 | None | 30 | yes | live web | soon |
| GAIA (soon) | - | - | None | - | - | live web | soon |
| Mind2Web-live (soon) | - | - | None | - | - | live web | soon |
| MiniWoB | setup | 125 | Medium | 10 | no | self hosted (static files) | soon |
| OSWorld | setup | 369 | None | - | - | self hosted | soon |
| TimeWarp | setup | 1386 | None | 30 | yes | self hosted | soon |
🛠️ Setup AgentLab
AgentLab requires Python 3.11 or 3.12.

```bash
pip install agentlab
```
If not done already, install Playwright:

```bash
playwright install
```
Make sure to prepare the required benchmark according to the instructions provided in the setup column.

```bash
export AGENTLAB_EXP_ROOT=<root directory of experiment results>  # defaults to $HOME/agentlab_results
export OPENAI_API_KEY=<your openai api key>  # if OpenAI models are used
```
<details>
<summary>Setup OpenRouter API</summary>
```bash
export OPENROUTER_API_KEY=<your openrouter api key>  # if OpenRouter models are used
```
</details>
<details>
<summary>Setup Azure API</summary>
```bash
export AZURE_OPENAI_API_KEY=<your azure api key>  # if Azure models are used
export AZURE_OPENAI_ENDPOINT=<your endpoint>      # if Azure models are used
```
</details>
🤖 UI-Assistant
Use an assistant to work for you (at your own cost and risk).
```bash
agentlab-assistant --start_url https://www.google.com
```
Try your own agent:
```bash
agentlab-assistant --agent_config="module.path.to.your.AgentArgs"
```
🚀 Launch experiments
```python
# Import your agent configuration, which extends the bgym.AgentArgs class.
# Make sure this object is imported from a module accessible in PYTHONPATH
# so it can be properly unpickled.
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(
    benchmark="miniwob",  # or "webarena", "workarena_l1", ...
    agent_args=[AGENT_4o_MINI],
    comment="My first study",
)

study.run(n_jobs=5)
```
Relaunching incomplete or errored tasks
```python
from agentlab.experiments.study import Study

study = Study.load("/path/to/your/study/dir")
study.find_incomplete(include_errors=True)
study.run()
```
See main.py to launch experiments with a variety of options. It works like a lazy CLI that is actually more convenient: comment or uncomment the lines you need, or modify them at will (but don't push your changes to the repo).
Job Timeouts
The complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang, which ties up a worker until the study is terminated and relaunched. If you are running jobs sequentially or with a small number of workers, a single hung job can stall your entire study until you manually kill and relaunch it. With the Ray parallel backend, we've implemented a mechanism that automatically terminates jobs exceeding a specified timeout. This feature is particularly useful when hanging tasks would otherwise limit your experiments.
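The idea can be illustrated with a stdlib-only sketch (this is not AgentLab's actual Ray implementation; `job` and `run_with_timeout` are hypothetical names): a job that does not complete within the deadline is flagged as timed out instead of blocking the study indefinitely.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def job(seconds):
    """Stand-in for an agent episode that may hang."""
    time.sleep(seconds)
    return "done"

def run_with_timeout(fn, arg, timeout):
    # Run the job in a worker and give up waiting after `timeout` seconds.
    # (A real implementation, like the Ray backend, would also kill the
    # worker process; a thread can only be abandoned, not terminated.)
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, arg)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return "timeout"
    finally:
        executor.shutdown(wait=False)

print(run_with_timeout(job, 0.05, 1.0))  # fast job finishes normally
print(run_with_timeout(job, 1.0, 0.2))   # slow job is flagged as timed out
```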
Debugging
For debugging, run experiments with `n_jobs=1` and use VSCode's debug mode. This allows you to pause execution at breakpoints.
About Parallel Jobs
Running one agent on one task corresponds to a single job. Conducting ablation studies or random searches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. Efficient parallel execution is therefore critical. Agents typically wait for responses from the LLM server or updates from the web server. As a result, you can run 10–50 jobs in parallel on a single computer, depending on available RAM.
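As a back-of-envelope sketch (illustrative numbers, not from a real study), the job count multiplies quickly:

```python
# One job = one agent configuration on one task with one seed,
# so the total is the product of the three counts.
n_agents = 4     # e.g. ablation variants of one agent (hypothetical)
n_tasks = 812    # e.g. WebArena task templates
n_seeds = 3
n_jobs = n_agents * n_tasks * n_seeds
print(n_jobs)  # 9744 -- already near the 10,000-job scale
```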
⚠️ Note for (Visual)WebArena: These benchmarks have task dependencies designed to minimize "corrupting" the instance between tasks. For example, an agent on task 323 could alter the instance state, making task 201 impossible. To address this, the Ray backend accounts for task dependencies, enabling some degree of parallelism. On WebArena, you can disable dependencies to increase parallelism, but this might reduce performance by 1–2%.
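Dependency-aware scheduling of this kind can be sketched with Python's stdlib `graphlib` (the task IDs and edges below are hypothetical, not the real WebArena dependency graph): tasks whose prerequisites have finished form a batch that can safely run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: node -> set of prerequisite tasks.
# e.g. task 323 must not start before task 201 has finished.
deps = {201: set(), 202: set(), 323: {201}, 400: {202}}

ts = TopologicalSorter(deps)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose prerequisites are all done
    batches.append(ready)           # each batch can run in parallel
    ts.done(*ready)
print(batches)  # [[201, 202], [323, 400]]
```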
⚠️ Instance Reset for (Visual)WebArena: Before evaluating an agent, the instance is automatically reset, a process that takes about 5 minutes. When evaluating multiple agents, the `make_study` function returns a `SequentialStudies` object to ensure proper sequential evaluation of each agent. AgentLab currently does not support evaluations across multiple instances, but you could either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel experience, consider using benchmarks like WorkArena instead.
🔍 Analyse Results
Loading Results
The `ExpResult` class provides a lazy loader for all the information of a specific experiment. You can use `yield_all_exp_results` to recursively find all results in a directory. Finally, `load_result_df` gathers all the summary information in a single dataframe. See [`inspect_resu