AgentBoard
An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral]
What's New
- [2024.01.15] 📣 AgentBoard is released.
- [2024.03.11] 🥳 AgentBoard is accepted by LLMAgents @ ICLR 2024
Introduction
AgentBoard emphasizes analytical evaluation for Large Language Models (LLMs) as generalist agents to perceive and act within various environments. It outlines four principles for constructing a benchmark to evaluate LLMs as generalist agents:
- Task Diversity: AgentBoard incorporates 9 distinct tasks to comprehensively assess the generalist ability of LLM agents, which builds on an LLM's extensive knowledge base and scenario comprehension.
- Multi-round Interaction: AgentBoard provides multi-round interaction between agents and environments, reflecting the evolutionary nature of human intelligence, which continuously receives information and adapts to the environment.
- Partially-Observable Environments: In AgentBoard, the complete state of the environment is not available to the agent, which assesses an agent's world-modeling ability, since additional knowledge must be acquired through online exploration.
- Analytical Evaluation: AgentBoard is a systematic evaluation platform: it includes a user-friendly script to construct goal-oriented reflex agents for a range of models, and features a panel for visualizing and interpreting results across multiple dimensions of agent proficiency, including fine-grained progress rates, grounding accuracy, performance breakdown on hard and easy examples, long-range interactions, detailed performance across sub-skills, and trajectories with friendly visualization.
Table of Contents
<details> <summary> Click to expand the table of contents </summary> </details>

🚀 Quick Start
Here we provide a quick start guide to evaluate LLM agents on AgentBoard within 30 minutes.
Setup Environment
We provide both local setup (recommended) and docker as follows:
<details> <summary> Click to expand local setup procedures (~15 minutes). </summary>

Set up with setup.sh:
Step 1. Create a conda environment
conda create -n ${YOUR_ENV_NAME} python=3.8.13 # python version should be 3.8.13
conda activate ${YOUR_ENV_NAME}
Step 2. Git clone this repo
git clone https://github.com/hkust-nlp/AgentBoard.git
Step 3. Download the data from huggingface
# Download the data and move it to the project root dir
cd AgentBoard
mkdir data
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
tar -zxvf data.tar.gz
Step 4. Set up the environment for tasks except WebArena
INSTALL_WEBARENA=false bash ./setup.sh
# After running the above command, the env will support other tasks than WebArena
Step 5. Set up the environment for WebArena
# Please check whether dbus and Xvfb are installed before building it
# For Ubuntu or Debian
dpkg -l | grep dbus # should list the installed dbus packages
systemctl status dbus # should show "active (running)"
dpkg -l | grep xvfb # should list the installed xvfb package
#-----------------------------------------------------------------------#
# For CentOS
yum list installed | grep Xvfb # should list the installed Xvfb package
systemctl status dbus # should show "active (running)"
dnf list installed | grep dbus # should list the installed dbus packages
If so, you may install the WebArena environment directly.
INSTALL_WEBARENA=true bash ./setup.sh
If not, please jump to Step 6 or Installation by Docker.
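The distribution-specific checks above can be collapsed into one portable sketch. Note the binary names `dbus-daemon` and `Xvfb` are assumptions about what the packages install; adjust for your distro:

```shell
# check_dep: print whether a required binary is on PATH
# (works on both Debian/Ubuntu and CentOS, unlike dpkg/yum)
check_dep() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# dbus-daemon and Xvfb are the binaries the WebArena setup ultimately needs
for bin in dbus-daemon Xvfb; do
  check_dep "$bin"
done
```

If both report `found`, run `INSTALL_WEBARENA=true bash ./setup.sh`; otherwise continue to Step 6.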
(Additional) Step 6. Install dbus and Xvfb
# You must use the sudo permission to do the following:
# For Ubuntu or Debian
# Install and start the dbus service
apt-get install dbus
/etc/init.d/dbus start
# Install and start Xvfb
sudo apt-get update
sudo apt-get install xvfb
INSTALL_WEBARENA=true bash ./setup.sh
#--------------------------------------------------------#
# For Centos
# Install and start the dbus service
yum install -y dbus-x11
/etc/init.d/dbus start
# Install and start Xvfb
yum update
yum install -y Xvfb
INSTALL_WEBARENA=true bash ./setup.sh
</details>
<details>
<summary>
Click to expand docker setup procedures. (~12G, 5 minutes)
</summary>
Docker info: CentOS
Step 1. Pull the docker image and run docker locally
docker pull zzh202121/agentboard:0117
docker run -itd \
--gpus all \
--network host \
--name agent_space \
--shm-size 64gb \
-v /MODEL_PATH:/model_download \
-v /DATA_PATH:/data \
zzh202121/agentboard:0117 \
/bin/bash
docker attach agent_space # YOUR_CONTAINER_NAME
Step 2. Activate the environment
conda activate agentboard
Step 3. Download the code and data
git clone https://github.com/hkust-nlp/AgentBoard.git # clone repo
# Download the data and move it to the project root dir
cd AgentBoard
mkdir data
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
tar -zxvf data.tar.gz
Step 4. Build the search engine index (for WebShop)
cd ./agentboard/environment/WebShop/search_engine
mkdir -p resources resources_100 resources_1k resources_100k
python convert_product_file_format.py # convert items.json => required doc format
mkdir -p indexes
./run_indexing.sh
cd ../../../
Step 5. Start the web service (for WebArena)
/etc/init.d/dbus start # start dbus
Xvfb :99 -screen 0 1280x720x24 & # start xvfb display
export DISPLAY=:99
python -m playwright install
</details>
Setup Environment Variables in AgentBoard/.env
Environment Variables needed for AgentBoard include:
PROJECT_PATH = {path to project}/AgentBoard
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
TODO_KEY=...
MOVIE_KEY=...
SHEET_EMAIL=...
WANDB_API_KEY=...
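AgentBoard reads these variables from .env at startup (commonly done via python-dotenv). As a sanity check on your file, here is a stdlib-only sketch of what such a loader does; the function name `load_env_file` is hypothetical, not part of AgentBoard:

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: parses KEY=VALUE lines, skipping blanks and # comments."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # do not clobber variables already exported in the shell
            os.environ.setdefault(key.strip(), value.strip())

if __name__ == "__main__":
    load_env_file()
    for required in ("PROJECT_PATH", "OPENAI_API_KEY", "WANDB_API_KEY"):
        print(required, "set" if os.getenv(required) else "MISSING")
```

Running it from the project root prints which of the required keys are missing before you launch an evaluation.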
<details>
<summary>
Click to expand API key setup procedures.
</summary>
Variables 1: API keys for Tool tasks
Since API keys for Tool tasks are private, we do not provide them in this repo.
Please follow this detailed guide to get API keys for Tool tasks.
Variables 2: Weights & Biases key for AgentBoard Online Visualization
Please paste your WANDB_API_KEY (see the guide here) into the .env file to log in to Weights & Biases for AgentBoard visualization.
Variables 3: API keys for Proprietary models
⚠️ You don't need to setup API keys for models you don't want to use.
If you use OpenAI models, please put your API keys in .env file.
OPENAI_API_TYPE="open_ai"
OPENAI_API_KEY=${YOUR_OPENAI_API_KEY}
If you use Anthropic models, please put your API keys in .env file.
ANTHROPIC_API_KEY=${YOUR_ANTHROPIC_API_KEY}
</details>
Evaluate Models
Example script for GPT-3.5-Turbo:
python agentboard/eval_main.py \
--cfg-path eval_configs/main_results_all_tasks.yaml \
--tasks alfworld \
--model gpt-3.5-turbo-0613 \
--wandb \
--log_path ./results/gpt-3.5-turbo-0613 \
--project_name evaluate-gpt-35-turbo-0613 \
--baseline_dir ./data/baseline_results
We now offer configurations for 12 SOTA LLMs (gpt-4, gpt-3.5-turbo-0613, text-davinci-003, claude2, deepseek-67b, lemur-70b, mistral-7b, codellama-13b(34b), llama2-13b(70b), vicuna-13b-16k) and a simple reflex agent based on act-only prompting. You can also customize your own agents and LLMs. Models supported by vLLM should generally be supported in AgentBoard, though different models may require specific prompt templates.
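For orientation, the act-only reflex agent configured above boils down to a multi-turn loop like the one below. The `llm` and `env` interfaces here are hypothetical stand-ins, not AgentBoard's actual API:

```python
def run_episode(llm, env, max_turns=30):
    """Act-only loop: feed the observation history to the LLM, execute its action.

    `llm` is any callable prompt -> action string; `env` is a stand-in for a
    multi-turn environment with reset()/step() and a `progress` attribute.
    """
    observation = env.reset()
    history = []
    for _ in range(max_turns):
        prompt = "\n".join(history + [f"Observation: {observation}", "Action:"])
        action = llm(prompt)
        history.append(f"Observation: {observation}\nAction: {action}")
        observation, done = env.step(action)
        if done:
            break
    return env.progress  # fine-grained progress rate in [0, 1]
```

AgentBoard's real agents layer task instructions, few-shot examples, and model-specific prompt templates on top of this skeleton, and its environments additionally expose the per-subgoal progress used for analytical evaluation.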
Launch AgentBoard Analytical Evaluation Panel