AgentBoard

An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral]


<div align="center"> <img src="./assets/agentboard.png" style="width: 20%;height: 10%"> <h1> AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents </h1> </div> <div align="center">


</div> <div align="center"> <!-- <a href="#model">Model</a> • --> 🌐 <a href="https://hkust-nlp.github.io/agentboard">Website</a> | 🏆 <a href="https://hkust-nlp.github.io/agentboard/static/leaderboard.html">Leaderboard</a> | 📚 <a href="https://huggingface.co/datasets/hkust-nlp/agentboard">Data</a> | 📃 <a href="https://arxiv.org/abs/2401.13178">Paper</a> | 📊 <a href="https://wandb.ai/agentboard/llm-agent-eval-gpt-35-turbo-all/reports/Using-Wandb-to-Launch-AgentBoard--Vmlldzo2MTg1Njc4">Panel</a> </div>

What's New

  • [2024.01.15] 📣 AgentBoard is released.
  • [2024.03.11] 🥳 AgentBoard is accepted by LLMAgents @ ICLR 2024

Introduction

AgentBoard emphasizes analytical evaluation for Large Language Models (LLMs) as generalist agents to perceive and act within various environments. It outlines four principles for constructing a benchmark to evaluate LLMs as generalist agents:

  1. Task Diversity: AgentBoard incorporates 9 distinct tasks to comprehensively assess the generalist ability of LLM agents, which builds on LLMs' extensive knowledge base and exceptional scenario comprehension.
  2. Multi-round Interaction: AgentBoard provides multi-round interaction between agents and environments, which is necessary to reflect the evolutionary nature of human intelligence: continuously receiving information and adapting to the environment.
  3. Partially-Observable Environments: In AgentBoard, the complete state of the environment is not available to the agent, which assesses the agent's world-modeling ability, as additional knowledge must be acquired through online exploration.
  4. Analytical Evaluation: AgentBoard is a systematic evaluation platform: it includes a user-friendly script to construct goal-oriented reflex agents for a range of models, and features a panel for visualizing and interpreting results across multiple dimensions of agent proficiency, including fine-grained progress rates, grounding accuracy, performance breakdown on hard and easy examples, long-range interactions, detailed performance across various sub-skills, and trajectories with friendly visualization.
<div align="center"> <img src="./assets/main_graph.png"> <!-- <h1> A nice pic from our website </h1> --> </div>


🚀 Quick Start

Here we provide a quick start guide to evaluate LLM agents on AgentBoard within 30 minutes.

Setup Environment

We provide both local setup (recommended) and docker as follows:

<details> <summary> Click to expand local setup procedures (~ 15 minutes). </summary>

Set up the environment with setup.sh:

Step 1. Create a conda environment

conda create -n ${YOUR_ENV_NAME} python=3.8.13  # python version should be 3.8.13
conda activate ${YOUR_ENV_NAME}

Step 2. Git clone this repo

git clone https://github.com/hkust-nlp/AgentBoard.git

Step 3. Download the data from huggingface

# Download the data and move it to the project root dir
cd AgentBoard
mkdir data
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
tar -zxvf data.tar.gz

Step 4. Set up the environment for tasks except WebArena

INSTALL_WEBARENA=false bash ./setup.sh

# After running the above command, the env will support other tasks than WebArena

Step 5. Set up the environment for WebArena

# Please check whether dbus and Xvfb are installed before building it
# For Ubuntu or Debian
dpkg -l | grep dbus  # will return the info
systemctl status dbus  # will return the status(active (running))
dpkg -l | grep xvfb  # will return the info

#-----------------------------------------------------------------------#

# For CentOS
yum list installed | grep Xvfb  # will return the Xvfb info
systemctl status dbus  # will return the status(active (running))
dnf list installed | grep dbus  # will return the dbus info

If so, you may install the WebArena environment directly.

INSTALL_WEBARENA=true bash ./setup.sh

If not, please jump to Step 6 or Installation by Docker

(Additional) Step 6. Install dbus and Xvfb

# You need sudo permission for the following:

# For Ubuntu or Debian
# Install and start the dbus service
apt-get install dbus
/etc/init.d/dbus start

# Install and start Xvfb
sudo apt-get update
sudo apt-get install xvfb

INSTALL_WEBARENA=true bash ./setup.sh
#--------------------------------------------------------#

# For CentOS
# Install and start the dbus service
yum install -y dbus-x11
/etc/init.d/dbus start

# Install and start Xvfb
yum update
yum install -y Xvfb

INSTALL_WEBARENA=true bash ./setup.sh
</details> <details> <summary> Click to expand docker setup procedures. (~12G, 5 minutes) </summary>

Docker info: CentOS

Step 1. Pull the docker image and run docker locally

docker pull zzh202121/agentboard:0117
docker run -itd \
    --gpus all \
    --network host \
    --name agent_space \
    --shm-size 64gb \
    -v /MODEL_PATH:/model_download \
    -v /DATA_PATH:/data \
    zzh202121/agentboard:0117 \
    /bin/bash
docker attach agent_space # YOUR_CONTAINER_NAME

Step 2. Activate the environment

conda activate agentboard

Step 3. Download the code and data

git clone https://github.com/hkust-nlp/AgentBoard.git  # clone repo
# Download the data and move it to the project root dir
cd AgentBoard
mkdir data
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
tar -zxvf data.tar.gz

Step 4. Build the search engine index (for WebShop)

cd ./agentboard/environment/WebShop/search_engine
mkdir -p resources resources_100 resources_1k resources_100k
python convert_product_file_format.py # convert items.json => required doc format
mkdir -p indexes
./run_indexing.sh
cd ../../../

Step 5. Start the web service (for WebArena)

/etc/init.d/dbus start  # start dbus
Xvfb :99 -screen 0 1280x720x24 &  # start xvfb display
export DISPLAY=:99
python -m playwright install
</details>

Setup Environment Variables in AgentBoard/.env

Environment Variables needed for AgentBoard include:

PROJECT_PATH = {path to project}/AgentBoard

ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...

TODO_KEY=...
MOVIE_KEY=...
SHEET_EMAIL=...

WANDB_API_KEY=...
<details> <summary> Click to expand API key setup procedures. </summary>

Variables 1: API keys for Tool tasks

Since API keys for Tool tasks are private, we do not provide them in this repo.

Please follow this detailed guide to get API keys for Tool tasks.
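Once the keys are obtained, a quick sanity check before launching an evaluation can save a failed run. Below is a minimal bash sketch, not part of AgentBoard itself; the variable names simply mirror the .env template in this README:

```shell
# For demonstration, start from a clean slate (in practice the keys may already be set)
unset TODO_KEY MOVIE_KEY SHEET_EMAIL

# Hypothetical pre-flight check: report any Tool-task key missing from the environment
missing=0
for var in TODO_KEY MOVIE_KEY SHEET_EMAIL; do
  if [ -z "${!var}" ]; then   # bash indirect expansion: value of the variable named by $var
    echo "missing: $var"
    missing=$((missing + 1))
  fi
done
echo "$missing key(s) missing"
```

Running this before `eval_main.py` makes a missing key an explicit one-line error rather than a mid-run API failure.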

Variables 2: Weights & Biases key for AgentBoard Online Visualization

Please paste the WANDB_API_KEY from your Weights & Biases account into the .env file to log in to Weights & Biases for AgentBoard visualization.

Variables 3: API keys for Proprietary models

⚠️ You don't need to setup API keys for models you don't want to use.

If you use OpenAI models, please put your API keys in .env file.

OPENAI_API_TYPE="open_ai"
OPENAI_API_KEY=${YOUR_OPENAI_API_KEY}

If you use Anthropic models, please put your API keys in .env file.

ANTHROPIC_API_KEY=${YOUR_ANTHROPIC_API_KEY}
</details>
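If you also want the variables from AgentBoard/.env available in your interactive shell (how AgentBoard's own scripts load the file is not specified above, so this is just a convenience), the standard `set -a` idiom works. A small sketch using a throwaway .env with placeholder values, not real keys:

```shell
# Build a throwaway .env in a temp directory so no real file is touched
workdir=$(mktemp -d)
cd "$workdir"
cat > .env <<'EOF'
PROJECT_PATH=/tmp/AgentBoard
OPENAI_API_KEY=sk-placeholder
EOF

# Export every assignment in .env into the current shell
set -a       # auto-export all variables assigned from now on
. ./.env     # source the file; each KEY=VALUE line becomes an exported variable
set +a       # stop auto-exporting

echo "PROJECT_PATH is $PROJECT_PATH"
```

Note that this idiom only handles plain `KEY=VALUE` lines; values containing spaces would need quoting in the .env file.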

Evaluate Models

Example script for GPT-3.5-Turbo:

python agentboard/eval_main.py \
    --cfg-path eval_configs/main_results_all_tasks.yaml \
    --tasks alfworld \
    --model gpt-3.5-turbo-0613 \
    --wandb \
    --log_path ./results/gpt-3.5-turbo-0613 \
    --project_name evaluate-gpt-35-turbo-0613 \
    --baseline_dir ./data/baseline_results

We now offer configurations for 12 SOTA LLMs (gpt-4, gpt-3.5-turbo-0613, text-davinci-003, claude2, deepseek-67b, lemur-70b, mistral-7b, codellama-13b/34b, llama2-13b/70b, vicuna-13b-16k) and a simple reflex agent based on act-only prompting. You can also customize your own agents and LLMs. Models supported by vLLM should generally be supported in AgentBoard, although different models may require specific prompt templates.
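The same entry point can be swept over several tasks from the shell. The dry-run sketch below only prints the commands; task names other than alfworld are assumptions here, so check eval_configs/main_results_all_tasks.yaml for the authoritative list:

```shell
MODEL=gpt-3.5-turbo-0613
for task in alfworld webshop babyai; do   # task names beyond alfworld are assumed
  cmd="python agentboard/eval_main.py \
    --cfg-path eval_configs/main_results_all_tasks.yaml \
    --tasks $task \
    --model $MODEL \
    --log_path ./results/$MODEL"
  echo "$cmd"   # dry run: print the command instead of executing it
done
```

Dropping the `echo` (and the `cmd=` capture) turns the dry run into a real sequential sweep; logs for each task then land under the same ./results/$MODEL directory.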

Launch the AgentBoard Analytical Evaluation Panel
