<div align="center"> <img src="coolerbot_nobg.png" width="150" alt="GAMEBoT Logo - LLM Game Benchmark"> <h1>GAMEBoT: Transparent Assessment of LLM Reasoning in Games</h1> <a href="https://arxiv.org/abs/2412.13602">Read the Paper (arXiv)</a> | <a href="https://visual-ai.github.io/gamebot/">Visit the Project Website</a>   </div>

Update

[2025/08] Added evaluation results for GPT-5 and Gemini 2.5 Pro. Check out the game replays!

[2025/05] 🌟 Our work has been accepted to the ACL 2025 Main Conference!

[2025/03] Added evaluation results for new models.

Overview

GAMEBoT is a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) through direct competition in a suite of diverse games. Going beyond simple win/loss outcomes, GAMEBoT facilitates a more transparent assessment by enabling analysis of the intermediate reasoning steps and strategies employed by LLMs during gameplay.

Advantages of using GAMEBoT include:

- Interpretability: Assesses not only final decisions but also the intermediate reasoning steps.
- Difficulty: Challenging enough to differentiate between top-performing models.
- Resistance to contamination: Interactive game environments alleviate data contamination concerns.
- Stronger baselines: The provided prompts can serve as Chain-of-Thought (CoT) baselines for future research.

Key Features

- Focus on Reasoning: Evaluates not just game outcomes but the quality of strategic thinking.
- Transparent Evaluation: Provides game logs and visualizations for detailed analysis.
- Extensible Framework: Easily add new LLM agents.
- Diverse Game Suite: Includes 8 games covering different reasoning aspects (e.g., strategy, logic, spatial awareness).

<table> <tr> <td align="center">  <img src="assets/checkers.gif" width="180" alt="Checkers Gameplay GIF"> Checkers </td> <td align="center">  <img src="assets/tictactoe.gif" width="180" alt="TicTacToe Gameplay GIF"> TicTacToe </td> <td align="center">  <img src="assets/connect4.gif" width="180" alt="Connect4 Gameplay GIF"> Connect4 </td> <td align="center">  <img src="assets/othello.gif" width="180" alt="Othello Gameplay GIF"> Othello </td> </tr> <tr> <td align="center">  <img src="assets/pong.gif" width="180" alt="Pong Gameplay GIF"> Pong </td> <td align="center">  <img src="assets/surround.gif" width="180" alt="Surround Gameplay GIF"> Surround </td> <td align="center">  <img src="assets/negotiate.gif" width="220" alt="Negotiate Gameplay GIF"> Negotiate </td> <td align="center">  <img src="assets/poker.gif" width="220" alt="Texas Hold'em Gameplay GIF"> Texas Hold'em </td> </tr> </table>

Latest Evaluations

o3-mini-high dominates GAMEBoT! In the matches below, it beats DeepSeek-R1 in both Connect4 (10 : 6) and Checkers (9 : 0), sweeps Claude 3.7 Sonnet 9 : 0 in Checkers, and ties it 8 : 8 in Connect4.

Connect4 Matches

| Model A                   | Score  | Model B                   |
|---------------------------|--------|---------------------------|
| gemini-2.0-flash-thinking | 6 : 4  | gpt-4o-0513               |
| gemini-2.0-pro-exp        | 6 : 4  | gemini-2.0-flash-thinking |
| deepseek-r1               | 7 : 3  | gemini-2.0-pro-exp        |
| deepseek-r1               | 8 : 8  | o1-preview                |
| o3-mini-high              | 10 : 6 | deepseek-r1               |
| o3-mini-high              | 8 : 8  | claude-3.7-sonnet         |
| o3-mini-high              | 12 : 4 | gpt-4.5                   |
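To compare models across several Connect4 matches at a glance, the scores can be aggregated. The sketch below is not the paper's metric: it assumes each score reads as points for Model A : points for Model B, and it computes a naive share of points earned that ignores opponent strength. The function name `score_share` is made up for this illustration.

```python
from collections import defaultdict

# (model_a, score_a, score_b, model_b), copied from the Connect4 table.
CONNECT4 = [
    ("gemini-2.0-flash-thinking", 6, 4, "gpt-4o-0513"),
    ("gemini-2.0-pro-exp", 6, 4, "gemini-2.0-flash-thinking"),
    ("deepseek-r1", 7, 3, "gemini-2.0-pro-exp"),
    ("deepseek-r1", 8, 8, "o1-preview"),
    ("o3-mini-high", 10, 6, "deepseek-r1"),
    ("o3-mini-high", 8, 8, "claude-3.7-sonnet"),
    ("o3-mini-high", 12, 4, "gpt-4.5"),
]


def score_share(matches):
    """Return each model's fraction of all points played in its matches."""
    earned = defaultdict(int)
    possible = defaultdict(int)
    for model_a, score_a, score_b, model_b in matches:
        total = score_a + score_b
        earned[model_a] += score_a
        earned[model_b] += score_b
        possible[model_a] += total
        possible[model_b] += total
    return {model: earned[model] / possible[model] for model in earned}


shares = score_share(CONNECT4)
# Print models from highest to lowest share of points.
for model, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{model:28s}{share:.3f}")
```

Under this naive aggregate, o3-mini-high earns 30 of 48 points (0.625), the highest share in the table.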

Checkers Matches

| Model A            | Score | Model B                   |
|--------------------|-------|---------------------------|
| gemini-2.0-pro-exp | 5 : 4 | gemini-2.0-flash-thinking |
| deepseek-r1        | 9 : 1 | gemini-2.0-pro-exp        |
| o3-mini-high       | 9 : 0 | deepseek-r1               |
| o3-mini-high       | 9 : 0 | claude-3.7-sonnet         |

Evaluated Models

Models Evaluated in the Paper

Newly Supported Models

Installation

Prerequisites:
- Python 3.10
- git
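A quick way to confirm both prerequisites before cloning is a small check script. This is only a sketch and not part of the repository: the helper names are hypothetical, and treating "Python 3.10" as an exact major.minor match is an assumption (other versions may or may not work).

```python
import shutil
import sys
from typing import Tuple


def python_is_supported(version: Tuple[int, int]) -> bool:
    # Assumption: the README's "Python 3.10" means exactly major.minor 3.10.
    return version == (3, 10)


def git_is_available() -> bool:
    # shutil.which returns None when the executable is not on PATH.
    return shutil.which("git") is not None


if __name__ == "__main__":
    print("Python 3.10:", python_is_supported(sys.version_info[:2]))
    print("git on PATH:", git_is_available())
```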

Clone Repository:

git clone https://github.com/Visual-AI/GAMEBoT.git
cd GAMEBoT

Set up Environment (Recommended):

Using venv:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Using conda:

conda create -n gamebot python=3.10
conda activate gamebot

Install Dependencies:

sh setup_env.sh && pip install -r requirements.txt

Configure LLM API
- Set up your API keys in keys.py.

Adding New LLM Agents

Most of the provided LLM agents are accessed through Google Cloud Vertex AI. If you use a different provider or want to evaluate a new model, add your own agent as follows:

Implement the Agent Interface: Create a new class for your agent within the agent_list/ directory. The class should implement the following method:
```
    def get_response_text(self, prompt):
        # Send `prompt` to your LLM and return its reply as a string.
        response_text = ""  # placeholder: replace with the text from your API call
        return response_text
```
See the agent_list/DeepSeek_ByteDance class for an example.
Handle API Keys/Credentials: Ensure your new agent class can access the credentials it needs.
Register the Agent: Add your new agent class and its identifier string to the InitAgent() function in agent_list/__init__.py. Once registered, the identifier (e.g., 'my-custom-model') can be passed directly on the command line to the scripts in run_games_and_check/.
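The steps above can be sketched roughly as follows. Only the `get_response_text(self, prompt)` method comes from this README. The class name `MyCustomAgent`, the `send_fn` constructor argument, the `AGENT_REGISTRY` dictionary, and `init_agent_sketch` are hypothetical stand-ins; the real registration happens inside InitAgent() in agent_list/__init__.py, whose signature is not shown here.

```python
from typing import Callable, Dict


class MyCustomAgent:
    """Hypothetical agent wrapping any callable that maps a prompt to a reply."""

    def __init__(self, send_fn: Callable[[str], str]):
        # send_fn is an assumption of this sketch; in practice it would wrap
        # your provider's client call and load its own credentials.
        self._send_fn = send_fn

    def get_response_text(self, prompt: str) -> str:
        # The one method the README asks every agent class to implement.
        response_text = self._send_fn(prompt)
        return response_text


# Hypothetical identifier-to-factory mapping, for illustration only.
# The repository registers agents in InitAgent() instead.
AGENT_REGISTRY: Dict[str, Callable[[], MyCustomAgent]] = {
    "my-custom-model": lambda: MyCustomAgent(lambda p: f"echo: {p}"),
}


def init_agent_sketch(identifier: str) -> MyCustomAgent:
    # Look up the identifier and build the agent, failing loudly if unknown.
    if identifier not in AGENT_REGISTRY:
        raise ValueError(f"Unknown agent identifier: {identifier}")
    return AGENT_REGISTRY[identifier]()


agent = init_agent_sketch("my-custom-model")
print(agent.get_response_text("Your move?"))
```

Here the echo lambda stands in for a real API call so the sketch runs without credentials; swapping it for a function that calls your provider is the only change a real agent would need under these assumptions.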

Running the Benchmark

Execution Command: The general format for running a game between two LLMs is:
```
python run_games_and_check/<game_script_name>.py <agent1_identifier> <agent2_identifier> [--cycles N] [other_game_specific_args]
```
- <game_script_name>.py: The script for the specific game (e.g., connect4.py).
- <agent1_identifier>: The identifier for the first player's LLM (e.g., gpt-4o, gemini-1.5-pro-preview-0514). Use the identifiers listed in the tables above or the ones you add.
- <agent2_identifier>: The identifier for the second player's LLM, chosen the same way.
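When launching many matchups, it can help to assemble the command programmatically. The sketch below mirrors the command format shown above; the helper name `build_command` is hypothetical, the value passed to `--cycles` is only an illustrative number, and the agent identifiers are the examples given in this README.

```python
import sys
from typing import List, Optional


def build_command(game_script: str, agent1: str, agent2: str,
                  cycles: Optional[int] = None) -> List[str]:
    # Mirrors: python run_games_and_check/<game_script_name>.py <agent1> <agent2> [--cycles N]
    cmd = [sys.executable, f"run_games_and_check/{game_script}", agent1, agent2]
    if cycles is not None:
        cmd += ["--cycles", str(cycles)]
    return cmd


cmd = build_command("connect4.py", "gpt-4o", "gemini-1.5-pro-preview-0514", cycles=2)
# Show the arguments that follow the Python interpreter path.
print(" ".join(cmd[1:]))
```

To actually start the match, pass the list to `subprocess.run(cmd, check=True)` from the repository root, with your API keys configured.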
