GAMEBoT
[ACL 2025] GAMEBoT: Transparent Assessment of LLM Reasoning in Games
Update
[2025/08] Added evaluation results for GPT-5 and Gemini 2.5 Pro. Check out the gaming replay!
[2025/05] 🌟 Our work has been accepted to the ACL 2025 Main Conference!
[2025/03] Added evaluation results for new models.
Overview
GAMEBoT is a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) through direct competition in a suite of diverse games. Going beyond simple win/loss outcomes, GAMEBoT facilitates a more transparent assessment by enabling analysis of the intermediate reasoning steps and strategies employed by LLMs during gameplay.
Advantages of using GAMEBoT include:
- Interpretability: Assesses both the final decisions and the intermediate reasoning steps.
- Difficulty: Challenging enough to differentiate between top-performing models.
- Hard to game: Interactive gaming environments alleviate data contamination concerns.
- Stronger Baselines: The prompts presented can serve as valuable CoT baselines for future research.
Key Features
- Focus on Reasoning: Evaluates not just game outcomes but the quality of strategic thinking.
- Transparent Evaluation: Provides game logs and visualizations for detailed analysis.
- Extensible Framework: Easily add new LLM agents.
- Diverse Game Suite: Includes 8 games covering different reasoning aspects (e.g., strategy, logic, spatial awareness).
Latest Evaluations
O3-mini-high dominates GAMEBoT! Our latest evaluations show O3-mini-high outperforming top competitors including DeepSeek-R1 and Claude 3.7 Sonnet.
Connect4 Matches
| Model A                   | Score  | Model B                   |
|---------------------------|--------|---------------------------|
| gemini-2.0-flash-thinking | 6 : 4  | gpt-4o-0513               |
| gemini-2.0-pro-exp        | 6 : 4  | gemini-2.0-flash-thinking |
| deepseek-r1               | 7 : 3  | gemini-2.0-pro-exp        |
| deepseek-r1               | 8 : 8  | o1-preview                |
| o3-mini-high              | 10 : 6 | deepseek-r1               |
| o3-mini-high              | 8 : 8  | claude-3.7-sonnet         |
| o3-mini-high              | 12 : 4 | gpt-4.5                   |
Checkers Matches
| Model A            | Score | Model B                   |
|--------------------|-------|---------------------------|
| gemini-2.0-pro-exp | 5 : 4 | gemini-2.0-flash-thinking |
| deepseek-r1        | 9 : 1 | gemini-2.0-pro-exp        |
| o3-mini-high       | 9 : 0 | deepseek-r1               |
| o3-mini-high       | 9 : 0 | claude-3.7-sonnet         |
Evaluated Models
Models Evaluated in the Paper
| Provider | Model Name | API/Identifier |
| :--------- | :------------------------------- | :--------------------------------- |
| OpenAI | GPT-4o | gpt-4o-2024-05-13 |
| OpenAI | GPT-4o mini | gpt-4o-mini-2024-07-18 |
| OpenAI | GPT-4 Turbo | gpt-4-1106 |
| Google | Gemini 1.5 Pro | gemini-1.5-pro-preview-0514 |
| Google | Gemini 1.5 Flash | gemini-1.5-flash-preview-0514 |
| Google | Gemini 1.0 Pro | gemini-1.0-pro-002 |
| Anthropic | Claude 3 Haiku | claude-3-haiku@20240307 |
| Anthropic | Claude 3 Sonnet | claude-3-sonnet@20240229 |
| Anthropic | Claude 3.5 Sonnet | claude-3-5-sonnet@20240620 |
| Meta | Llama 3 8B Instruct (via MaaS) | meta/LLaMA3-8b-instruct-maas |
| Meta | Llama 3 70B Instruct (via MaaS) | meta/LLaMA3-70b-instruct-maas |
| Meta | Llama 3 405B Instruct (via MaaS) | meta/LLaMA3-405b-instruct-maas |
| Reka | Reka Flash | reka-flash-20240904 |
| Reka | Reka Core | reka-core-20240415 |
| AI21 Labs | Jamba 1.5 Large | jamba-1.5-large |
| AI21 Labs | Jamba 1.5 Mini | jamba-1.5-mini |
| Mistral AI | Mistral Nemo | mistral-nemo-2407 |
Newly Supported Models
| Provider | Model Name | API/Identifier |
| :--------- | :--------------------------------- | :------------------------------------ |
| Google | Gemini 2.0 Flash Exp | gemini-2.0-flash-exp |
| Google | Gemini 2.0 Flash Thinking Exp | gemini-2.0-flash-thinking-exp-01-21 |
| Google | Gemini 2.0 Pro Exp | gemini-2.0-pro-exp-02-05 |
| Anthropic | Claude 3.5 Sonnet v2 | claude-3-5-sonnet-v2@20241022 |
| Anthropic | Claude 3.5 Haiku | claude-3-5-haiku@20241022 |
| DeepSeek | DeepSeek R1 | deepseek-r1 |
| OpenAI | o1 | o1-2024-12-17 |
| OpenAI | o1 mini | o1-mini-2024-09-12 |
| OpenAI | o3 mini | o3-mini |
Installation
- Prerequisites:
  - Python 3.10
  - git
- Clone Repository:

      git clone https://github.com/Visual-AI/GAMEBoT.git
      cd GAMEBoT

- Set up Environment (Recommended):
  - Using venv:

        python -m venv venv
        source venv/bin/activate  # On Windows use `venv\Scripts\activate`

  - Using conda:

        conda create -n gamebot python=3.10
        conda activate gamebot

- Install Dependencies:

      sh setup_env.sh && pip install -r requirements.txt

- Configure LLM API:
  - Set up your API keys in `keys.py`.
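This README does not list the variable names that `keys.py` must define, so check the file in the repository for the actual ones. As a minimal sketch of one way to keep secrets out of version control, `keys.py` could read its values from environment variables. The names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` and both helper functions below are placeholders for illustration, not the repository's actual layout.

```python
import os

# Placeholder names (assumption): replace with the variables keys.py really expects.
EXPECTED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def load_keys(names=EXPECTED_KEYS, environ=os.environ):
    """Return {name: value} for every key that is set and non-empty."""
    return {name: environ[name] for name in names if environ.get(name)}


def missing_keys(names=EXPECTED_KEYS, environ=os.environ):
    """Return the names that are unset or empty, in the order given."""
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing API keys:", ", ".join(absent))
    else:
        print("All expected API keys are set.")
```

Checking for missing keys before launching a match avoids spending a run on an authentication error partway through a game.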
Adding New LLM Agents
Most of the provided LLM agents are accessed through Google Cloud Vertex AI. To use your own API access or add new models:
- Implement the Agent Interface: Create a new class for your agent within the `agent_list/` directory. The class should implement the following method:

      def get_response_text(self, prompt):
          # Your code to get the response from the LLM
          return response_text

  See the `agent_list/DeepSeek_ByteDance` class for an example.
- Handle API Keys/Credentials: Ensure your new agent class can access the necessary credentials.
- Register the Agent: Add your new agent class and its identifier string to the `InitAgent()` function in `agent_list/__init__.py`. Once registered, the identifier (e.g., `'my-custom-model'`) can be used directly on the command line with the scripts in `run_games_and_check/`.
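The steps for adding an agent can be sketched as follows. Only the `get_response_text(self, prompt)` method and the `'my-custom-model'` identifier come from this README. The `EchoAgent` class, the `AGENT_REGISTRY` dictionary, and `init_agent_sketch()` are hypothetical stand-ins, and the real `InitAgent()` in `agent_list/__init__.py` may be structured differently.

```python
class EchoAgent:
    """Hypothetical agent that returns a canned reply instead of calling an LLM."""

    def __init__(self, reply="Chosen move: 3"):
        self.reply = reply

    def get_response_text(self, prompt):
        # A real agent would send `prompt` to its provider's API here
        # and return the text of the model's reply.
        response_text = self.reply
        return response_text


# Hypothetical registry (assumption): maps identifier strings to agent classes.
AGENT_REGISTRY = {"my-custom-model": EchoAgent}


def init_agent_sketch(identifier):
    """Look up an identifier and return a new agent instance."""
    try:
        return AGENT_REGISTRY[identifier]()
    except KeyError:
        raise ValueError(f"Unknown agent identifier: {identifier!r}") from None


if __name__ == "__main__":
    agent = init_agent_sketch("my-custom-model")
    print(agent.get_response_text("It is your turn. Which column do you play?"))
```

A canned-reply agent like this can also be handy for checking that a new registration is wired up correctly before any API calls are made.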
Running the Benchmark
- Execution Command:
  The general format for running a game between two LLMs is:

      python run_games_and_check/<game_script_name>.py <agent1_identifier> <agent2_identifier> [--cycles N] [other_game_specific_args]

  - `<game_script_name>.py`: The script for the specific game (e.g., `connect4.py`).
  - `<agent1_identifier>`: The identifier for the first player's LLM (e.g., `gpt-4o`, `gemini-1.5-pro-preview-0514`). Use the identifiers listed in the tables above or the ones you add.
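When scripting several matches, it can help to assemble the command programmatically. The sketch below follows the command format given above. The helper `build_game_command()` is hypothetical and not part of the repository, and it uses `sys.executable` in place of a bare `python` so the match runs in the currently active environment.

```python
import sys


def build_game_command(game_script_name, agent1, agent2, cycles=None, extra_args=()):
    """Assemble the argument list for one match between two agents."""
    cmd = [
        sys.executable,
        f"run_games_and_check/{game_script_name}.py",
        agent1,
        agent2,
    ]
    if cycles is not None:
        cmd += ["--cycles", str(cycles)]
    cmd += list(extra_args)  # any game-specific arguments, passed through unchanged
    return cmd


if __name__ == "__main__":
    cmd = build_game_command(
        "connect4", "gpt-4o", "gemini-1.5-pro-preview-0514", cycles=2
    )
    print(" ".join(cmd[1:]))
    # To launch the match from the repository root:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems when identifiers contain characters such as `@` or `/`.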
