ChessImageBench
A Benchmark for Chessboard Image Generation and Error Detection
Chess is often used in AI research to investigate particular behaviors in controlled settings. This repository presents another use case for chess in AI research: as a benchmark for image generation models and vision language models (VLMs). In particular, we ask image generators to produce accurate chessboards, manually label the outputs for various types of mistakes, and evaluate VLMs on their ability to recognize these mistakes. The results are very clear: current state-of-the-art image generation models cannot generate accurate chessboards, and current state-of-the-art VLMs perform so poorly at recognizing these mistakes that even a simple baseline outperforms them.
Why Chessboard Generation Forms a Good Benchmark
There are several key factors that make chessboards very nice for evaluating both image generation models and VLMs:
- Present in training data, but not optimized for: Chess is a very popular game, and one can therefore assume that models have seen thousands, if not millions, of chessboards during training. At the same time, "generating chessboards" is not a common or optimized use case, which means that model providers have not specifically tuned for this task.
- Local and global consistency is required: Generating an accurate chessboard is not only about producing something visually appealing. It requires both local and global consistency:
  - Global consistency: A chessboard must have exactly 8x8 squares, and the depicted position must make sense in a chess game.
  - Local consistency: The squares must alternate between two colors, and the pieces themselves must be rendered without distortions.
- Counterintuitive for VLMs: VLMs are likely to perform poorly on this task because it requires reasoning about perturbations of a well-known object. Most chessboards seen during training are correct, so the model might simply assume generated boards are also correct, leading to systematic misclassification.
- Easy evaluation: The task is simple and requires minimal expertise. Anyone with a basic understanding of chess can evaluate a board for correctness and identify mistakes in roughly 30 seconds per image. Evaluating a single image is straightforward, making the overall benchmarking process efficient and replicable.
📊 Main Results
Experimental Overview
The core setup can be summarized as follows:
- We created 100 prompts that each ask for a chessboard in a specific position.
- Six state-of-the-art image generation models were used to generate chessboards for these prompts.
- The generated chessboards were manually labeled for various types of mistakes. Each mistake type is a binary category (e.g., "is the board 8x8?").
- Three state-of-the-art VLMs were evaluated on their ability to detect these mistakes. Their predictions were compared against the manual ground truth.
Image Generation Models
The figure below shows the performance of the different image generation models across mistake types. The results are clear: most generated chessboards were not 8x8 boards, did not show reasonable chess positions, and contained distorted pieces or squares. Additionally, some models also produced 2D boards, though this is not necessarily problematic since the prompts did not specify whether a 2D or 3D board was required.
Notable observations:
- Gemini-2.5-Flash-Image generated visually appealing chessboards with lower distortion rates. However, it performed extremely poorly when it came to global consistency, such as ensuring the board was 8x8 or representing a legal chess position.
- GPT-Image-1 showed the opposite behavior. It performed better at generating correct 8x8 boards and somewhat reasonable positions, but often introduced distortions in the pieces or squares. Interestingly, it was the only model that managed to generate reasonable positions in a noticeable fraction of images (about 10%).

Across all 600 generated images, only a single image came close to being a correct chessboard. Even then, it was unclear whether the board was truly 8x8 because the edges were partially cropped.
VLMs
The figure below shows the performance of the different VLMs at recognizing mistakes in the generated images. An image is considered correctly classified only if the VLM identifies all mistakes present in that image. Additionally, we included a simple "majority" baseline that always predicts the most common label for each mistake type. The outcome is clear: all VLMs performed worse than the baseline. This strongly indicates that the models fail to reason effectively about subtle structural errors in familiar objects like chessboards. It is worth pointing out that most models perform almost perfectly on the "2D Chessboard" category, indicating that they do understand some aspects of chessboards.
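For intuition, the majority baseline and the exact-match scoring described above can be sketched as follows. The label format (dicts of binary mistake types per image) and field names are hypothetical; the repository's actual data layout may differ.

```python
from collections import Counter

def majority_baseline(ground_truth: list[dict]) -> dict:
    """Predict, for each mistake type, its most common label in the data."""
    return {
        mistake: Counter(img[mistake] for img in ground_truth).most_common(1)[0][0]
        for mistake in ground_truth[0]
    }

def exact_match_accuracy(predictions: list[dict], ground_truth: list[dict]) -> float:
    """An image counts as correct only if every mistake type matches."""
    hits = sum(pred == truth for pred, truth in zip(predictions, ground_truth))
    return hits / len(ground_truth)

# Toy example: two binary mistake types over three labeled images.
labels = [
    {"not_8x8": True, "distorted_pieces": True},
    {"not_8x8": True, "distorted_pieces": False},
    {"not_8x8": False, "distorted_pieces": True},
]
baseline = majority_baseline(labels)
# The baseline predicts the same labels for every image; under exact-match
# scoring it is only credited for images it matches on all mistake types.
accuracy = exact_match_accuracy([baseline] * len(labels), labels)
```

Because an image is only counted as correct when every mistake type matches, the baseline is harder to beat than per-category accuracy numbers might suggest.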

🛠️ Technical Details
We categorize mistakes into several types. Each type is binary (yes/no):
- Not 8x8: The chessboard does not have 64 squares arranged in an 8x8 grid.
- Unsure 8x8: It is not possible to determine if the board follows the 8x8 pattern (e.g., because part of the board is not visible).
- No Alternating Colors: The board fails to alternate square colors properly (e.g., black square shown as white or a square containing multiple colors).
- Distorted Pieces: The pieces themselves are distorted (e.g., incorrect shapes, or inconsistent appearances of identical pieces).
- Distorted Squares: The squares themselves are not uniform (e.g., rectangles instead of squares, rotated or bent shapes).
- Distorted Letters: If rank/file indicators are present, this captures whether they are wrong or nonsensical.
- Unreasonable Position: The position shown is illegal or impossible in chess (e.g., too many pieces, multiple kings, pawns on the first or last rank, or pieces floating between squares). Importantly, for partially visible boards, we assume that no pieces lie outside the visible area. Thus, if only a single king is visible, the position is considered unreasonable.
- 2D Chessboard: Whether the board is represented in a flat, 2D style. This is not an error in itself, but is tracked.
- Instructions not Followed: The prompt instructions were not followed (e.g., the requested position is not shown). If the position is unreasonable, this flag is automatically set to "no", except for some extreme edge cases.
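As an illustration, a few of the "Unreasonable Position" checks above can be expressed over a FEN-style piece-placement string. This helper is purely illustrative: the benchmark's actual labels were assigned manually from images, not computed from FEN.

```python
def position_red_flags(placement: str) -> list[str]:
    """Flag a few basic ways a FEN piece-placement string can be unreasonable."""
    flags = []
    ranks = placement.split("/")
    if len(ranks) != 8:
        return ["not 8 ranks"]
    counts: dict[str, int] = {}
    for rank_idx, rank in enumerate(ranks):
        for ch in rank:
            if ch.isdigit():  # digits encode runs of empty squares
                continue
            counts[ch] = counts.get(ch, 0) + 1
            # Pawns can never stand on the first or last rank.
            if ch in "pP" and rank_idx in (0, 7):
                flags.append("pawn on back rank")
    # Each side must have exactly one king.
    if counts.get("K", 0) != 1 or counts.get("k", 0) != 1:
        flags.append("wrong number of kings")
    # A side starts with 16 pieces and can never gain more.
    if sum(n for piece, n in counts.items() if piece.isupper()) > 16:
        flags.append("too many white pieces")
    if sum(n for piece, n in counts.items() if piece.islower()) > 16:
        flags.append("too many black pieces")
    return flags
```

A full legality check (e.g., both kings in check, unreachable positions) is considerably more involved; libraries such as python-chess implement it.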
For evaluation and plotting, mistake types were grouped as follows:
- Distortions = Distorted Pieces + Distorted Squares + Distorted Letters + No Alternating Colors
- 8x8 = Opposite of (Not 8x8 + Unsure 8x8)
- Unreasonable Position = Unreasonable Position + Instructions not Followed
- 2D Chessboard = 2D Chessboard
For VLM evaluation, the Instructions not Followed category was discarded.
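The grouping above can be sketched as follows, assuming each image's manual labels are stored as a dict of booleans. The field names here are illustrative, not the repository's actual schema.

```python
def group_labels(raw: dict) -> dict:
    """Collapse the fine-grained binary mistake types into plot categories."""
    return {
        "Distortions": (raw["distorted_pieces"] or raw["distorted_squares"]
                        or raw["distorted_letters"] or raw["no_alternating_colors"]),
        # "8x8" is the opposite of the combined not-8x8 / unsure-8x8 flags.
        "8x8": not (raw["not_8x8"] or raw["unsure_8x8"]),
        "Unreasonable Position": (raw["unreasonable_position"]
                                  or raw["instructions_not_followed"]),
        "2D Chessboard": raw["2d_chessboard"],
    }
```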
Models
We used several models for both image generation and VLM evaluation. All configurations can be found in the configs/models folder. Notably, we used the preview version of Gemini-2.5-Flash-Image for the image generation tasks. The prompt used for the VLMs is given in the configs/instructions.txt file.
🖥️ View Results
You can explore the benchmark results in different ways. First, install dependencies:
pip install -e .
You can also use uv for a more reproducible environment. If you do, prepend uv run to all commands in this README.
1. View Via Notebook
The notebook notebooks/plots.ipynb provides all plots and reproduction code.
2. View Via Web Server
A simple web server allows you to browse all images and their manual categorizations:
python app.py
Open your browser at: http://127.0.0.1:5001/. You should see a simple interface that allows you to view all images and their categorizations.
To view categorizations made by VLMs instead of the ground truth:
python app.py --output-dir data/models/gpt-5
3. View on HuggingFace
The benchmark is also hosted on HuggingFace: https://huggingface.co/datasets/JasperDekoninck/ChessImageBench.
🔁 Reproduce Results
To fully reproduce the results:
1. Delete old data:
rm -rf data/images data/models data/output
2. Generate images for each model. You need to set the appropriate API keys yourself as environment variables.
python scripts/run.py --config flux-max
3. Label images:
python app.py
Then open http://127.0.0.1:5001/human_judgment and label the images using the interface.
4. Run each VLM on the generated images. Again, you need to set the appropriate API keys yourself as environment variables.
python scripts/run_vlm.py gpt-5
📝 Citation
If you use this benchmark in your research, please cite:
@misc{chessimagebench,
  title={ChessImageBench: AI Models Fail to Generate Accurate Chessboards and Recognize Mistakes in Them},
  author={Jasper Dekoninck},
  year={2025},
  url={https://github.com/JasperDekoninck/ChessImageBench}
}
