<img src="asset/project_logo.png" width="50%" />

<h1 align="center">Build Arena</h1>

The First Physics-Aligned Interactive Benchmark for Language-Driven Engineering Construction

<a href="https://arxiv.org/abs/2510.16559"><img src="https://img.shields.io/badge/📄-Paper-blue?style=for-the-badge" alt="Paper"></a>
<a href="https://github.com/AI4Science-WestlakeU/BuildArena"><img src="https://img.shields.io/badge/💻-Code-green?style=for-the-badge" alt="Code"></a>
<a href="https://build-arena.github.io/"><img src="https://img.shields.io/badge/🌐-Project%20Page-orange?style=for-the-badge" alt="Project Page"></a>
<a href="#-installation"><img src="https://img.shields.io/badge/🚀-Quick%20Start-red?style=for-the-badge" alt="Quick Start"></a>
<a href="https://store.steampowered.com/app/346010/"><img src="https://img.shields.io/badge/🎮-Besiege-purple?style=for-the-badge" alt="Besiege"></a>
<a href="LICENSE"><img src="https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey?style=for-the-badge" alt="License: CC BY-NC 4.0"></a>

<a href="https://ai4s.lab.westlake.edu.cn/"> <img src="asset/lab_logo.png" width="40%" alt="AI for Scientific Simulation and Discovery Lab" /> </a>

<img src="asset/car1.gif" width="30%" /> <img src="asset/brg1.gif" width="30%" /> <img src="asset/rkt1.gif" width="30%" />

🙏 Special Thanks

We are grateful to Spiderling Studios for creating Besiege, the inspiring physics sandbox that underpins our work. We also thank the developers of the open-source projects Lua Scripting Mod and Besiege Creation Import Addon for Blender for their valuable contributions to the community.

We also gratefully acknowledge the support of the Westlake University Research Center for Industries of the Future.

📅 Timeline

2025-10-17 🚀 Repository launched with baseline implementation
2025-10-20 📄 Preprint paper released on arXiv: 2510.16559
Ongoing 🔧 Active development and code updates

Status: We are actively developing and improving the codebase. Stay tuned for further updates!

🏆 Performance Leaderboard

We evaluate eight frontier large language models on Build Arena across three task categories (Transport, Support, Lift) and three difficulty levels (Lv.1 Easy, Lv.2 Medium, Lv.3 Hard) under our baseline agentic workflow. Performance is measured by success rate, with 64 samples per task-model pair to ensure statistical reliability.

| Rank | Model | Full Model Name | Transport Avg Success Rate | Support Avg Success Rate | Lift Avg Success Rate | Overall Performance |
|:----:|-------|----------------|:-----------------------------:|:---------------------------:|:------------------------:|:-------------------:|
| 🥇 | Grok-4 | grok4-0709 | 11.5% | 20.8% | 21.9% | Excellent |
| 🥈 | Claude-4 | claude-sonnet-4-20250514 | 12.5% | 3.1% | 4.2% | Good |
| 🥉 | Seed-1.6 | doubao-seed-1-6-250615 | 6.2% | 19.3% | 2.1% | Good |
| 4 | GPT-4o | gpt-4o | 6.2% | 13.5% | 3.6% | Moderate |
| 5 | Kimi-K2 | kimi-k2-turbo-preview | 4.7% | 11.5% | 5.2% | Moderate |
| 6 | Qwen-3 | qwen-plus (Qwen3 series) | 5.7% | 5.7% | 1.0% | Moderate |
| 7 | DeepSeek-3.1 | deepseek-chat (DeepSeek-V3.1) | 2.6% | 8.3% | 3.6% | Moderate |
| 8 | Gemini-2.0 | gemini-2.0-flash | 1.6% | 7.8% | 0.0% | Moderate |

Success rates are averaged across all three difficulty levels (Lv.1, Lv.2, Lv.3) for each task category under our baseline agentic workflow. Full model snapshots and detailed experimental setup are available in the paper appendix.
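As a small worked example of how one table entry is formed, the sketch below averages per-level success rates for a single task category. The function names and the per-level success counts are hypothetical and chosen only to illustrate the arithmetic; they are not results from the paper.

```python
from statistics import mean


def success_rate(successes: int, samples: int) -> float:
    """Success rate in percent for one task-model pair."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return 100.0 * successes / samples


def category_average(per_level_successes: dict, samples_per_level: int = 64) -> float:
    """Average the per-level success rates of one task category."""
    return mean(
        success_rate(count, samples_per_level)
        for count in per_level_successes.values()
    )


# Hypothetical success counts out of 64 samples per level (illustration only).
example = {"Lv.1": 20, "Lv.2": 10, "Lv.3": 2}
print(f"{category_average(example):.1f}%")  # 16.7%
```

Because every level uses the same number of samples, this average equals the pooled rate over all 192 runs of the category.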

Multi-Dimensional Performance Analysis

<img src="asset/radar_plot.jpg" width="70%" alt="Radar Plot: Performance of different LLMs against six dimensions of task difficulty" /> Figure: Performance of different LLMs against six dimensions of task difficulty: Quantification (Q), Robustness (R), Magnitude (M), Compositionality (C), Precision (P), Ambiguity (A).

📦 Installation

Step 1: Install uv Package Manager

Install uv by following its official installation guide.

Step 2: Synchronize Virtual Environment

uv sync

Step 3: Configure API Keys and Paths

Create a config.py file in the project root directory with the following content:

💡 Tip: For the UI position coordinates below, you can keep the default values for now. Later, when you need to run simulations, we provide a convenient find_coords tool to help you calibrate these positions for your specific screen setup (see Step 6 in Simulation Process).

# Directory where all machines are saved as BSG files.
# The SavedMachines folder of your Besiege installation (reachable through Steam) is recommended.
SavedMachines = "/path/to/Besiege/Contents/SavedMachines"

# API keys for the LLMs
# Leave a placeholder unchanged if you do not have that key
API_KEY_OAI = "<Your OpenAI API key>"
API_KEY_DS = "<Your DeepSeek API key>"
API_KEY_ANT = "<Your Anthropic API key>"
API_KEY_ARC = "<Your Arc API key>"
API_KEY_XAI = "<Your XAI API key>"
API_KEY_MS = "<Your Moonshot API key>"
API_KEY_ALI = "<Your Aliyun API key>"
API_KEY_GOOGLE = "<Your Google API key>"

# Fractional click positions for UI automation: x runs horizontally from 0 (left) to 1 (right), y runs vertically from 0 (top) to 1 (bottom)
# You can keep these default values and calibrate them later using the find_coords tool
# POS_OPEN_FOLDER: Button that opens the folder for loading a machine, on the left part of the top bar
POS_OPEN_FOLDER = (0.202, 0.035)
# POS_ENTER_NAME: The machine name input box
POS_ENTER_NAME = (0.476, 0.215)
# POS_OPEN_MACHINE: The open-machine button, to the right of the machine name input box
POS_OPEN_MACHINE = (0.638, 0.209)
# POS_SET_GROUND: The set-ground button, in the middle of the top bar
POS_SET_GROUND = (0.403, 0.0185)
# POS_LOG_WINDOW: The Lua scripting log window, on the right side of the Lua panel
POS_LOG_WINDOW = (0.185, 0.172)
# POS_EMPTY_SPACE: Any position with no button or machine, clicked to reset the UI
POS_EMPTY_SPACE = (0.034, 0.726)
# POS_START_SIMU: The start button in the upper-left corner
POS_START_SIMU = (0.021, 0.016)
# POS_DELETE: The button that deletes the entire machine
POS_DELETE = (0.707, 0.038)
# POS_CONFIRM: The "yes" confirmation button shown after clicking the delete button
POS_CONFIRM = (0.538, 0.586)

Note: Replace all placeholder values (paths and API keys) with your actual configuration.
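Before a long run it can help to sanity-check the values in config.py. The sketch below is not part of the repository: the helper names `unset_keys` and `invalid_positions` are hypothetical, and it assumes that an unfilled key still has the `<Your ... API key>` placeholder form and that every position is an (x, y) pair of fractions in the range 0 to 1.

```python
def unset_keys(keys: dict) -> list:
    """Return the names of API keys that still hold a '<...>' placeholder."""
    return [
        name
        for name, value in keys.items()
        if value.startswith("<") and value.endswith(">")
    ]


def invalid_positions(positions: dict) -> list:
    """Return the names of positions with a coordinate outside [0, 1]."""
    return [
        name
        for name, (x, y) in positions.items()
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0)
    ]


# Illustrative values only: the second key and the second position are made up,
# and POS_DELETE is deliberately out of range to show what gets reported.
keys = {"API_KEY_OAI": "<Your OpenAI API key>", "API_KEY_DS": "sk-example"}
positions = {"POS_START_SIMU": (0.021, 0.016), "POS_DELETE": (0.707, 1.38)}

print(unset_keys(keys))              # ['API_KEY_OAI']
print(invalid_positions(positions))  # ['POS_DELETE']
```

An unset key is only a problem for the models that actually use it, so a report like this is a hint rather than an error.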

🚀 Usage

📚 3D Spatial Computation Library

The intro.ipynb notebook provides a detailed introduction and demonstration of the library's spatial computation functions.

🏗️ Construction Process with Default Tasks

Note: The construction process runs independently without requiring the Besiege game.

1. Task Configuration

Task details for different categories and levels can be found in levels.yaml.

2. Start Construction

Run the run_construction.py script to start the construction process:

uv run -m script.run_construction \
  --model gpt-4o \
  --category transport \
  --level soft \
  --n_sample 64 \
  --n_worker 4
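When several runs are needed, the same command line can be assembled programmatically. This is a minimal sketch and not a script shipped with the repository: `build_command` is a hypothetical helper, and it uses only the flags and values shown in the command above.

```python
import shlex


def build_command(model, category, level, n_sample=64, n_worker=4):
    """Assemble the argument list for one construction run."""
    return [
        "uv", "run", "-m", "script.run_construction",
        "--model", model,
        "--category", category,
        "--level", level,
        "--n_sample", str(n_sample),
        "--n_worker", str(n_worker),
    ]


# Print the command instead of running it, so the sketch has no side effects.
command = build_command("gpt-4o", "transport", "soft")
print(shlex.join(command))
```

Passing the returned list to `subprocess.run` from the project root would launch the run; valid values for the other categories and levels should be taken from levels.yaml rather than guessed.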

3. Monitor Progress

You can monitor the process status in the task database:

datacache/{category}_{level}_{model}_{timestamp}/task_database.db
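The database schema is not documented here, so the sketch below only lists the tables and their row counts instead of assuming column names. It assumes that task_database.db is a SQLite file, which the extension suggests but the source does not state. The `demo_tasks` table is a throwaway stand-in created so that the example runs on its own.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def table_row_counts(db_path):
    """Map each table name in a SQLite file to its row count (read-only)."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }


# Demo on a temporary database with a hypothetical table.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = Path(tmp) / "task_database.db"
    with closing(sqlite3.connect(demo_db)) as conn:
        conn.execute("CREATE TABLE demo_tasks (id INTEGER PRIMARY KEY, status TEXT)")
        conn.executemany(
            "INSERT INTO demo_tasks (status) VALUES (?)", [("done",), ("pending",)]
        )
        conn.commit()
    counts = table_row_counts(demo_db)

print(counts)  # {'demo_tasks': 2}
```

Opening the file with `mode=ro` keeps the inspection from interfering with a construction run that is still writing to the database.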

4. View Results

The construction result BSG files can be imported into the Besiege game for viewing.

🎮 Simulation Process with Default Tasks

⚠️ Important: This repository DOES NOT contain Besiege, which is commercial software and is required for simulation. The simulation scripts have only been tested and verified on Windows.

1. Purchase and Install Besiege

Purchase the game Besiege through the Steam platform.

2. Install Lua Scripting Mod

Subscribe to the Lua Scripting Mod through the Besiege Steam Workshop. Steam will automatically download and install the mod. Restart the game if it is already open.

3. Configure SavedMachines Path

Right-click the game in Steam and select "Manage → Browse local files". Locate the SavedMachines folder and add its path to your config.py, so the constructed machines can be accessed in the game.
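A quick way to confirm that the configured path is correct is to list the BSG files it contains. The helper below is a hypothetical sketch, not a function of this repository; the demo uses a temporary folder with made-up file names so that it runs without Besiege installed.

```python
import tempfile
from pathlib import Path


def list_machines(saved_machines):
    """Return the sorted names of the .bsg files in a SavedMachines folder."""
    folder = Path(saved_machines)
    if not folder.is_dir():
        raise FileNotFoundError(f"SavedMachines folder not found: {folder}")
    return sorted(path.name for path in folder.glob("*.bsg"))


# Demo on a temporary folder standing in for the real SavedMachines directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo_car.bsg").touch()
    (Path(tmp) / "notes.txt").touch()
    found = list_machines(tmp)

print(found)  # ['demo_car.bsg']
```

In practice you would pass the `SavedMachines` value from config.py; an empty list after a construction run suggests that the path points somewhere other than the folder the game reads.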

4. Activate Lua Scripting Mod

Start the game and ensure the Lua Scripting Mod is activated.

5. Prepare Sandbox Environment

Enter the last sandbox on the right. Press Ctrl+L to show the Lua Scripting Mod panel, then move it slightly so that it does not block the start button.

6. Calibrate UI Positions (Optional)

If needed, update the position constants in config.py. We provide a tool to find fractional coordinates:

uv run -m script.find_coords

Press p to print coordinates
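The fractional coordinates in config.py relate to pixel positions by a simple division. The sketch below shows that arithmetic only; it is not the implementation of find_coords, and the function name, the pixel position, and the 1920x1080 screen size are assumptions made for illustration.

```python
def to_fraction(px, py, width, height, digits=3):
    """Convert a pixel position to fractional (x, y) coordinates in [0, 1]."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if not (0 <= px <= width and 0 <= py <= height):
        raise ValueError(f"pixel position outside the screen: ({px}, {py})")
    return round(px / width, digits), round(py / height, digits)


# Hypothetical cursor position on an assumed 1920x1080 screen.
print(to_fraction(388, 38, 1920, 1080))  # (0.202, 0.035)
```

Fractions keep the constants independent of resolution, but a different aspect ratio or window mode can still move the UI elements, which is why recalibration may be needed.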
