📐 GAGE: General AI evaluation and Gauge Engine
English · 中文
<div align="center">

📧 Contact: zhangrongjunchen@myhexin.com

Overview · Sample Schema · Game Arena · Agent Eval · Benchmark · Contributing · Standards

</div>

GAGE is a unified, extensible evaluation framework for large language models, multimodal (omni, robot) models, audio models, and diffusion models. It is a high-performance evaluation engine built for ultra-fast execution, scalability, and flexibility, providing a single interface for AI model evaluation, agent-based benchmarking, and game-arena evaluation.
🎮 Game Arena Showcase
<p align="center"><img src="docs/assets/F448C1D6-7E55-4A40-8A6B-169C421AEC15.gif" width="37.8571%" alt="Game Arena demo 1"><!-- --><img src="docs/assets/7CF87CFF-5C51-4209-8936-E406A5657381.gif" width="28.6905%" alt="Game Arena demo 2"><!-- --><img src="docs/assets/mahjong.gif" width="33.4524%" alt="Mahjong demo"></p>
<p align="center"> <img src="docs/assets/space-invaders-game.gif" width="33.3333%" alt="Space Invaders demo"> <img src="docs/assets/mario-game.gif" width="33.3333%" alt="Mario demo"> <img src="docs/assets/vizdoom-game.gif" width="32%" alt="VizDoom demo"> </p>

✨ Why GAGE?
- 🚀 Fastest Evaluation Engine: Built for speed. GAGE fully utilizes GPU and CPU resources to run evaluations as fast as possible, scaling smoothly from single-machine testing to million-sample, multi-cluster runs.
- 🔗 All-in-one Evaluation Interface: Evaluate any dataset × any model with minimal glue code. GAGE provides a unified abstraction over datasets, models, metrics, and runtimes, allowing new benchmarks or model backends to be onboarded in minutes.
- 🔌 Extensible (Game & Agent) Sandbox: Natively supports game-based evaluation, agent environments, GUI interaction sandboxes, and tool-augmented tasks. All environments run under the same evaluation engine, making it easy to benchmark LLMs, multimodal models, and agents in a unified way.
- 🧩 Inheritance-Driven Extensibility: Extend existing benchmarks by inheriting and overriding only what you need. Add new datasets, metrics, or evaluation logic without touching the core framework or rewriting boilerplate.
- 📡 Enterprise Observability: More than logs. GAGE provides real-time metrics and visibility into each evaluation stage, making it easy to monitor runs and quickly identify performance bottlenecks or failures.
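The inheritance pattern described above can be sketched as follows. The class names here are purely illustrative stand-ins, not GAGE's actual API; they only show the idea of overriding one method while reusing the rest:

```python
# Illustrative sketch only: these classes are hypothetical and do not
# reflect GAGE's real interfaces. The point is the pattern -- subclass a
# benchmark and override only the piece you need.

class SimpleBenchmark:
    """A minimal stand-in for a benchmark base class."""

    def load_samples(self):
        return [{"question": "2 + 2 = ?", "answer": "4"}]

    def score(self, sample, prediction):
        # Strict scoring: the prediction must match exactly.
        return 1.0 if prediction.strip() == sample["answer"] else 0.0


class LenientBenchmark(SimpleBenchmark):
    """Reuses sample loading, overrides only the scoring logic."""

    def score(self, sample, prediction):
        # Lenient scoring: accept answers embedded in longer responses.
        return 1.0 if sample["answer"] in prediction else 0.0


bench = LenientBenchmark()
sample = bench.load_samples()[0]
print(bench.score(sample, "The answer is 4."))  # prints 1.0
```

Everything else (data loading, batching, reporting) is inherited unchanged, which is the "override only what you need" claim in practice.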
🧭 Design Overview
Core Design Philosophy: Everything is a Step, Everything is configurable.
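The "Everything is a Step" idea can be illustrated with a minimal pipeline sketch. The `Step` type and the toy `load`/`infer`/`score` steps below are invented for illustration and are not GAGE's real step interface:

```python
# Hypothetical sketch of "Everything is a Step": each stage is a function
# from state to state, and a run is just a configured composition of steps.
from typing import Any, Callable, Dict, List

Step = Callable[[Dict[str, Any]], Dict[str, Any]]

def pipeline(steps: List[Step]) -> Step:
    """Compose configured steps into a single evaluation run."""
    def run(state: Dict[str, Any]) -> Dict[str, Any]:
        for step in steps:
            state = step(state)
        return state
    return run

def load(state):
    return {**state, "samples": [{"q": "2 + 2", "gold": "4"}]}

def infer(state):
    # A dummy backend that always answers "4".
    return {**state, "preds": ["4" for _ in state["samples"]]}

def score(state):
    hits = sum(p == s["gold"] for p, s in zip(state["preds"], state["samples"]))
    return {**state, "accuracy": hits / len(state["samples"])}

result = pipeline([load, infer, score])({})
print(result["accuracy"])  # prints 1.0
```

Because every stage shares one interface, swapping a dataset, model backend, or metric means swapping one step in the configured list rather than editing the engine.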
Architecture Design

Orchestration Design

GameArena Design

🚀 Quick Start
1. Installation
```bash
# If you're in a mono-repo root, run: cd gage-eval-main
# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
2. Run Demo
```bash
# Run the Echo demo (no GPU required; uses the Dummy backend)
python run.py \
  --config config/run_configs/demo_echo_run_1.yaml \
  --output-dir runs \
  --run-id demo_echo
```
3. View Reports
Default output structure:
```text
runs/<run_id>/
  events.jsonl    # Detailed event logs
  samples.jsonl   # Samples with inputs and outputs
  summary.json    # Final score summary
```
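Both `.jsonl` files can be inspected with a few lines of standard-library Python. The snippet below simulates a run directory so it is self-contained; the field names (`score`, `input`, `output`) are assumptions based on the layout above, not a documented schema:

```python
# Sketch of post-run inspection. Field names are illustrative assumptions;
# check your own runs/<run_id>/ output for the actual schema.
import json
import tempfile
from pathlib import Path

def read_jsonl(path: Path) -> list:
    """Parse one JSON object per non-empty line."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

# Simulate a run directory like runs/demo_echo/.
run_dir = Path(tempfile.mkdtemp()) / "demo_echo"
run_dir.mkdir(parents=True)
(run_dir / "summary.json").write_text(json.dumps({"score": 1.0}))
(run_dir / "samples.jsonl").write_text(
    json.dumps({"input": "hi", "output": "hi"}) + "\n"
)

summary = json.loads((run_dir / "summary.json").read_text())
samples = read_jsonl(run_dir / "samples.jsonl")
print(summary["score"], len(samples))  # prints: 1.0 1
```

The same `read_jsonl` helper works for `events.jsonl` when debugging a run stage by stage.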
📖 Advanced Configurations
| Scenario | Config Example | Description |
| :--- | :--- | :--- |
| Game Arena | config/custom/doudizhu/doudizhu_human_vs_llm.yaml | Doudizhu Human vs LLM match |
| Agent Evaluation | config/custom/appworld/appworld_official_jsonl.yaml | Use Appworld Sandbox |
| Code Gen | config/custom/swebench_pro/swebench_pro_smoke_agent.yaml | SWE-bench (Requires Docker, experimental) |
| Text | config/custom/aime24/aime2024_chat.yaml | Related: AIME 2024, AIME 2025, GPQA, Math500 |
| Multimodal | config/custom/mathvista/chat.yaml | Related: MME, HLE, MathVista |
| LLM Judge | config/custom/examples/single_task_local_judge_qwen.yaml | Use local LLM for grading |
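All of the configs above follow the same run-config pattern used in the Quick Start. As a rough orientation only, a run config typically declares which dataset, model backend, and metrics to wire together; every key below is invented for illustration, so consult the actual files under `config/` for the real schema:

```yaml
# Hypothetical illustration only -- not GAGE's real config schema.
# See config/run_configs/ and config/custom/ for working examples.
run:
  id: my_custom_eval
dataset:
  path: data/my_benchmark.jsonl
model:
  backend: openai_compatible
  endpoint: http://localhost:8000/v1
metrics:
  - exact_match
```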
🗺️ Roadmap
- 🤖 Agent Evaluation: Add native agent benchmarking support with tool-use traces, trajectory scoring, and safety checks.
- 🎮 GameArena Expansion: Grow the game catalog and add richer rulesets, schedulers, and evaluation metrics.
- 🛠️ Gage-Client: A dedicated client tool focused on streamlined configuration management, failure diagnostics, and benchmark onboarding.
- 🌐 Distributed Inference: Introduce a RoleType Controller architecture to support multi-node task sharding and load balancing for massive runs.
- 🚀 Benchmark Expansion: Continuously grow the evaluation suite across diverse domains with out-of-the-box configs and guidance.
⚠️ Status
This project is in internal validation; APIs, configs, and docs may change rapidly.
