GitTaskBench

Repo-level benchmark for real-world Code Agents: from repo understanding → env setup → incremental dev/bug-fixing → task delivery, with cost-aware α metric.

<div align="center"> <h1 align="center" style="color: #2196F3; font-size: 24px; font-weight: 600; margin: 20px 0; line-height: 1.4;"> 🚀 GitTaskBench: <span style="color: #555; font-weight: 400; font-size: 18px;"><em>A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging</em></span> </h1> <p style="margin: 20px 0;"> <a href="https://arxiv.org/pdf/2508.18993"><img src="https://img.shields.io/badge/arXiv-2508.18993-B31B1B.svg?style=flat-square&logo=arxiv&logoColor=white" /></a> <a href="https://gittaskbench.github.io"><img src="https://img.shields.io/badge/🌐_LeaderBoard-GitTaskBench-4A90E2.svg?style=flat-square&logo=github&logoColor=white" /></a> <a href="https://github.com/QuantaAlpha/RepoMaster"><img src="https://img.shields.io/badge/Agent-RepoMaster-4A90E2.svg?style=flat-square&logo=github&logoColor=white" /></a> <a href="https://quantaalpha.github.io/"><img src="https://img.shields.io/badge/Team-QuantaAlpha-00A98F.svg?style=flat-square&logo=opensourceinitiative&logoColor=white" /></a> </p> <a href="https://gittaskbench.github.io/"> <img src="figs/leaderboard.png" width="800" /><br> </a> </div>

📰 News

  • 2025.09.19 🎉 Excited to announce that our papers have been accepted to <u>NeurIPS 2025</u>: RepoMaster as a Spotlight (≈3.2%) and SE-Agent as a Poster (≈24.52%)!

  • 2025.08.28 🎉 We open-sourced RepoMaster — an AI agent that leverages GitHub repos to solve complex real-world tasks.

  • 2025.08.26 🎉 We open-sourced GitTaskBench — a repo-level benchmark & tooling suite for real-world tasks.

  • 2025.08.10 🎉 We open-sourced SE-Agent — a self-evolution trajectory framework for multi-step reasoning.

🔗 Ecosystem: RepoMaster · GitTaskBench · SE-Agent · Team Homepage

🧭 Motivation and Goal

The ultimate vision for AI agents is to enable users to accomplish real-world tasks simply by describing their needs in natural language—leaving all planning and execution to the agent, which delivers the final results autonomously.

<p align="center"> <img src="./figs/README_simple-intro.jpg" width="800" /><br> </p>

⚠️ While existing benchmarks evaluate various agent capabilities, few focus on tasks that reflect genuine real-world practicality, especially those requiring comprehensive understanding and use of full-scale project repositories.

👋 To address this gap, we introduce GitTaskBench. Our benchmark focuses on tasks whose complexity and practical value demand leveraging repository-level code, mirroring how developers solve real problems using existing GitHub projects.

<p align="center"> <img src="./figs/overview.jpg" width="700" /><br> <em>Overview of GitTaskBench. 7 example real-life tasks from different modalities and their evaluations are shown. </em> </p>

🔍 We carefully selected 54 representative tasks with real-world economic value, and for each task, searched and identified a corresponding GitHub repository that meets strict selection criteria (the repository for each task is fixed to ensure benchmark completeness, as some agent frameworks do not support searching for appropriate repositories). This setup allows us to systematically evaluate LLM agents' ability to utilize open-source repositories to solve complex, realistic problems.

👉 By doing so, GitTaskBench offers a more authentic and comprehensive assessment of agent performance in practical, repository-driven environments.

🚀 How to Run

⚡ If you only want to know how to use GitTaskBench, start here.

0. Directory structure

└── QuantaAlpha/GitTaskBench/

├── README.md
├── setup.py
├── requirements.txt
├── Task_Success_Criteria.xlsx   # listed clearly
├── code_base/                   # all used repositories
│   ├── AnimeGANv3/
│   └── ...
├── queries/                     # all task definitions
│   ├── AnimeGANv3_01/
│   │   └── query.json
│   ├── AnimeGANv3_02/
│   │   └── query.json
│   └── ...
├── run_auto_prompt/             # generate all prompts
│   ├── new_run_setup.py
│   └── get_new_run_prompt.sh
├── Aider/                       # agent framework
│   └── ... 
├── SWE_agent/                   # agent framework
│   └── ... 
├── OpenHands/                   # agent framework
│   └── ...
├── config/                      # task evaluation configs
│   ├── AnimeGANv3_01/
│   │   └── task_info.yaml
│   ├── AnimeGANv3_02/
│   │   └── task_info.yaml
│   ├── AnimeGANv3_03/
│   └── ...
├── groundtruth/                 # ground truth
│   ├── Trafilatura_02/
│   │   └── gt.md
│   └── Trafilatura_03/...
├── output_for_show/             #  agent's outputs
│   ├── AnimeGANv3_01/
│   │   └── output.png
│   └── AnimeGANv3_02/...
├── gittaskbench/                # evaluation settings
│   ├── __init__.py
│   └── ...
├── test_scripts/                # test scripts
│   ├── AnimeGANv3_01/
│   │   └── test_script.py
│   ├── AnimeGANv3_02/
│   │   └── test_script.py
│   └──...
├── test_results_for_show/       # analysis results
│   ├── AnimeGANv3_02/
│   │   └── results.jsonl
│   └──...
└── test_reports/                # summary report
    ├── evaluation_report_openhands_gpt4o_100iters.txt
    ├── evaluation_report_openhands_gpt4o_70iters.txt
    ├── evaluation_report_openhands_gpt4o_30iters.txt
    └── ...
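Each task's evaluation settings live in `config/<task_id>/task_info.yaml`. The exact schema is not shown in this README, so the sketch below is purely illustrative — the field names are assumptions, not the benchmark's actual keys:

```yaml
# Hypothetical task_info.yaml sketch -- field names are illustrative
# assumptions, NOT the actual GitTaskBench schema.
taskid: Trafilatura_01
repo: code_base/trafilatura          # repository the agent must leverage
output: ./output/Trafilatura_01      # where the agent writes its result
test_script: test_scripts/Trafilatura_01/test_script.py
groundtruth: groundtruth/Trafilatura_01
```

Check the real files under `config/` for the authoritative layout.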

1. Set Up ⚙️

<a name="set-up"></a> GitTaskBench offers easy-to-use shell commands to ensure reproducible evaluations. To build GitTaskBench from source, follow the steps below.

First, create a new conda environment:

conda create -n gittaskbench python=3.10 -y
conda activate gittaskbench

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 \
  --extra-index-url https://download.pytorch.org/whl/cu113

Then, you can install gittaskbench with pip:

git clone https://github.com/QuantaAlpha/GitTaskBench.git
cd GitTaskBench
# config
pip install -e .

Alternatively, you can install the dependencies directly:

# config
pip install -r requirements.txt

2. Quick Start 💡

  • Single Task Evaluation:

If you need to evaluate a single, specific task, you can use the following command. The example below shows how to evaluate the Trafilatura_01 task:

cd GitTaskBench
# The outputs are saved in the DEFAULT "./output" directory, for example: "./output/Trafilatura_01/output.txt"
gittaskbench grade --taskid Trafilatura_01

Running the command will produce an analysis report (.jsonl) at the DEFAULT path (./test_results/Trafilatura_01). See test_results_for_show/ for a sample.
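The per-task `.jsonl` report can be post-processed with a few lines of Python. The record layout below (`taskid`, `Process`, `Result`) is an assumption for illustration — check your actual `results.jsonl` for the real keys:

```python
import json

# Hypothetical records -- real reports live under ./test_results/<taskid>/.
# The field names ("Process", "Result") are assumptions; adjust them to
# the actual keys in your results.jsonl.
records = [
    {"taskid": "Trafilatura_01", "Process": True, "Result": True},
    {"taskid": "Trafilatura_02", "Process": True, "Result": False},
]

def summarize(lines):
    """Count how many runs executed (Process) and how many passed (Result)."""
    executed = sum(1 for r in lines if r.get("Process"))
    passed = sum(1 for r in lines if r.get("Result"))
    return executed, passed

# Parsing a real report is one json.loads per line:
raw = "\n".join(json.dumps(r) for r in records)
parsed = [json.loads(line) for line in raw.splitlines() if line.strip()]
print(summarize(parsed))  # → (2, 1)
```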

The complete commands can be found in the 🤖 Automation Evaluation section.

  • All Tasks Evaluation

When you need to evaluate all tasks, you can use the --all parameter. This command will automatically iterate through and execute the evaluation of all tasks:

gittaskbench grade --all

  • Test Results Analysis

After completing the evaluation, you can analyze and summarize the test results with the `gittaskbench eval` command. It analyzes the evaluation results in the specified directory and outputs an analysis report (.txt):

gittaskbench eval

See test_reports/ for a sample.
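Conceptually, the summary report boils per-task results down to two rates: how many tasks the agent executed to completion and how many actually passed their tests. A minimal sketch of that aggregation (the dict layout is an assumption, not the tool's internal format):

```python
# Sketch of the aggregation behind a summary report. The per-task dict
# layout here is an assumption, not gittaskbench's internal format.
task_results = {
    "AnimeGANv3_01": {"executed": True, "passed": True},
    "AnimeGANv3_02": {"executed": True, "passed": False},
    "Trafilatura_01": {"executed": False, "passed": False},
}

def report(results):
    """Aggregate per-task outcomes into execution and pass rates."""
    total = len(results)
    exec_rate = sum(r["executed"] for r in results.values()) / total
    pass_rate = sum(r["passed"] for r in results.values()) / total
    return {"tasks": total,
            "execution_rate": round(exec_rate, 4),
            "pass_rate": round(pass_rate, 4)}

print(report(task_results))
# → {'tasks': 3, 'execution_rate': 0.6667, 'pass_rate': 0.3333}
```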

👉 That’s it. With the above commands, you can run and analyze agent performance on GitTaskBench.

📊 Benchmark Overview

GitTaskBench is a comprehensive benchmark designed to evaluate the capabilities of intelligent agents across multiple modalities and task complexities. It encompasses 54 tasks spanning 7 key domains.

Each domain features a curated set of tasks that reflect real-world applications and research challenges. These tasks assess an agent's autonomous ability to interpret complex instructions, process multi-modal inputs, perform reasoning, understand and explore the GitHub repositories, and deliver accurate, meaningful outputs.

The GitTaskBench data curation and processing pipeline is illustrated below.

<p align="center"> <img src="./figs/data_construction.jpg" width="900" /><br> <em>Overview of the GitTaskBench data curation and processing pipeline. </em> </p>

✅ Task Distribution

| Domain | Task List |
|--------|-----------|
| Image Processing | Style Transfer, Image Coloring, Image Restoration, Scratch Detection, Image Enhancement, Background Processing, Watermark Embedding |
| Video Processing | Video Action Analysis, Style Transfer, Video Coloring |
| Speech Processing | Speech Recognition, Speech Separation, Speech Enhancement, Noise Reduction, Speech Analysis |
| Physiological Signals Processing | EDA (Electrodermal Activity) Data Analysis, ECG (Electrocardiogram) Data Analysis, EOG (Electrooculogram) Data Analysis |
| Security and … | … |
