# CodeAssistBench
A benchmark for evaluating AI coding assistants on real GitHub issues. This project includes a curated dataset of GitHub issues with Dockerfiles for reproducible evaluation, plus tools for dataset creation and AI agent evaluation.
## ⚡ Quick Run (5 minutes)

Get started immediately with our pre-built dataset:

```bash
# 1. Clone and install
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
pip install -r requirements.txt && pip install -e .

# 2. Set AWS credentials (for Bedrock)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

# 3. Run evaluation on Python issues (using human-verified dataset)
python -m cab_evaluation.cli generation-dataset \
  dataset/cab_verified_v2.jsonl \
  --output results/quick_test.jsonl \
  --agent-models '{"maintainer": "haiku", "user": "haiku"}' \
  --language python

# 4. Judge the results
python -m cab_evaluation.cli evaluation-dataset \
  results/quick_test.jsonl \
  --output results/quick_eval.jsonl \
  --agent-models '{"judge": "haiku"}'

# 5. View results
python -c "
import json
with open('results/quick_eval.jsonl') as f:
    for line in f:
        r = json.loads(line)
        print(f\"{r['issue_id']}: {r['verdict']}\")
"
```
**What this does:**
- Generates maintainer responses for Python issues using Claude Haiku (fast & cheap)
- Evaluates responses with a judge agent
- Outputs verdicts: `CORRECT`, `PARTIALLY_CORRECT`, `INCORRECT`, or `ERROR`

For production evaluation, use `sonnet4` or `opus` models instead of `haiku`.
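To get a quick summary instead of one line per issue, the judged JSONL can be aggregated. This is a sketch, not part of the CLI — it only assumes each result line carries the `verdict` field shown in step 5 above, and `tally_verdicts` is a hypothetical helper name:

```python
import json
from collections import Counter

def tally_verdicts(path):
    """Count verdicts in a judged JSONL file (one result object per line)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Lines missing a verdict are counted as ERROR (an assumption).
            counts[json.loads(line).get("verdict", "ERROR")] += 1
    return counts
```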
## 📊 Dataset Overview

CodeAssistBench provides four ready-to-use datasets. We recommend `cab_verified_v2.jsonl` for evaluation — it contains 274 human-verified, high-quality issues (scored 4+ out of 5 by annotators):

| Dataset | Issues | Languages | Description |
|---------|--------|-----------|-------------|
| `dataset/cab_verified_v2.jsonl` | 274 | 7 | ⭐ **Recommended** — Human-verified subset from annotation |
| `dataset/cab_recent_v2.jsonl` | 771 | 7 | Full dataset — June 2025 - Jan 2026 (with satisfaction conditions & classification) |
| `dataset/cab_recent.jsonl` | 308 | 7 | Earlier recent issues (June 2025 - Jan 2026) |
| `dataset/cab_verified.jsonl` | 149 | 7 | Legacy verified subset with tested Dockerfiles |
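Each dataset file is JSONL: one issue object per line. A minimal sketch for loading a filtered subset — `load_subset` is a hypothetical helper, and it only assumes the `language` field documented under Dataset Fields:

```python
import json

def load_subset(jsonl_path, language=None, max_issues=None):
    """Load issues from a CAB JSONL dataset, optionally filtered by language."""
    issues = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            issue = json.loads(line)
            if language and issue.get("language") != language:
                continue
            issues.append(issue)
            if max_issues and len(issues) >= max_issues:
                break  # stop early once enough issues are collected
    return issues
```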
### Dataset Fields

Each issue in the dataset contains:

```json
{
  "number": 1234,
  "title": "Bug: Memory leak in parser",
  "created_at": "2025-07-15T10:30:00Z",
  "closed_at": "2025-07-20T14:22:00Z",
  "commit_id": "abc123def456...",
  "labels": ["bug", "parser"],
  "url": "https://github.com/owner/repo/issues/1234",
  "body": "When parsing large files, memory usage grows unbounded...",
  "author": "user123",
  "comments": [
    {
      "user": "maintainer",
      "created_at": "2025-07-16T08:00:00Z",
      "body": "Thanks for reporting! Can you share the file?"
    }
  ],
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue",
    "No regression in parsing speed for normal files"
  ],
  "_classification": {
    "category": "Can be dockerized without any issue",
    "timestamp": "2025-04-14 01:01:54"
  },
  "dockerfile": "FROM python:3.11-slim\n...",
  "language": "python"
}
```
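A quick sanity check before running an evaluation is to confirm each issue carries the fields above. Treating this particular field list as required is an assumption, and `missing_fields` is a hypothetical helper:

```python
# Fields an evaluation run relies on (an assumption based on the schema above).
REQUIRED_FIELDS = [
    "number", "title", "url", "body", "comments",
    "satisfaction_conditions", "dockerfile", "language",
]

def missing_fields(issue):
    """Return the required dataset fields absent from an issue dict."""
    return [f for f in REQUIRED_FIELDS if f not in issue]
```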
## 🛠️ Step-by-Step: Generate Your Own Dataset

This section walks through how we generated the dataset from scratch using AWS Bedrock and Strands AI agents.

### Prerequisites

```bash
# 1. Clone and setup
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
python3 -m venv venv
source venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt
pip install -e .

# 3. Install Strands SDK (required for Dockerfile generation)
pip install strands-agents strands-agents-tools
pip install -e tools/

# 4. Set up LLM credentials (choose ONE option)
# Option A: AWS Bedrock (Claude models)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

# Option B: OpenAI (GPT-5 models)
export OPENAI_API_KEY=your_openai_api_key

# 5. Set up GitHub token (for API access)
export GITHUB_TOKEN=your_github_personal_access_token
```
### Step 1: Collect GitHub Issues

Collect closed issues from popular repositories. The script uses interactive prompts:

```bash
python script/get_github_issue.py
# Enter CSV path when prompted (see script/python_repos*.csv for examples)
# Choose label-based filtering (y/n)
```

Or use the bulk collection script:

```bash
python script/collect_1000_issues.py
# Edit the script to set: language, min_stars, date range
```

**Output:** `github_issues_<owner>_<repo>_<timestamp>.json`

```json
[
  {
    "number": 1234,
    "title": "Bug: Memory leak in parser",
    "url": "https://github.com/owner/repo/issues/1234",
    "body": "When parsing large files...",
    "comments": [...]
  }
]
```
### Step 2: Get Commit IDs

Find the commit hash at the time each issue was closed:

```bash
python script/get_github_commit.py \
  --input-dir my_data/collected_issues \
  --output-dir my_data/with_commits

# Or using short options:
python script/get_github_commit.py -i my_data/collected_issues -o my_data/with_commits
```

Arguments:

| Argument | Required | Description |
|----------|----------|-------------|
| `--input-dir`, `-i` | Yes | Directory containing JSON files with issues |
| `--output-dir`, `-o` | No | Output directory (default: `github_commits`) |

**Output:** Creates commit data files in the output directory.
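The core of this lookup — choosing the newest commit made at or before the issue's `closed_at` timestamp — can be sketched as a pure function over simplified commit records. The `sha`/`date` record shape and the helper name are illustrative assumptions, not the script's actual internals:

```python
from datetime import datetime

def _parse(ts):
    # datetime.fromisoformat (3.7+) rejects a trailing 'Z', so normalize it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def commit_at_close(commits, closed_at):
    """Given records like [{"sha": ..., "date": "2025-07-19T00:00:00Z"}, ...],
    return the sha of the newest commit at or before closed_at, else None."""
    cutoff = _parse(closed_at)
    eligible = [c for c in commits if _parse(c["date"]) <= cutoff]
    if not eligible:
        return None
    return max(eligible, key=lambda c: _parse(c["date"]))["sha"]
```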
### Step 3: Generate Satisfaction Conditions (Uses LLM)

Use an LLM to generate explicit criteria for issue resolution:

```bash
python script/scon_filter.py \
  --input-dir my_data/collected_issues \
  --output-dir my_data/with_scon

# With custom model and region:
python script/scon_filter.py \
  -i my_data/collected_issues \
  -o my_data/with_scon \
  --model us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --region us-west-2
```

Arguments:

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--input-dir`, `-i` | Yes | - | Directory containing JSON files with issues |
| `--output-dir`, `-o` | Yes | - | Output directory for issues with satisfaction conditions |
| `--model`, `-m` | No | claude-sonnet-4.5 | Bedrock model ID |
| `--region`, `-r` | No | us-west-2 | AWS region for Bedrock |

**Output:** Adds a `satisfaction_conditions` field:

```json
{
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue"
  ]
}
```
### Step 4: Classify Dockerizability (Uses LLM)

Classify issues by whether they need a Docker environment:

```bash
python script/docker_filter.py \
  --input-dir my_data/with_scon \
  --output-dir my_data/classified

# With custom region:
python script/docker_filter.py \
  -i my_data/with_scon \
  -o my_data/classified \
  --region us-east-1
```

Arguments:

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--input-dir`, `-i` | Yes | - | Directory containing JSON files with issues |
| `--output-dir`, `-o` | Yes | - | Output directory for classified issues |
| `--region`, `-r` | No | us-west-2 | AWS region for Bedrock |

Output structure:

```
my_data/classified/
├── need_docker/             # Issues that need Docker environment
├── no_need_docker/          # Documentation/config changes
├── need_docker_but_cannot/  # Hardware-specific issues
├── llm_responses/           # Raw LLM responses for debugging
└── processed_issues.json    # Resume checkpoint
```
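Because classification makes one LLM call per issue, `processed_issues.json` acts as a resume checkpoint: an interrupted run can skip issues it has already handled. A minimal sketch of that pattern — the checkpoint is assumed here to be a flat JSON list of issue URLs, which may differ from the script's actual format:

```python
import json
import os

def load_checkpoint(path):
    """Load the set of already-processed issue URLs, if the checkpoint exists."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))  # assumed format: a JSON list of URLs

def save_checkpoint(path, processed):
    """Persist processed issue URLs so an interrupted run can resume."""
    with open(path, "w") as f:
        json.dump(sorted(processed), f)
```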
### Step 5: Generate Dockerfiles (Uses Strands + LLM)

⚠️ This step requires Strands AI agents to automatically generate and test Dockerfiles:

```bash
# Option A: Using AWS Bedrock (Claude) - default
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
  --input-dir my_data/classified/need_docker \
  --languages python \
  --max-attempts 3 \
  --parallel 2 \
  --agent-timeout 180 \
  --issue-timeout 600

# Option B: Using OpenAI (GPT-5)
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
  --input-dir my_data/classified/need_docker \
  --languages python \
  --max-attempts 3 \
  --parallel 2 \
  --agent-timeout 180 \
  --issue-timeout 600 \
  --model-id gpt5 \
  --provider openai
```
**What happens:**
1. The Strands agent reads the issue and repository structure
2. The agent generates a Dockerfile based on the repo's build system
3. Docker builds the image to verify it works
4. If the build fails, the agent iterates with error feedback
5. On success, the Dockerfile is saved to the issue JSON

**Output:** Adds a `dockerfile` field:

```json
{
  "dockerfile": "FROM python:3.11-slim\n\nWORKDIR /workspace\n\nRUN apt-get update && apt-get install -y git\n\nRUN git clone https://github.com/owner/repo.git . && \\\n git checkout abc123def456\n\nRUN pip install -r requirements.txt\n\nCMD [\"pytest\", \"tests/\"]\n"
}
```
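Since each generated Dockerfile pins the repository to the issue's commit, the pin can be recovered from the `dockerfile` string with a regex. This sketch assumes the `git checkout <sha>` form shown above; `pinned_commit` is a hypothetical helper:

```python
import re

def pinned_commit(dockerfile):
    """Extract the commit pinned by a 'git checkout <sha>' line, if any."""
    m = re.search(r"git checkout ([0-9a-f]{7,40})", dockerfile)
    return m.group(1) if m else None
```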
### Step 6: Convert to Final Dataset

Combine all processed issues into a single JSONL file:

```bash
python script/convert_to_jsonl.py \
  --input-dir my_data/classified/need_docker \
  --output my_data/my_dataset.jsonl
```
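Conceptually, this conversion flattens the per-repo JSON arrays from earlier steps into one issue object per line. A minimal sketch of that step (not the script's actual implementation):

```python
import json
from pathlib import Path

def convert_to_jsonl(input_dir, output_path):
    """Flatten every JSON array in input_dir into a single JSONL file.
    Returns the number of issues written."""
    count = 0
    with open(output_path, "w") as out:
        for json_file in sorted(Path(input_dir).glob("*.json")):
            issues = json.loads(json_file.read_text())
            for issue in issues:
                out.write(json.dumps(issue) + "\n")
                count += 1
    return count
```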
## 🧪 End-to-End Example

Here's a complete walkthrough processing 2 test issues through the entire pipeline:

### Setup

```bash
cd CodeAssistBench

# Set up credentials (AWS Bedrock + GitHub)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2
export GITHUB_TOKEN=your_github_token
```

### Step 1: Create Test Data

Create a directory with sample issues:

```bash
mkdir -p test_pipeline/step1_raw
```

Create `test_pipeline/step1_raw/test_issues.json`:
[
{
"number": 1234,
"title": "How to handle async operations in Python?",
"created_at": "2025-07-15T10:30:00Z",
