# CodeAssistBench
A benchmark for evaluating AI coding assistants on real GitHub issues. This project includes a curated dataset of GitHub issues with Dockerfiles for reproducible evaluation, plus tools for dataset creation and AI agent evaluation.
## ⚡ Quick Run (5 minutes)

Get started immediately with our pre-built dataset:

```bash
# 1. Clone and install
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
pip install -r requirements.txt && pip install -e .

# 2. Set AWS credentials (for Bedrock)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

# 3. Run evaluation on Python issues (using human-verified dataset)
python -m cab_evaluation.cli generation-dataset \
  dataset/cab_verified_v2.jsonl \
  --output results/quick_test.jsonl \
  --agent-models '{"maintainer": "haiku", "user": "haiku"}' \
  --language python

# 4. Judge the results
python -m cab_evaluation.cli evaluation-dataset \
  results/quick_test.jsonl \
  --output results/quick_eval.jsonl \
  --agent-models '{"judge": "haiku"}'

# 5. View results
python -c "
import json
with open('results/quick_eval.jsonl') as f:
    for line in f:
        r = json.loads(line)
        print(f\"{r['issue_id']}: {r['verdict']}\")
"
```
**What this does:**
- Generates maintainer responses for Python issues using Claude Haiku (fast & cheap)
- Evaluates responses with a judge agent
- Outputs verdicts: `CORRECT`, `PARTIALLY_CORRECT`, `INCORRECT`, or `ERROR`

For production evaluation, use `sonnet4` or `opus` models instead of `haiku`.
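To get a quick summary instead of one line per issue, the judged JSONL can be aggregated. This is a sketch, not part of the CLI — it only assumes each result line carries the `verdict` field shown in step 5 above, and `tally_verdicts` is a hypothetical helper name:

```python
import json
from collections import Counter

def tally_verdicts(path):
    """Count verdicts in a judged JSONL file (one result object per line)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Lines missing a verdict are counted as ERROR (an assumption).
            counts[json.loads(line).get("verdict", "ERROR")] += 1
    return counts
```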
## 📊 Dataset Overview

CodeAssistBench provides four ready-to-use datasets. We recommend `cab_verified_v2.jsonl` for evaluation — it contains 274 human-verified, high-quality issues (scored 4+ out of 5 by annotators):

| Dataset | Issues | Languages | Description |
|---------|--------|-----------|-------------|
| `dataset/cab_verified_v2.jsonl` | 274 | 7 | ⭐ **Recommended** — Human-verified subset from annotation |
| `dataset/cab_recent_v2.jsonl` | 771 | 7 | Full dataset — June 2025 - Jan 2026 (with satisfaction conditions & classification) |
| `dataset/cab_recent.jsonl` | 308 | 7 | Earlier recent issues (June 2025 - Jan 2026) |
| `dataset/cab_verified.jsonl` | 149 | 7 | Legacy verified subset with tested Dockerfiles |
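Each dataset file is JSONL: one issue object per line. A minimal sketch for loading a filtered subset — `load_subset` is a hypothetical helper, and it only assumes the `language` field documented under Dataset Fields:

```python
import json

def load_subset(jsonl_path, language=None, max_issues=None):
    """Load issues from a CAB JSONL dataset, optionally filtered by language."""
    issues = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            issue = json.loads(line)
            if language and issue.get("language") != language:
                continue
            issues.append(issue)
            if max_issues and len(issues) >= max_issues:
                break  # stop early once enough issues are collected
    return issues
```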
### Dataset Fields

Each issue in the dataset contains:

```json
{
  "number": 1234,
  "title": "Bug: Memory leak in parser",
  "created_at": "2025-07-15T10:30:00Z",
  "closed_at": "2025-07-20T14:22:00Z",
  "commit_id": "abc123def456...",
  "labels": ["bug", "parser"],
  "url": "https://github.com/owner/repo/issues/1234",
  "body": "When parsing large files, memory usage grows unbounded...",
  "author": "user123",
  "comments": [
    {
      "user": "maintainer",
      "created_at": "2025-07-16T08:00:00Z",
      "body": "Thanks for reporting! Can you share the file?"
    }
  ],
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue",
    "No regression in parsing speed for normal files"
  ],
  "_classification": {
    "category": "Can be dockerized without any issue",
    "timestamp": "2025-04-14 01:01:54"
  },
  "dockerfile": "FROM python:3.11-slim\n...",
  "language": "python"
}
```
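A quick sanity check before running an evaluation is to confirm each issue carries the fields above. Treating this particular field list as required is an assumption, and `missing_fields` is a hypothetical helper:

```python
# Fields an evaluation run relies on (an assumption based on the schema above).
REQUIRED_FIELDS = [
    "number", "title", "url", "body", "comments",
    "satisfaction_conditions", "dockerfile", "language",
]

def missing_fields(issue):
    """Return the required dataset fields absent from an issue dict."""
    return [f for f in REQUIRED_FIELDS if f not in issue]
```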
## 🛠️ Step-by-Step: Generate Your Own Dataset

This section walks through how we generated the dataset from scratch using AWS Bedrock and Strands AI agents.

### Prerequisites

```bash
# 1. Clone and setup
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
python3 -m venv venv
source venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt
pip install -e .

# 3. Install Strands SDK (required for Dockerfile generation)
pip install strands-agents strands-agents-tools
pip install -e tools/

# 4. Set up LLM credentials (choose ONE option)
# Option A: AWS Bedrock (Claude models)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

# Option B: OpenAI (GPT-5 models)
export OPENAI_API_KEY=your_openai_api_key

# 5. Set up GitHub token (for API access)
export GITHUB_TOKEN=your_github_personal_access_token
```
### Step 1: Collect GitHub Issues

Collect closed issues from popular repositories. The script uses interactive prompts:

```bash
python script/get_github_issue.py
# Enter CSV path when prompted (see script/python_repos*.csv for examples)
# Choose label-based filtering (y/n)
```

Or use the bulk collection script:

```bash
python script/collect_1000_issues.py
# Edit the script to set: language, min_stars, date range
```

**Output:** `github_issues_<owner>_<repo>_<timestamp>.json`

```json
[
  {
    "number": 1234,
    "title": "Bug: Memory leak in parser",
    "url": "https://github.com/owner/repo/issues/1234",
    "body": "When parsing large files...",
    "comments": [...]
  }
]
```
### Step 2: Get Commit IDs

Find the commit hash at the time each issue was closed:

```bash
python script/get_github_commit.py \
  --input-dir my_data/collected_issues \
  --output-dir my_data/with_commits

# Or using short options:
python script/get_github_commit.py -i my_data/collected_issues -o my_data/with_commits
```

Arguments:

| Argument | Required | Description |
|----------|----------|-------------|
| `--input-dir`, `-i` | Yes | Directory containing JSON files with issues |
| `--output-dir`, `-o` | No | Output directory (default: `github_commits`) |

**Output:** Creates commit data files in the output directory.
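The core of this lookup — choosing the newest commit made at or before the issue's `closed_at` timestamp — can be sketched as a pure function over simplified commit records. The `sha`/`date` record shape and the helper name are illustrative assumptions, not the script's actual internals:

```python
from datetime import datetime

def _parse(ts):
    # datetime.fromisoformat (3.7+) rejects a trailing 'Z', so normalize it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def commit_at_close(commits, closed_at):
    """Given records like [{"sha": ..., "date": "2025-07-19T00:00:00Z"}, ...],
    return the sha of the newest commit at or before closed_at, else None."""
    cutoff = _parse(closed_at)
    eligible = [c for c in commits if _parse(c["date"]) <= cutoff]
    if not eligible:
        return None
    return max(eligible, key=lambda c: _parse(c["date"]))["sha"]
```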
### Step 3: Generate Satisfaction Conditions (Uses LLM)

Use an LLM to generate explicit criteria for issue resolution:

```bash
python script/scon_filter.py \
  --input-dir my_data/collected_issues \
  --output-dir my_data/with_scon

# With custom model and region:
python script/scon_filter.py \
  -i my_data/collected_issues \
  -o my_data/with_scon \
  --model us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
  --region us-west-2
```

Arguments:

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--input-dir`, `-i` | Yes | - | Directory containing JSON files with issues |
| `--output-dir`, `-o` | Yes | - | Output directory for issues with satisfaction conditions |
| `--model`, `-m` | No | claude-sonnet-4.5 | Bedrock model ID |
| `--region`, `-r` | No | us-west-2 | AWS region for Bedrock |

**Output:** Adds a `satisfaction_conditions` field:

```json
{
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue"
  ]
}
```
### Step 4: Classify Dockerizability (Uses LLM)

Classify issues by whether they need a Docker environment:

```bash
python script/docker_filter.py \
  --input-dir my_data/with_scon \
  --output-dir my_data/classified

# With custom region:
python script/docker_filter.py \
  -i my_data/with_scon \
  -o my_data/classified \
  --region us-east-1
```

Arguments:

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--input-dir`, `-i` | Yes | - | Directory containing JSON files with issues |
| `--output-dir`, `-o` | Yes | - | Output directory for classified issues |
| `--region`, `-r` | No | us-west-2 | AWS region for Bedrock |

Output structure:

```
my_data/classified/
├── need_docker/             # Issues that need Docker environment
├── no_need_docker/          # Documentation/config changes
├── need_docker_but_cannot/  # Hardware-specific issues
├── llm_responses/           # Raw LLM responses for debugging
└── processed_issues.json    # Resume checkpoint
```
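Because classification makes one LLM call per issue, `processed_issues.json` acts as a resume checkpoint: an interrupted run can skip issues it has already handled. A minimal sketch of that pattern — the checkpoint is assumed here to be a flat JSON list of issue URLs, which may differ from the script's actual format:

```python
import json
import os

def load_checkpoint(path):
    """Load the set of already-processed issue URLs, if the checkpoint exists."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))  # assumed format: a JSON list of URLs

def save_checkpoint(path, processed):
    """Persist processed issue URLs so an interrupted run can resume."""
    with open(path, "w") as f:
        json.dump(sorted(processed), f)
```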
### Step 5: Generate Dockerfiles (Uses Strands + LLM)

⚠️ This step requires Strands AI agents to automatically generate and test Dockerfiles:

```bash
# Option A: Using AWS Bedrock (Claude) - default
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
  --input-dir my_data/classified/need_docker \
  --languages python \
  --max-attempts 3 \
  --parallel 2 \
  --agent-timeout 180 \
  --issue-timeout 600

# Option B: Using OpenAI (GPT-5)
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
  --input-dir my_data/classified/need_docker \
  --languages python \
  --max-attempts 3 \
  --parallel 2 \
  --agent-timeout 180 \
  --issue-timeout 600 \
  --model-id gpt5 \
  --provider openai
```
**What happens:**
1. The Strands agent reads the issue and repository structure
2. The agent generates a Dockerfile based on the repo's build system
3. Docker builds the image to verify it works
4. If the build fails, the agent iterates with error feedback
5. On success, the Dockerfile is saved to the issue JSON

**Output:** Adds a `dockerfile` field:

```json
{
  "dockerfile": "FROM python:3.11-slim\n\nWORKDIR /workspace\n\nRUN apt-get update && apt-get install -y git\n\nRUN git clone https://github.com/owner/repo.git . && \\\n git checkout abc123def456\n\nRUN pip install -r requirements.txt\n\nCMD [\"pytest\", \"tests/\"]\n"
}
```
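Since each generated Dockerfile pins the repository to the issue's commit, the pin can be recovered from the `dockerfile` string with a regex. This sketch assumes the `git checkout <sha>` form shown above; `pinned_commit` is a hypothetical helper:

```python
import re

def pinned_commit(dockerfile):
    """Extract the commit pinned by a 'git checkout <sha>' line, if any."""
    m = re.search(r"git checkout ([0-9a-f]{7,40})", dockerfile)
    return m.group(1) if m else None
```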
### Step 6: Convert to Final Dataset

Combine all processed issues into a single JSONL file:

```bash
python script/convert_to_jsonl.py \
  --input-dir my_data/classified/need_docker \
  --output my_data/my_dataset.jsonl
```
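Conceptually, this conversion flattens the per-repo JSON arrays from earlier steps into one issue object per line. A minimal sketch of that step (not the script's actual implementation):

```python
import json
from pathlib import Path

def convert_to_jsonl(input_dir, output_path):
    """Flatten every JSON array in input_dir into a single JSONL file.
    Returns the number of issues written."""
    count = 0
    with open(output_path, "w") as out:
        for json_file in sorted(Path(input_dir).glob("*.json")):
            issues = json.loads(json_file.read_text())
            for issue in issues:
                out.write(json.dumps(issue) + "\n")
                count += 1
    return count
```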
## 🧪 End-to-End Example

Here's a complete walkthrough processing 2 test issues through the entire pipeline:

### Setup

```bash
cd CodeAssistBench

# Set up credentials (AWS Bedrock + GitHub)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2
export GITHUB_TOKEN=your_github_token
```

### Step 1: Create Test Data

Create a directory with sample issues:

```bash
mkdir -p test_pipeline/step1_raw
```

Create `test_pipeline/step1_raw/test_issues.json`:
[
{
"number": 1234,
"title": "How to handle async operations in Python?",
"created_at": "2025-07-15T10:30:00Z",
