CodePlan
Data and evaluation scripts for "CodePlan: Repository-level Coding using LLMs and Planning", FSE 2024
Replication Package
For the paper titled "CodePlan: Repository-level Coding using LLMs and Planning", accepted FSE 2024.
Pre-print: CodePlan paper
Contents
This package contains the following:
data/: Benchmark of repository edits along with the outputs of our approach and baselines.
scripts/: Scripts to compute the key metrics presented in the paper.
Data
Data for each repository edit is in a separate directory, named as in the paper.
Each of these contains three sub-directories: (1) source: the repository before the edit, (2) target: the repository after the ground-truth edit, and (3) pred: the outputs of our approach and the baselines.
pred contains a sub-directory for each approach, where each sub-directory contains repo: the state of the repository after the edits, blocks: the matched, missed, and spurious blocks in the output, metrics.json: all metrics computed for the repo, and diff.html: a pretty-printed textual diff of the output against the source (along with the diff between source and target for comparison).
data
|-- t1
|   |-- source
|   |-- target
|   |-- pred
|       |-- codeplan
|       |   |-- repo
|       |   |-- blocks
|       |   |-- metrics.json
|       |   |-- diff.html
|       |-- repair
|           |-- repo
|           |-- blocks
|           |-- diff.html
|-- ...
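The matched, missed, and spurious counts under blocks can be understood as set operations over code blocks: a predicted block that also appears in the target is matched, a target block absent from the prediction is missed, and a predicted block not in the target is spurious. A minimal sketch of this accounting (the function and the exact-match comparison here are illustrative, not the repository's actual matching logic):

```python
def block_metrics(pred_blocks, target_blocks):
    """Compare predicted and ground-truth code blocks as sets.

    Illustrative sketch only: the real evaluation may match blocks
    more loosely than exact set membership.
    """
    pred, target = set(pred_blocks), set(target_blocks)
    return {
        "matched": len(pred & target),   # present in both
        "missed": len(target - pred),    # ground truth the approach failed to produce
        "spurious": len(pred - target),  # extra edits not in the ground truth
    }

# Hypothetical example with three predicted and three target blocks:
print(block_metrics(["a", "b", "c"], ["a", "b", "d"]))
# {'matched': 2, 'missed': 1, 'spurious': 1}
```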
Quick Start!
To inspect the metrics for a particular repository edit, just navigate to its directory and open metrics.json.
To inspect the code change, you can similarly open diff.html.
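A metrics.json file can also be loaded and summarized with a few lines of Python. The keys below are hypothetical placeholders; inspect an actual file (e.g. one under data/t1/pred/codeplan/) for the exact schema. The snippet creates a stand-in file so it is self-contained:

```python
import json
from pathlib import Path

# In the actual package you would point this at a file such as
# data/t1/pred/codeplan/metrics.json; here we create a stand-in
# file with hypothetical keys purely for illustration.
path = Path("metrics.json")
path.write_text(json.dumps({"matched": 10, "missed": 2, "spurious": 1}))

metrics = json.loads(path.read_text())
for name, value in metrics.items():
    print(f"{name}: {value}")
```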
Replicating results
Code Setup
The scripts require python>=3.11 along with the following packages -
tqdm
evaluate
textdistance
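Assuming a Python >= 3.11 interpreter is on the PATH, the packages listed above can be installed with pip (version pinning is optional):

```shell
# Install the dependencies required by the evaluation scripts
python -m pip install tqdm evaluate textdistance
```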
Computing metrics
The main script for computing metrics is scripts/eval.py with the following options -
usage: eval.py [-h] [--repo {ext1,ext2,t1,t2,t3}] [--approach APPROACH] [--all]
               [--levenstein] [--save_default] [--save_path SAVE_PATH]
               [--verbose] [--debug]

options:
  -h, --help            show this help message and exit
  --repo {ext1,ext2,t1,t2,t3}, -r {ext1,ext2,t1,t2,t3}
                        Repo name to compute metrics for.
  --approach APPROACH, -a APPROACH
                        Approach name to compute metrics for.
  --all                 Compute metrics for all approaches on all repos.
  --levenstein          Compute levenstein distance metric. Note that this may
                        take a really long time.
  --save_default        Save metrics to default location.
  --save_path SAVE_PATH, -s SAVE_PATH
                        Path to save json with all computed metrics to.
  --verbose, -v         Enable verbose (info) logging.
  --debug, -d           Enable debug logging.
Note that the approach name must match one of the sub-directories present within the pred directory of the repo being evaluated.
For example, to compute metrics for approach codeplan on repo t1 and save the results to t1_codeplan_stats.json, the following command can be used -
python scripts/eval.py --repo t1 --approach codeplan --save_path t1_codeplan_stats.json
This will compute text metrics (DiffBLEU, Levenshtein Distance) and block metrics (matched, missed, spurious), print out a summary corresponding to a row in Table 3 of the paper, and store detailed file-wise metrics at the provided path.
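For reference, the Levenshtein distance behind the --levenstein option is the minimum number of single-character insertions, deletions, and substitutions turning one string into another. The script presumably obtains it from the textdistance package; the sketch below is only the textbook dynamic-programming definition of the metric:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance over insert/delete/substitute operations."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The quadratic running time of this computation is why the help text warns that --levenstein "may take a really long time" on whole files.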
