LLM4CodeSummarization
Code for "Source Code Summarization in the Era of Large Language Models"
Environment
Our experiments run with Python 3.7 and PyTorch 1.6.0.
The other required packages can be installed with pip install -r requirements.txt.
Datasets
The datasets used in our experiments, including the human evaluation datasets, can be found here.
Build the Erlang, Haskell, and Prolog Datasets
The code for building the Erlang, Haskell, and Prolog datasets is in the dataset directory.
cd ./dataset
- Crawl data from GitHub
python crawl.py
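A rough sketch of what this crawling step could look like, assuming GitHub's REST repository-search API is used to find projects per language and only files with the matching extension are kept. The URL, query parameters, and extension map below are illustrative guesses, not the actual logic of crawl.py:

```python
import json
from urllib.request import Request, urlopen

# Hypothetical mapping from target language to source-file extension.
EXT_BY_LANG = {"erlang": ".erl", "haskell": ".hs", "prolog": ".pl"}

def search_url(language, page=1, per_page=50):
    """Build a GitHub repository-search URL for the given language."""
    return ("https://api.github.com/search/repositories"
            f"?q=language:{language}&sort=stars&page={page}&per_page={per_page}")

def is_target_file(path, language):
    """Keep only files whose extension matches the target language."""
    return path.endswith(EXT_BY_LANG[language])

def fetch_repos(language, token=None):
    """Fetch one page of candidate repositories (requires network access)."""
    req = Request(search_url(language),
                  headers={"Accept": "application/vnd.github+json"})
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urlopen(req) as resp:
        return json.load(resp)["items"]
```

An authenticated token is advisable in practice, since unauthenticated GitHub API requests are heavily rate-limited.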
- Extract <function, summary> pairs
python erlang.py
python haskell.py
python prolog.py
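As an illustration of the extraction step, here is a minimal heuristic for Erlang: treat a block of leading '%' comment lines immediately above a function head as that function's summary. The actual erlang.py, haskell.py, and prolog.py scripts may use different, language-specific heuristics:

```python
import re

# Matches an Erlang function head such as "add(A, B) ->".
FUNC_HEAD = re.compile(r"^([a-z]\w*)\(.*\)\s*->")

def extract_pairs(source):
    """Extract (function_name, summary) pairs from Erlang source text."""
    pairs, comment = [], []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("%"):
            # Accumulate consecutive comment lines as the candidate summary.
            comment.append(stripped.lstrip("% ").rstrip())
        elif FUNC_HEAD.match(stripped):
            # A function head directly below a comment block forms a pair.
            if comment:
                pairs.append((FUNC_HEAD.match(stripped).group(1),
                              " ".join(comment)))
            comment = []
        elif stripped:
            # Any other code line breaks the comment-function adjacency.
            comment = []
    return pairs
```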
Use LLMs for Code Summarization
- Call LLMs to generate comments
python run.py
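A minimal sketch of this generation step (run.py's actual prompt and client configuration may differ): build a zero-shot prompt per function and send it to a chat-completion endpoint. The prompt wording and model name below are assumptions:

```python
def build_prompt(code, language):
    """Assemble a hypothetical zero-shot summarization prompt."""
    return (f"Please generate a short comment in one sentence for the "
            f"following {language} function:\n\n{code}")

# Example call with the official openai client (assumed setup, needs an API key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": build_prompt(code, "Erlang")}],
# )
# raw_response = resp.choices[0].message.content
```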
- Extract comments from LLMs' responses
python beautify.py
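LLM responses often wrap the comment in code fences, quotes, or extra chatter. A guessed cleanup in the spirit of beautify.py (the real script may differ): strip markdown code fences and surrounding quotes, then keep the first non-empty line as the extracted comment.

```python
import re

def extract_comment(response):
    """Pull a single clean comment line out of a raw LLM response."""
    # Drop any fenced code blocks the model echoed back.
    text = re.sub(r"```.*?```", "", response, flags=re.DOTALL)
    for line in text.splitlines():
        line = line.strip().strip('"').strip()
        if line:
            return line
    return ""
```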
Evaluate with LLMs
- Evaluate with GPT-4 (used for RQ2-RQ5)
python evaluate.py
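A sketch of GPT-4-as-judge evaluation (the exact prompt used by evaluate.py is not shown here): ask the judge model to rate a generated comment on a 1-5 scale, then parse the first digit out of its reply. Both the rubric wording and the parsing rule are assumptions:

```python
import re

def build_judge_prompt(code, comment):
    """Build a hypothetical rating prompt for the judge model."""
    return ("Rate how well the following comment summarizes the code "
            "on a scale of 1 (worst) to 5 (best). Reply with a single number.\n\n"
            f"Code:\n{code}\n\nComment:\n{comment}")

def parse_score(reply):
    """Extract the first 1-5 digit from the judge's reply, or None."""
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None
```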
- Evaluate with LLMs on the human evaluation dataset (used for RQ1). The file human_eval_record_{language}.csv can be found here.
python llm-eval.py
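RQ1-style agreement between automatic scores and human judgments is typically measured with a rank correlation. A small self-contained Spearman sketch (llm-eval.py may instead rely on a library such as scipy, and this version assumes no tied scores):

```python
def _ranks(values):
    """Assign 1-based ranks by ascending value (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = float(rank + 1)
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation between two equal-length score lists."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mean = (n + 1) / 2  # mean rank; rank variances are equal without ties
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var
```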
Results
We upload the results of our experiments here, in which:
- the codesum directory contains LLMs' responses (.csv) and the comments (.txt) extracted from the responses
- the gpt-eval directory contains GPT-4's evaluation scores for RQ2-RQ5
- the RQ1 directory contains the human evaluation scores and the scores of each metric in RQ1
Figures
The directory ./figures contains examples of the five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, expert), which are not presented in the paper due to the page limit.
