CodeSearchNet
Datasets, tools, and benchmarks for representation learning of code.
The CodeSearchNet challenge has been concluded
We would like to thank all participants for their submissions, and we hope that this challenge provided practitioners and researchers with insights into the challenges of semantic code search and motivated new research. We encourage everyone to continue using the dataset and the human evaluations, which we now provide publicly. Please see below for details, specifically the Evaluation section.
No new submissions to the challenge will be accepted.
Quickstart
If this is your first time reading this, we recommend skipping this section and reading the following sections. The commands below assume you have Docker and Nvidia-Docker, as well as a GPU that supports CUDA 9.0 or greater. Note: you should only have to run script/setup once to download the data.
```bash
# clone this repository
git clone https://github.com/github/CodeSearchNet.git
cd CodeSearchNet/

# download data (~3.5GB) from S3; build and run the Docker container
script/setup

# this will drop you into the shell inside a Docker container
script/console

# optional: log in to W&B to see your training metrics,
# track your experiments, and submit your models to the benchmark
wandb login

# verify your setup by training a tiny model
python train.py --testrun

# see other command line options, try a full training run with default values,
# and explore other model variants by extending this baseline script
python train.py --help
python train.py

# generate predictions for model evaluation
python predict.py -r github/CodeSearchNet/0123456  # this is the org/project_name/run_id
```
Finally, you can submit your run to the community benchmark by following these instructions.
Introduction
Project Overview
CodeSearchNet is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language. This research is a continuation of some ideas presented in this blog post and is a joint collaboration between GitHub and the Deep Program Understanding group at Microsoft Research - Cambridge. We aim to provide a platform for community research on semantic code search via the following:
- Instructions for obtaining large corpora of relevant data
- Open source code for a range of baseline models, along with pre-trained weights
- Baseline evaluation metrics and utilities
- Mechanisms to track progress on a shared community benchmark hosted by Weights & Biases
We hope that CodeSearchNet is a step towards engaging with the broader machine learning and NLP community regarding the relationship between source code and natural language. We describe a specific task here, but we expect and welcome other uses of our dataset.
More context regarding the motivation for this problem is in this technical report. Please cite the dataset and the challenge as
```bibtex
@article{husain2019codesearchnet,
  title={{CodeSearchNet} challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```
Data
The primary dataset consists of 2 million (comment, code) pairs from open source libraries. Concretely, a comment is a top-level function or method comment (e.g. docstrings in Python), and code is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java, and PHP code. Throughout this repo, we refer to the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently this is the only dataset on which we train our model. Summary statistics about this dataset can be found in this notebook.
For more information about how to obtain the data, see this section.
Evaluation
The metric we use for evaluation is Normalized Discounted Cumulative Gain (NDCG). Please refer to this paper for further details regarding model evaluation. The evaluation script can be found here.
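For intuition, the sketch below shows one common way NDCG can be computed for a single query from a ranked list of graded relevance scores. It is only an illustration of the metric (using linear gains), not the official evaluation script linked above, and the relevance values are made up.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """Normalize DCG by the DCG of the ideal (best possible) ordering."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance judgments for the results a model returned for one
# query, listed in the order the model ranked them. Closer to 1.0 is better.
print(ndcg([3, 0, 2, 1]))
```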
Annotations
We manually annotated retrieval results for the six languages from 99 general queries. This dataset is used as ground-truth data for evaluation only. Please refer to this paper for further details on the annotation process. These annotations were used to compute the scores in the leaderboard. Now that the competition has concluded, you can find the annotations, along with the annotator comments, here.
Setup
You should only have to perform the setup steps once to download the data and prepare the environment.
- Due to the complexity of installing all dependencies, we prepared Docker containers to run this code. You can find instructions on how to install Docker in the official docs. Additionally, you must install Nvidia-Docker to satisfy GPU-compute related dependencies. For those who are new to Docker, this blog post provides a gentle introduction focused on data science.
- After installing Docker, you need to download the pre-processed datasets, which are hosted on S3. You can do this by running script/setup. This will build Docker containers and download the datasets. By default, the data is downloaded into the resources/data/ folder inside this repository, with the directory structure described here. The datasets you will download (most of them compressed) have a combined size of only ~3.5 GB.
- To start the Docker container, run script/console. This will land you inside the Docker container, starting in the /src directory. You can detach from/attach to this container to pause/continue your work.
For more about the data, see Data Details below, as well as this notebook.
Data Details
Data Acquisition
If you have run the setup steps above you will already have the data, and nothing more needs to be done. The data will be available in the /resources/data folder of this repository, with this directory structure.
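As a quick sanity check, the snippet below lists a few of the downloaded files so you can see which languages and splits are present. The nested layout (per-language folders containing gzipped jsonlines files for each partition) is an assumption based on the default download location; consult the linked directory structure if your paths differ.

```python
from pathlib import Path

# Assumed default download location after running script/setup.
data_root = Path("resources/data")

# Each language ships as gzipped jsonlines files, one set per partition.
for path in sorted(data_root.glob("**/*.jsonl.gz"))[:5]:
    print(path.relative_to(data_root))
```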
Schema & Format
Data is stored in jsonlines format. Each line in the uncompressed file represents one example (usually a function with an associated comment). A prettified example of one row is illustrated below.
- repo: the owner/repo
- path: the full path to the original file
- func_name: the function or method name
- original_string: the raw string before tokenization or parsing
- language: the programming language
- code: the part of the original_string that is code
- code_tokens: tokenized version of code
- docstring: the top-level comment or docstring, if it exists in the original string
- docstring_tokens: tokenized version of docstring
- sha: this field is not being used [TODO: add note on where this comes from?]
- partition: a flag indicating what partition this datum belongs to of {train, valid, test, etc.} This is not used by the model. Instead we rely on directory structure to denote the partition of the data.
- url: the url for the code snippet including the line numbers
Code, comments, and docstrings are extracted in a language-specific manner, removing artifacts of that language.
```
{
    'code': 'def get_vid_from_url(url):\n'
            '    """Extracts video ID from URL.\n'
            '    """\n'
    "
```
