Monitors4codegen

Code and data artifact for the NeurIPS 2023 paper "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context". `multilspy` is an LSP client library in Python intended for building applications around language servers.

Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context

Alternative title: Guiding Language Models of Code with Global Context using Monitors

Introduction

This repository hosts the official code and data artifact for the paper "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context", appearing at NeurIPS 2023 (titled "Guiding Language Models of Code with Global Context using Monitors" on arXiv). The work introduces Monitor-Guided Decoding (MGD) for code generation with Language Models, in which a monitor uses static analysis to guide the decoding.

Repository Contents

  1. Datasets: PragmaticCode and DotPrompts
  2. Evaluation scripts: Scripts to evaluate LMs by taking as input inferences (code generated by the model) for examples in DotPrompts and producing score@k scores for the metrics reported in the paper: Compilation Rate (CR), Next-Identifier Match (NIM), Identifier-Sequence Match (ISM) and Prefix Match (PM).
  3. Inference Results over DotPrompts: Generated code for examples in DotPrompts with various model configurations reported in the paper. The graphs and tables reported in the paper can be reproduced by running the evaluation scripts on the provided inference results.
  4. multilspy: A cross-platform library designed to simplify the process of creating language server clients to query and obtain results of various static analyses from a wide variety of language servers that communicate over the Language Server Protocol. multilspy is intended to be used as a library to easily query various language servers, without having to worry about setting up their configurations and implementing the client-side of language server protocol. multilspy currently supports running language servers for Java, Rust, C# and Python, and we aim to expand this list with the help of the community.
  5. Monitor-Guided Decoding: Implementation of various monitors monitoring for different properties reported in the paper (for example: monitoring for type-valid identifier dereferences, monitoring for correct number of arguments to method calls, monitoring for typestate validity of method call sequences, etc.), spanning 3 programming languages.

The multilspy library has now been migrated to microsoft/multilspy.

Monitor-Guided Decoding: Motivating Example

For example, consider the partial code to be completed in the figure below. To complete this code, an LM has to generate identifiers consistent with the type of the object returned by ServerNode.Builder.newServerNode(). The method newServerNode and its return type, class ServerNode.Builder, are defined in another file. If an LM does not have information about the ServerNode.Builder type, it ends up hallucinating, as can be seen in the example generations with the text-davinci-003 and SantaCoder models. The completion uses identifiers host and port, which do not exist in the type ServerNode.Builder. The generated code therefore results in “symbol not found” compilation errors.

MGD uses static analysis to guide the decoding of LMs, to generate code following certain properties. In the example, MGD is used to monitor for generating code with type-correct dereferences, and the SantaCoder model with the same prompt is able to generate the correct code completion, which compiles and matches the ground truth as well.
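The idea of guiding decoding with a monitor can be illustrated with a toy sketch. This is not the repository's implementation: the vocabulary, the logits, and the member names `withIp`, `withPort` and `build` are made up for illustration, and a plain prefix check stands in for the static analysis. At each step, the monitor masks out every token that cannot extend the partial identifier towards a member reported as valid.

```python
import math

def monitor_mask(vocab, valid_identifiers, typed_so_far=""):
    """Return ids of tokens that keep the partial identifier consistent
    with at least one identifier the (hypothetical) static analysis allows."""
    allowed = set()
    for token_id, token in enumerate(vocab):
        candidate = typed_so_far + token
        if any(name.startswith(candidate) for name in valid_identifiers):
            allowed.add(token_id)
    return allowed

def guided_greedy_step(logits, allowed_ids):
    """Pick the highest-scoring token among those the monitor allows."""
    masked = [s if i in allowed_ids else -math.inf for i, s in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

# Toy vocabulary and logits; the unguided model prefers "host".
vocab = ["host", "with", "Ip", "Port", "port", "("]
logits = [2.5, 1.0, 0.3, 0.2, 2.0, 0.1]

# Hypothetical members of ServerNode.Builder, as a static analysis might report them.
valid = ["withIp", "withPort", "build"]

# Without the monitor, greedy decoding picks the hallucinated identifier.
unguided = vocab[max(range(len(logits)), key=logits.__getitem__)]

# With the monitor, only tokens that lead to a valid member survive the mask.
first = guided_greedy_step(logits, monitor_mask(vocab, valid))
second = guided_greedy_step(logits, monitor_mask(vocab, valid, vocab[first]))
completion = vocab[first] + vocab[second]

print(unguided, completion)
```

The sketch reuses the same logits for both steps and ignores tokens that run past the end of an identifier (such as `Ip(`); a real monitor would query the language server for valid members and handle tokenizer boundaries.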

As reported in the paper, we observe that MGD can improve the compilation rate of code generated by LMs at all scales (350M-175B) by 19-25%, without any training/fine-tuning required. Further, it boosts the ground-truth match at all granularities from token-level to method-level code completion.

1. Datasets

Dataset Statistics

| Statistic | Count |
|--------------|:-----:|
| Number of repositories in PragmaticCode | 100 |
| Number of methods in DotPrompts | 1420 |
| Number of examples in DotPrompts | 10538 |

PragmaticCode

PragmaticCode is a dataset of real-world open-source Java projects complete with their development environments and dependencies (through their respective build systems). The authors tried to ensure that all the repositories in PragmaticCode were released publicly only after the determined training dataset cutoff date (31 March 2022) for the CodeGen, SantaCoder and text-davinci-003 family of models, which were used to evaluate MGD.

The full dataset, along with repository zip files, is available in our Zenodo dataset release at https://zenodo.org/records/10072088. The list of repositories that constitute PragmaticCode, along with their respective licenses, is available in datasets/PragmaticCode/repos.csv. The contents of the files required for inference for each repository are available in datasets/PragmaticCode/fileContentsByRepo.json.

DotPrompts

DotPrompts is a set of examples derived from PragmaticCode, such that each example consists of a prompt to a dereference location (a code location having the "." operator in Java). DotPrompts can be used to benchmark Language Models of Code on their ability to utilize repository level context to generate code for method-level completion tasks. The task for the models is to complete a partially written Java method, utilizing the full repository available from PragmaticCode. Since all the repositories in PragmaticCode are buildable, DotPrompts (derived from PragmaticCode) supports Compilation Rate as a metric of evaluation for generated code, apart from standard metrics of ground truth match like Next-Identifier Match, Identifier Sequence Match and Prefix Match.
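As a rough illustration of the ground-truth match metrics named above, the sketch below gives one plausible reading of them. The function names, the identifier regex and the normalization are assumptions made here; the evaluation scripts and the paper remain the authoritative definitions.

```python
import re

# Assumption: an identifier is a Java-style name; keywords are not filtered out.
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")

def identifiers(code):
    """Extract identifier-like tokens in order of appearance."""
    return IDENTIFIER.findall(code)

def common_prefix_len(a, b):
    """Length of the longest common prefix of two sequences."""
    matched = 0
    for x, y in zip(a, b):
        if x != y:
            break
        matched += 1
    return matched

def next_identifier_match(generated, truth):
    """True when the first identifier of both strings is the same."""
    gen, ref = identifiers(generated), identifiers(truth)
    return bool(gen) and bool(ref) and gen[0] == ref[0]

def identifier_sequence_match(generated, truth):
    """Fraction of the ground-truth identifier sequence matched as a prefix."""
    gen, ref = identifiers(generated), identifiers(truth)
    return common_prefix_len(gen, ref) / len(ref) if ref else 0.0

def prefix_match(generated, truth):
    """Fraction of ground-truth characters matched as a prefix, ignoring whitespace."""
    gen, ref = "".join(generated.split()), "".join(truth.split())
    return common_prefix_len(gen, ref) / len(ref) if ref else 0.0

# Made-up completion pair for illustration.
truth = "withIp(ip).withPort(port).build();"
generated = "withIp(ip).withHost(host).build();"

print(next_identifier_match(generated, truth))
print(identifier_sequence_match(generated, truth))
print(prefix_match(generated, truth))
```

In this toy pair the first identifier agrees, two of the five ground-truth identifiers are matched before the sequences diverge, and the character-level prefix stops at `withIp(ip).with`.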

The scenario described in the motivating example above is an example from DotPrompts.

The complete description of an example in DotPrompts is a tuple - (repo, classFileName, methodStartIdx, methodStopIdx, dot_idx). The dataset is available at datasets/DotPrompts/dataset.csv.
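To make the tuple concrete, here is a minimal sketch of how the indices could be used to split a file into a prompt and a ground-truth completion. It assumes, without confirmation from this README, that the indices are 0-based character offsets into the file contents; `split_example` and the toy file are hypothetical and not part of the repository.

```python
def split_example(file_text, method_start_idx, method_stop_idx, dot_idx):
    """Split a file into prompt and ground-truth completion for one example.

    Assumes all indices are 0-based character offsets into file_text.
    """
    if not (method_start_idx < dot_idx < method_stop_idx):
        raise ValueError("dot_idx must fall inside the method body")
    if file_text[dot_idx] != ".":
        raise ValueError("dot_idx does not point at a '.' character")
    # Everything up to and including the '.' is the prompt.
    prompt = file_text[: dot_idx + 1]
    # The rest of the method, through its closing '}', is the ground truth.
    ground_truth = file_text[dot_idx + 1 : method_stop_idx + 1]
    return prompt, ground_truth

# Toy file contents standing in for one file of a PragmaticCode repository.
toy_file = "class A { int f(B b) { return b.size(); } }"
start = toy_file.index("{", toy_file.index("f("))  # opening '{' of method f
stop = toy_file.rindex("}", 0, len(toy_file) - 1)  # closing '}' of method f
dot = toy_file.index(".")                          # the dereference location

prompt, truth = split_example(toy_file, start, stop, dot)
print(repr(prompt))
print(repr(truth))
```

In the real dataset, the file text would come from the repository contents identified by `repo` and `classFileName`.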

2. Evaluation Scripts

Environment Setup

We use the Python packages listed in requirements.txt. Our experiments used Python 3.10. We recommend installing Python 3.10 and the dependencies in an isolated virtual environment. To create a virtual environment using venv:

python3 -m venv venv_monitors4codegen
source venv_monitors4codegen/bin/activate

or using conda:

conda create -n monitors4codegen python=3.10
conda activate monitors4codegen

Further details and instructions on creating Python virtual environments can be found in the official documentation. We also refer users to Miniconda as an alternative to the steps above for creating the virtual environment.

To install the requirements for running evaluations as described below:

pip3 install -r requirements.txt

Running the evaluation script

The evaluation script can be run as follows:

python3 eval_results.py <path to inference results - csv> <path to PragmaticCode filecontents - json> <path to output directory>

The above command will create a directory <path to output directory>, containing all the graphs and tables reported in the paper along with extra details. The command also generates a report in the output directory, named Report.md which relates the generated figures to sections in the paper.

To check that the environment has been set up correctly, run the command below, which runs the evaluation script over dummy data (included in inference_results/dotprompts_results_sample.csv). If the command fails, the environment setup is likely at fault; please report the failure to the authors.

python3 evaluation_scripts/eval_results.py inference_results/dotprompts_results_sample.csv datasets/PragmaticCode/fileContentsByRepo.json results_sample/

Description of inference results csv file format

Description of expected columns in the inference results csv input to the evaluation script:

  • repo: Name of the repository from which the testcase was sourced
  • classFileName: relative path to file containing the testcase prompt location
  • methodStartIdx: String index of starting '{' of the method
  • methodStopIdx: String index of closing '}' of the method
  • dot_idx: String index of '.' that is the dereference prompt point
  • configuration: Identifies the configuration used to generate the given code sample. Values from: ['SC-classExprTypes', 'CG-6B', 'SC-FIM-classExprTypes', 'SC-RLPG-MGD', 'SC-MGD', 'SC-FIM-classExprTypes-MGD', 'CG-2B', 'SC', 'CG-2B-MGD', 'CG-350M-classExprTypes-MGD', 'SC-FIM', 'TD-3', 'CG-350M-MGD', 'SC-FIM-MGD', 'SC-RLPG', 'CG-350M', 'CG-350M-classExprTypes', 'SC-classExprTypes-MGD', 'CG-6B-MGD', 'TD-3-MGD']
  • temperature: Temperature used for sampling. Values from: [0.8, 0.6, 0.4, 0.2]
  • model: Name of the model used for sampling. Values from: ['Salesforce/codegen-6B-multi', 'bigcode/santacoder', 'Salesforce/codegen-2B-multi', 'Salesforce/codegen-350M-multi', 'text-davinci-003']
  • context: Decoding strategy used. Values from: ['autoregressive', 'fim']
  • prefix: Prompt strategy used. Values from: ['classExprTypes', 'none', 'rlpg']
  • rlpg_best_rule_name: Name of the rule used for creating RLPG prompt (if used for the corresponding testcase). Values from: [nan, 'in_file#lines#0.25', 'in_file#lines#0.5', 'in_file#lines#0.75', 'import_file#method_names#0.5']
  • `outp