# CodeCapybara: Open-Source LLaMA Model that Follows Instruction Tuning for Code Generation
We introduce CodeCapybara, a code-specialized instruction-following large language model. This repo also attempts to evaluate and reproduce the performance of existing code LLMs, such as LLaMA, Alpaca, and CodeAlpaca, on code generation benchmarks (HumanEval and MBPP).
- First attempt to reproduce LLaMA results on widely recognized code generation benchmarks.
- CodeCapybara is fine-tuned from LLaMA 7B; larger models will be available soon. You can find our checkpoints in the Checkpoint Release section below.
- We fine-tune LLaMA in an instruction-tuning style on our own larger-scale, more diverse dataset.
- Improved evaluation results on HumanEval in comparison to LLaMA, Alpaca, and CodeAlpaca.
- Full transparency with open-source availability: all scripts and models are accessible to the community. We encourage you to contribute to CodeCapybara and help advance the field of code generation.
## Table of Contents
- Overview
- Data Collection
- Instruction Tuning
- Results
- Data Release
- Checkpoint Release
- Installation
- Usage
## Overview
We follow several recent instruction-tuning techniques to collect data and train an instruction-following model that can generate executable code from a human language description.

Our process for training CodeCapybara can be divided into two stages:
- Data Collection: We collect data generated through OpenAI's `gpt-3.5-turbo`, as well as supervised code generation datasets.
- Instruction Tuning: We fine-tune our model from MetaAI's LLaMA checkpoint with parameter-efficient fine-tuning methods.
## Data Collection
In this stage, we follow previous works to collect instruction data. To ensure the quality of the code used in the fine-tuning stage, we make some modifications to the Self-Instruct data generation procedure.
<!--
| Data source | No. samples |
|-|-|
| Only Instruction Generation | 20,574 |
| CodeAlpaca | 20,022 |
| DeepMind's Code Contests | 13,328 |
| **Total** | **53,924** |
-->

### Only Instruction Generation
To ensure the quality of the code used later as targets in the fine-tuning step, we leverage an unsupervised dataset that only contains code snippets crawled from open sources. We then design a prompt that asks `gpt-3.5-turbo` to generate a corresponding instruction for each code snippet. In other words, to obtain an (instruction, output) pair, we ask `gpt-3.5-turbo` to generate the instruction given the output, a human-written code snippet.

Our unsupervised dataset contains code functions covering a wide range of programming problems in 10 programming languages: Python, JavaScript, Java, Golang, Ruby, Rust, PHP, C, C++, and C#.

We obtain our dataset through the `gpt-3.5-turbo` OpenAI API. Each instruction-output pair is generated through two rounds of API calls.
- In the first round, we include a code function (i.e., the output) in the prompt and ask `gpt-3.5-turbo` to generate a corresponding instruction.
- In the second round, since a code function alone does not guarantee an executable program, we include both the generated instruction and the code function in a new prompt and ask the model to produce an executable program, with libraries imported and dependencies implemented, around the given code function.
- Our prompt template can be found here.
- Our script for the two rounds of data generation can be found here.
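The two-round scheme above amounts to building two prompts per code snippet. A minimal sketch of the prompt construction (the wording and helper names here are illustrative, not the authors' actual templates):

```python
def build_round1_prompt(code_function: str) -> str:
    """Round 1: ask the model to write an instruction for a given code snippet."""
    return (
        "Below is a code function. Write a concise natural-language "
        "instruction that this function implements.\n\n"
        f"### Code:\n{code_function}\n\n### Instruction:"
    )

def build_round2_prompt(instruction: str, code_function: str) -> str:
    """Round 2: ask for a fully executable program (imports and
    dependency implementations) built around the original function."""
    return (
        "Given the instruction and the code function below, produce a complete, "
        "executable program: import all required libraries and implement any "
        "missing dependencies.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Code:\n{code_function}\n\n### Program:"
    )

snippet = "def add(a, b):\n    return a + b"
round1 = build_round1_prompt(snippet)
# The model's round-1 reply (an instruction) is then fed into round 2:
round2 = build_round2_prompt("Write a function that adds two numbers.", snippet)
```

Each prompt would then be sent to `gpt-3.5-turbo` through the OpenAI API, with the round-1 completion inserted into the round-2 prompt.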
### Code Alpaca
For the second source of data, our intention is to follow the Self-Instruct paper and generate a variety of code problems in the (Instruction, Input, Output) format from a seed dataset.

We reuse the generated instruction data from Code Alpaca to reduce API costs, since their procedure closely matches our purpose.
### DeepMind's Code Contests
We also leverage supervised code generation datasets. Various high-quality datasets exist, such as APPS (5,000 problems in the train split) and MBPP (500 problems in the train split).

In this version, we select DeepMind's Code Contests dataset, which contains competitive programming problems with detailed descriptions and test cases. The train split we employ to fine-tune our model contains 13,328 problems, which result in 51,766 instruction-output pairs.
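The expansion from 13,328 problems to 51,766 pairs follows naturally if each problem description becomes an instruction and each accepted solution becomes a separate output. A sketch of that flattening (the field names are illustrative, not the actual dataset schema):

```python
def problems_to_pairs(problems):
    """Turn each (description, solutions) problem into one
    instruction-output pair per solution."""
    pairs = []
    for problem in problems:
        for solution in problem["solutions"]:
            pairs.append({"instruction": problem["description"],
                          "output": solution})
    return pairs

problems = [
    {"description": "Print the sum of two integers.",
     "solutions": ["print(sum(map(int, input().split())))",
                   "a, b = map(int, input().split())\nprint(a + b)"]},
    {"description": "Reverse a string.",
     "solutions": ["print(input()[::-1])"]},
]
pairs = problems_to_pairs(problems)  # 2 problems -> 3 pairs
```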
## Instruction Tuning
We tried two approaches to fine-tune the LLaMA-7B checkpoint on the collected data:
- Full-parameter Fine-tuning
- Parameter-efficient Fine-tuning with HuggingFace's PEFT
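For intuition on why the second approach is parameter-efficient: LoRA (the adapter method we use via PEFT, as the CodeCapybara-LoRA checkpoint indicates) keeps the pretrained weight W frozen and trains two small factors B (d x r) and A (r x d), so the effective weight is W + (alpha/r)·BA. A minimal numeric sketch in plain Python (illustrative values, not the PEFT API):

```python
def matmul(X, Y):
    """Naive matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r, alpha = 8, 2, 4                  # hidden size, LoRA rank, LoRA scaling
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weight (identity here)
B = [[0.5] * r for _ in range(d)]      # trainable factor, shape (d, r)
A = [[0.25] * d for _ in range(r)]     # trainable factor, shape (r, d)

delta = matmul(B, A)                   # low-rank update, rank <= r
scale = alpha / r
W_eff = [[W[i][j] + scale * delta[i][j] for j in range(d)] for i in range(d)]

full_params = d * d                    # 64 parameters if we fine-tuned W directly
lora_params = d * r + r * d            # only 32 trained; the gap widens as d grows
```

At LLaMA scale (d = 4096), a rank-8 adapter trains roughly 65K parameters per matrix instead of about 16.8M, which is what makes single-GPU fine-tuning practical.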
Please refer to the Checkpoint Release section to access our checkpoints.
## Results
We evaluate our models, and reproduce other models' results, on two benchmarks: HumanEval and MBPP. All numbers are reported in the zero-shot setting.
### HumanEval Results
| Model | Base checkpoint | pass@1 | pass@10 | pass@100 |
| - | - | - | - | - |
| LLaMA | decapoda-research/llama-7b-hf | 10.70 | 13.29 | 13.41 |
| LLaMA | huggyllama/llama-7b | 9.70 | 12.66 | 12.80 |
| Alpaca-LoRA | decapoda-research/llama-7b-hf | 8.00 | 10.00 | 10.37 |
| CodeCapybara-LoRA | decapoda-research/llama-7b-hf | 9.61 | 11.62 | 12.02 |
| CodeCapybara | huggyllama/llama-7b | 11.10 | 13.33 | 13.41 |
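pass@k numbers like those above are conventionally computed with the unbiased estimator from the Codex (HumanEval) paper: with n generations per problem, c of which pass the tests, pass@k = 1 − C(n−c, k)/C(n, k), averaged over problems. A sketch, assuming this standard estimator (the number of generations n per problem is not stated in this README):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so every draw of k must include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 20 generations per problem and 2 passing, pass@1 is 2/20 = 0.1;
# the benchmark score is this value averaged over all problems.
score = pass_at_k(20, 2, 1)
```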
### MBPP Results
## Data Release
We release our data as well as the other data sources used to train our models:
- Our Instruction Only Generation data
- Code Alpaca data
- Deepmind's CodeContests hosted on HuggingFace
## Checkpoint Release
We release our checkpoints hosted on HuggingFace:
- CodeCapybara - Full-parameter Fine-tuning
- CodeCapybara-LoRA - Parameter-efficient Fine-tuning
## Installation
```shell
conda create -n codecapybara -y
conda activate codecapybara
conda install pip -y
pip install -r requirements.txt
```
## Usage
Let's define a function that converts an instruction and an optional input into a single prompt to pass to `model.generate`.
```python
def generate_prompt(instruction, input=None):
    # Templates used by Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca
    if input is not None:
        prompt = f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    else:
        prompt = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
    return prompt
```
### Loading model
You can choose to load either the full-parameter CodeCapybara model or CodeCapybara-LoRA.

#### Loading CodeCapybara
```python
import sys

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("Fsoft-AIC/CodeCapybara")
model = LlamaForCausalLM.from_pretrained("Fsoft-AIC/CodeCapybara",
                                         load_in_8bit=True,
                                         torch_dtype=torch.float16,
                                         device_map="auto")
model.config.pad_token_id = tokenizer.pad_token_id = 0
model.config.bos_token_id = 1
model.config.eos_token_id = 2

model.eval()
if torch.__version__ >= "2" and sys.platform != "win32":
    model = torch.compile(model)
```
#### Loading CodeCapybara-LoRA
```python
import sys

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM
from peft import PeftModel

# Load the base checkpoint (decapoda-research/llama-7b-hf, per the table above),
# then attach the LoRA adapter from the Checkpoint Release section.
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf",
                                         load_in_8bit=True,
                                         torch_dtype=torch.float16,
                                         device_map="auto")
model = PeftModel.from_pretrained(model, "Fsoft-AIC/CodeCapybara-LoRA")
```
