<a href="https://arxiv.org/pdf/2306.00029.pdf">Technical Report</a>, <a href="https://opensource.salesforce.com/CodeTF/latest/index.html">Documentation</a>, <a href="https://github.com/salesforce/CodeTF/tree/main/test_inference">Examples</a>,

CodeTF - A One-stop Transformer Library for State-of-the-art Code LLM

</div>

Introduction
Installation
Getting Started
Ethical and Responsible Use
License

Introduction

CodeTF is a one-stop, transformer-based Python library for code large language models (Code LLMs) and code intelligence. It provides a seamless interface for training and inference on code intelligence tasks such as code summarization, translation, and generation, and it aims to make it easy to integrate state-of-the-art (SOTA) Code LLMs into real-world applications.

In addition to the core LLM features for code, CodeTF offers utilities for code manipulation across various languages, including easy extraction of code attributes. Using tree-sitter as its core AST parser, it can parse attributes such as function names, comments, and variable names. Pre-built libraries for numerous languages are provided, eliminating the need for complicated parser setup. CodeTF thus ensures a user-friendly and accessible environment for code intelligence tasks.
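To make "extraction of code attributes" concrete, here is a minimal sketch that pulls function names, assigned variable names, and comments out of a Python snippet. It deliberately uses Python's standard `ast` and `tokenize` modules as a stand-in for tree-sitter, so it is not CodeTF's utility API; the function name `extract_attributes` is hypothetical and chosen only for illustration.

```python
import ast
import io
import tokenize


def extract_attributes(source: str) -> dict:
    """Hypothetical helper: collect basic attributes from Python source code."""
    tree = ast.parse(source)

    # Function names come from (async) function definition nodes.
    function_names = [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

    # Variable names are identifiers that appear as assignment targets.
    variable_names = sorted(
        {
            node.id
            for node in ast.walk(tree)
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
        }
    )

    # Comments are not part of the AST, so read them from the token stream.
    comments = [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT
    ]

    return {
        "function_names": function_names,
        "variable_names": variable_names,
        "comments": comments,
    }


sample = '''
# Add two numbers.
def add(a, b):
    total = a + b  # running sum
    return total
'''

attrs = extract_attributes(sample)
print(attrs)
```

A tree-sitter based parser follows the same idea (walk the syntax tree and collect nodes of interest) but works across many languages rather than Python only.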

The current version of the library offers:

Fast Model Serving: We support an easy-to-use interface for rapid inference with pre-quantized models (int8, int16, float16). CodeTF handles all aspects of device management, so users do not have to manage devices themselves. For large models, we offer advanced features such as weight sharding across GPUs to serve the models more quickly.
Fine-Tuning Your Own Models: We provide an API for quickly fine-tuning your own LLMs for code using SOTA techniques for parameter-efficient fine-tuning (HuggingFace PEFT) in distributed environments.
Supported Tasks: nl2code, code summarization, code completion, code translation, code refinement, clone detection, defect prediction.
Datasets+: We have preprocessed well-known benchmarks (Human-Eval, MBPP, CodeXGLUE, APPS, etc.) and offer an easy-to-load feature for these datasets.
Model Evaluator: We provide an interface to evaluate models on well-known benchmarks (e.g., Human-Eval) with popular metrics (e.g., pass@k) with little effort (~15 LOCs).
Pretrained Models: We supply pretrained checkpoints of state-of-the-art foundational language models of code (CodeBERT, CodeT5, CodeGen, CodeT5+, Incoder, StarCoder, etc.).
Fine-Tuned Models: We furnish fine-tuned checkpoints for 8+ downstream tasks.
Utility to Manipulate Source Code: We provide utilities to easily manipulate source code, such as user-friendly AST parsers (based on tree-sitter) for 15+ programming languages, to extract important code features such as function names and identifiers.
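For readers unfamiliar with the pass@k metric mentioned under Model Evaluator, the sketch below computes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for a single problem with `n` generated samples of which `c` pass the unit tests. This is a standalone illustration, not CodeTF's evaluator code; the function name `pass_at_k` is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


score = pass_at_k(n=10, c=3, k=1)
print(score)
```

A benchmark-level score is usually the mean of this value over all problems.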

The following table shows the supported models with sizes and the tasks that the models support. This is a continuing effort and we are working on further growing the list.

| Model | Size | Tasks |
|-------|------|-------|
| CodeT5 | Base, Base-multi-sum, Base-translate-cs, Base-translate-java, Base-sum, Base-clone, Base-defect | Pretrained, NL to Code, Refine, Translation (CS to Java, Java to CS), Summarization (Python, Go, PHP, JavaScript, Java, Ruby), Clone detection, Defect prediction |
| CodeT5+ | Plus-instruct-16B, Plus-16B, Plus-6B, Plus-2B, Plus-770M-python, Plus-770M, Plus-220M | Pretrained, NL to Code, Refine, Defect prediction |
| CodeGen | Mono: 350M, 2B, 6B, 1B, 3.7B, 7B, 16B<br>Multi: 350M, 2B, 6B<br>NL: 350M, 2B | Pretrained |
| StarCoder | 15.5B | Pretrained |
| SantaCoder | 1.1B | Pretrained |
| GPT-NeoX | 20B | Pretrained |
| GPT-Neo | 1.3B | Pretrained |
| GPT-J | 6B | Pretrained |
| Incoder | 6B | Pretrained |
| CodeParrot | Small-python (110M), Small-multi (110M), 1.5B | Pretrained |
| CodeBERT | CodeBERT-base, UnixCoder-base, CodeBERTa-small | Pretrained |

Installation Guide

(Optional) Create a conda environment:

conda create -n codetf python=3.8
conda activate codetf

Install from PyPI:

pip install salesforce-codetf

Alternatively, build CodeTF from source:

git clone https://github.com/salesforce/CodeTF.git
cd CodeTF
pip install -e .

To make sure the quantization feature works well, also install these dependencies:

pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git

Some models, such as StarCoder, require you to log in to HuggingFace. Obtain a HuggingFace token and log in:

huggingface-cli login
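After installing, a quick sanity check can confirm that the packages are importable. The sketch below is an illustration only: `missing_packages` is a hypothetical helper, and the module names are assumed from the install commands above and from the `codetf` import used elsewhere in this README.

```python
from importlib.util import find_spec


def missing_packages(names=("codetf", "transformers", "peft", "accelerate")):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```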

Getting Started

Inferencing Pipeline

Getting started with CodeTF is simple and quick with the model loading pipeline function load_model_pipeline(). Here is an example showing how to load a CodeT5+ model and perform inference on a code generation task:

from codetf.models import load_model_pipeline

# Load the CodeT5+ 770M Python checkpoint with 8-bit quantization.
code_generation_model = load_model_pipeline(
    model_name="codet5",
    task="pretrained",
    model_type="plus-770M-python",
    is_eval=True,
    load_in_8bit=True,
    load_in_4bit=False,
    weight_sharding=False,
)

result = code_generation_model.predict(["def print_hello_world():"])
print(result)

There are a few notable arguments that need to be considered:

model_name: the name of the model; codet5 and causal-lm are currently supported.
model_type: the type of model for each model name, e.g. base, codegen-350M-mono, j-6B, etc.
load_in_8bit and load_in_4bit: enable the dynamic quantization feature inherited from HuggingFace Quantization.
weight_sharding: an advanced feature that leverages HuggingFace Sharded Checkpoint to split a large model into several smaller shards across different GPUs. Consider using it when dealing with large models.
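The arguments above can be assembled programmatically, for example when switching between model families or quantization levels. The sketch below is a hedged illustration: `build_pipeline_kwargs` and `load_pipeline` are hypothetical helpers, not part of CodeTF, and the sketch assumes that only one of `load_in_8bit` and `load_in_4bit` should be enabled at a time. The `causal-lm` and `codegen-350M-mono` pairing is taken from the argument descriptions and has not been verified against a specific release.

```python
def build_pipeline_kwargs(model_name, model_type, quant_bits=8, weight_sharding=False):
    """Hypothetical helper: assemble keyword arguments for load_model_pipeline()."""
    if model_name not in ("codet5", "causal-lm"):
        raise ValueError(f"unsupported model_name: {model_name!r}")
    if quant_bits not in (None, 4, 8):
        raise ValueError("quant_bits must be None, 4, or 8")
    return {
        "model_name": model_name,
        "task": "pretrained",
        "model_type": model_type,
        "is_eval": True,
        # Assumption: enable at most one quantization mode.
        "load_in_8bit": quant_bits == 8,
        "load_in_4bit": quant_bits == 4,
        "weight_sharding": weight_sharding,
    }


def load_pipeline(model_name, model_type, **options):
    """Hypothetical wrapper: build the arguments, then call CodeTF."""
    # Imported lazily so build_pipeline_kwargs() works without CodeTF installed.
    from codetf.models import load_model_pipeline

    return load_model_pipeline(**build_pipeline_kwargs(model_name, model_type, **options))


kwargs = build_pipeline_kwargs("causal-lm", "codegen-350M-mono", quant_bits=4)
print(kwargs)
```

Calling `load_pipeline("causal-lm", "codegen-350M-mono", quant_bits=4)` would then download and load the model, which requires CodeTF and typically a GPU.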

Model Zoo

You might want to view all of the supported models. To do this, print model_zoo:

from codetf.models import model_zoo
print(model_zoo)
# ============================================================================================================
# Architectures                  Types                           Tasks
# ======================================================================
