CodeGeeX
CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)
🌟 The newest model, CodeGeeX4, has been released and open-sourced.
CodeGeeX: A Multilingual Code Generation Model
We introduce CodeGeeX, a large-scale multilingual code generation model with 13 billion parameters, pre-trained on a large code corpus of more than 20 programming languages. As of June 22, 2022, CodeGeeX has been trained on more than 850 billion tokens on a cluster of 1,536 Ascend 910 AI Processors. CodeGeeX has several unique features:
- Multilingual Code Generation: CodeGeeX performs well at generating executable programs in several mainstream programming languages, including Python, C++, Java, JavaScript, Go, and more. DEMO
- Crosslingual Code Translation: CodeGeeX translates code snippets between languages. With one click, it can convert a program into the target language with high accuracy. DEMO
- Customizable Programming Assistant: CodeGeeX is available for free in the VS Code extension marketplace. It supports code completion, explanation, summarization, and more, giving users a better coding experience. VS Code Extension
- Open-Source and Cross-Platform: All code and model weights are publicly available for research purposes. CodeGeeX supports both Ascend and NVIDIA platforms, and inference runs on a single Ascend 910, NVIDIA V100, or A100. Apply Model Weights
HumanEval-X for Realistic Multilingual Benchmarking. To help standardize the evaluation of multilingual code generation and translation, we developed and released the HumanEval-X benchmark. HumanEval-X is a new multilingual benchmark containing 820 human-crafted coding problems in 5 programming languages (Python, C++, Java, JavaScript, and Go), that is, 164 problems per language. Each problem comes with tests and solutions. Usage 🤗 Available in HuggingFace
<img src="resources/en/hx_boxplot.png"> <p align="center"><i>CodeGeeX achieves the highest average performance compared with other open-sourced multilingual baselines.</i> </p>
News
- 🌟 2023-07-24: CodeGeeX2 has been released. It is more powerful, faster, and more lightweight, and it supports 100+ languages and many new features.
- 2023-05-16: The CodeGeeX paper has been accepted by KDD 2023 (Long Beach) and will be presented at the conference.
- 2023-03-30: The CodeGeeX paper is now available on arXiv.
- 2023-02-14: CodeGeeX now supports Cloud Studio, a web IDE from Tencent. Click on the badge at the top of this page to quickly launch an environment to test CodeGeeX.
- 2023-02-13: Many thanks to the OneFlow team for adding a oneflow backend for CodeGeeX inference (even faster than FasterTransformer under FP16!). Check more details here.
- 2023-02: We are hosting the CodeGeeX "Coding With AI" Hackathon. Design cool applications based on CodeGeeX and win prizes (RTX 4090, DJI drone, etc.)!
- 2022-12-31: We released the FasterTransformer version of CodeGeeX in codegeex-fastertransformer. The INT8 accelerated version reaches an average speed of <15ms/token. Happy new year to everyone!
- 2022-12-13: We released the source code of the CodeGeeX VS Code extension in codegeex-vscode-extension. Follow QuickStart to start development.
- 2022-12-11: CodeGeeX is now available for JetBrains IDEs (IntelliJ IDEA, PyCharm, GoLand, CLion, etc.). Download it here.
- 2022-12-04: We released the source code for quantization (requires less GPU RAM: 27GB -> 15GB) and model parallelism (possible to run on multiple GPUs with <8GB RAM).
- 2022-09-30: We released the cross-platform source code and model weights for both Ascend and NVIDIA platforms.
Getting Started
CodeGeeX was initially implemented in MindSpore and trained on Ascend 910 AI Processors. We provide a torch-compatible version based on Megatron-LM to facilitate usage on GPU platforms.
Installation
Python 3.7+, CUDA 11+, PyTorch 1.10+, and DeepSpeed 0.6+ are required. Install the codegeex package via:
git clone git@github.com:THUDM/CodeGeeX.git
cd CodeGeeX
pip install -e .
Alternatively, use the CodeGeeX Docker image to quickly set up the environment (requires nvidia-docker):
docker pull codegeex/codegeex:latest
# To enable GPU support, specify the device IDs with --gpus
docker run --gpus '"device=0,1"' -it --ipc=host --name=codegeex codegeex/codegeex
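When installing from source rather than Docker, it can help to check the version requirements listed above before running anything heavy. The following is a minimal sketch, not part of the codegeex package: the helper names are hypothetical, and it assumes PyTorch and DeepSpeed are importable as `torch` and `deepspeed` and expose a `__version__` attribute. It does not check the CUDA version.

```python
import importlib
import re
import sys

# Minimum versions copied from the requirements line above.
MIN_PYTHON = (3, 7)
# Assumption: these are the import names of PyTorch and DeepSpeed.
MIN_PACKAGES = {"torch": (1, 10), "deepspeed": (0, 6)}


def parse_version(text):
    """Return (major, minor) from strings such as '1.13.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version_text, minimum):
    """True if the (major, minor) part of version_text is at least minimum."""
    return parse_version(version_text) >= minimum


def check_environment():
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
    for name, minimum in MIN_PACKAGES.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            problems.append(f"{name} is not installed")
            continue
        version_text = getattr(module, "__version__", "0.0")
        if not meets_minimum(version_text, minimum):
            problems.append(
                f"{name} {version_text} is older than {minimum[0]}.{minimum[1]}"
            )
    return problems
```

Calling `check_environment()` in the target environment returns a list of human-readable problems, which is empty when all three requirements are met.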
Model Weights
Apply for and download the model weights through this link. You will receive an email with urls.txt, which contains temporary download links. We recommend downloading with aria2 using the following command (make sure you have enough disk space for the ~26GB checkpoint):
aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt
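Since the download is large, a quick free-space check before starting can save a failed run. This is a minimal sketch using only the Python standard library. The 26GB figure comes from the text above; the `copies` parameter is an illustrative assumption to account for the split parts, the merged archive, and the extracted weights possibly coexisting on disk.

```python
import shutil

# Approximate checkpoint size quoted above, in gigabytes.
CHECKPOINT_GB = 26


def free_space_gb(path="."):
    """Free space on the filesystem containing path, in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_checkpoint(path=".", copies=1, checkpoint_gb=CHECKPOINT_GB):
    """True if the filesystem at path can hold `copies` checkpoints.

    `copies` is a hypothetical safety factor, not a documented requirement.
    """
    return free_space_gb(path) >= copies * checkpoint_gb
```

For example, `has_room_for_checkpoint(".", copies=3)` is a conservative check if the parts, the merged archive, and the extracted files are all kept in the same directory.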
Run the following command to get the full model weights:
cat codegeex_13b.tar.gz.* > codegeex_13b.tar.gz
tar xvf codegeex_13b.tar.gz
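On systems without `cat` (for example, a plain Windows shell), the merge step can be reproduced in Python. The sketch below is an assumed equivalent of the `cat` command above, not a script shipped with the repository: the hypothetical `merge_parts` helper concatenates every file whose name starts with the given prefix, in sorted order, into a single archive. Extraction is still left to `tar`.

```python
import glob
import shutil


def merge_parts(part_prefix, output_path, chunk_size=16 * 1024 * 1024):
    """Concatenate files matching '<part_prefix>*' into output_path.

    Parts are joined in lexicographic order, mirroring how the shell
    expands 'codegeex_13b.tar.gz.*'. Returns the list of parts used.
    """
    part_paths = sorted(glob.glob(glob.escape(part_prefix) + "*"))
    if not part_paths:
        raise FileNotFoundError(f"no files match {part_prefix}*")
    with open(output_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Stream in chunks so a multi-GB part never sits in memory.
                shutil.copyfileobj(part, merged, chunk_size)
    return part_paths
```

With the file names used above, the call would be `merge_parts("codegeex_13b.tar.gz.", "codegeex_13b.tar.gz")`. The trailing dot in the prefix keeps the output file itself from matching the pattern.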
Inference on GPUs
Try generating your first program with CodeGeeX. First, specify the path of the model weights in configs/codegeex_13b.sh. Second, write the prompt (a natural language description or a code snippet) into a file, e.g., tests/test_prompt.txt. Then run one of the following scripts:
# On a single GPU (with more than 27GB of GPU memory)
bash ./scripts/test_inference.sh <GPU_ID> ./tests/test_prompt.txt
# With quantization (with more than 15GB of GPU memory)
bash ./scripts/test_inference_quantized.sh <GPU_ID> ./tests/test_prompt.txt
# On multiple GPUs (each with more than 6GB of GPU memory; first convert the checkpoint to MP_SIZE partitions)
bash ./scripts/convert_ckpt_parallel.sh <LOAD_CKPT_PATH> <SAVE_CKPT_PATH> <MP_SIZE>
bash ./scripts/test_inference_parallel.sh <MP_SIZE> ./tests/test_prompt.txt
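The choice between the three modes depends on the memory available per GPU. The sketch below encodes the thresholds and script paths from the commands above in two hypothetical helpers: one writes a prompt file, the other builds the matching command line. It only constructs the command and does not run it. For the multi-GPU case it assumes the checkpoint has already been converted with convert_ckpt_parallel.sh.

```python
from pathlib import Path


def write_prompt(text, prompt_path="tests/test_prompt.txt"):
    """Write the prompt text to a file and return its path as a string."""
    path = Path(prompt_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return str(path)


def choose_command(gpu_memory_gb, prompt_path, gpu_id=0, mp_size=None):
    """Pick the inference script that fits the given per-GPU memory (in GB)."""
    if gpu_memory_gb > 27:
        # Full-precision inference on a single GPU.
        return ["bash", "./scripts/test_inference.sh", str(gpu_id), prompt_path]
    if gpu_memory_gb > 15:
        # Quantized inference on a single GPU.
        return ["bash", "./scripts/test_inference_quantized.sh", str(gpu_id), prompt_path]
    if gpu_memory_gb > 6 and mp_size:
        # Model-parallel inference; the checkpoint must already be split
        # into mp_size partitions.
        return ["bash", "./scripts/test_inference_parallel.sh", str(mp_size), prompt_path]
    raise ValueError("not enough GPU memory, or mp_size missing for multi-GPU mode")
```

For instance, `choose_command(16, "./tests/test_prompt.txt")` returns the quantized command for GPU 0, which can then be passed to `subprocess.run` from the repository root.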
VS Code and JetBrains Extension Guidance
Based on CodeGeeX, we have also developed free extensions for VS Code and JetBrains IDEs, with more to come.
For VS Code, search "codegeex" in the Marketplace or install it here. Detailed instructions can be found in VS Code Extension Guidance. For developers, we have also released the source code in codegeex-vscode-extension; follow QuickStart to start development.
For JetBrains IDEs, search "codegeex" in Plugins or install it here. Make sure your IDE version is 2021.1 or later. CodeGeeX currently supports IntelliJ IDEA, PyCharm, GoLand, CLion, Android Studio, AppCode, Aqua, DataSpell, DataGrip, Rider, RubyMine, and WebStorm.
CodeGeeX: Architecture, Code Corpus, and Implementation
**Arc
