# LLMBox
LLMBox is a comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation. LLMBox is designed to be a one-stop solution for training and utilizing LLMs. Through a practical library design, we achieve a high-level of flexibility and efficiency in both training and utilization stages.
<img style="display: block; margin: 25 auto;" src="docs/assets/llmbox.png" alt="" />

## Key Features

### Training
- **Diverse training strategies**: We support multiple training strategies, including Supervised Fine-tuning (`SFT`), Pre-training (`PT`), `PPO`, and `DPO`.
- **Comprehensive SFT datasets**: We support 9 SFT datasets as inputs for training.
- **Tokenizer vocabulary merging**: We support tokenizer merging to expand the vocabulary.
- **Data construction strategies**: We currently support merging multiple datasets for training. `Self-Instruct` and `Evol-Instruct` are also available to process the dataset.
- **Parameter-efficient fine-tuning**: `LoRA` and `QLoRA` are supported in SFT or PT.
- **Efficient training**: We support `Flash Attention` and `DeepSpeed` for efficient training.
### Utilization
- **Blazingly fast**: By managing the KV cache of prefixes or using `vLLM`, we can speed up local inference by up to 6x 🚀.
- **Comprehensive evaluation**: 59+ commonly used datasets and benchmarks for evaluating LLMs.
- **Evaluation methods**: 📏 Accurately reproduce results from the original papers of OpenAI, LLaMA, Mistral, and other models.
- **In-context learning**: We support various ICL strategies, including `KATE`, `GlobalE`, and `APE`.
- **Chain-of-thought**: For some datasets, we support three types of CoT evaluation: `base`, `least-to-most`, and `pal`.
- **Quantization**: `BitsAndBytes` and GPTQ quantization are supported.
- **Easy to use**: Detailed results are provided to help users debug or integrate new models, datasets, or CoT strategies.
## Documentation

See the documentation for more details.
## Quick Start

### Install
```bash
git clone https://github.com/RUCAIBox/LLMBox.git && cd LLMBox
pip install -r requirements.txt
```
If you are only evaluating OpenAI models (or OpenAI-compatible ones such as DeepSeek or Perplexity), you can install the minimal requirements from `requirements-openai.txt`.

For installation problems, see troubleshooting.
<details> <summary><b>Update LLMBox</b></summary>

Currently, you can simply pull the latest repository from GitHub to update LLMBox.

```bash
git pull
```
If you are facing a merge conflict, please try to drop, stash, or commit your local changes first.
```bash
git checkout -b local_changes && git add -p && git commit -m "local changes"
git checkout main
git pull
```

The above commands commit your local changes to a new branch and then update LLMBox.
</details>

### Quick Start with Training

You can start by training an SFT model based on LLaMA-2 (7B) with DeepSpeed ZeRO-3:
```bash
cd training
bash download.sh
bash bash/run_ds3.sh
```
### Quick Start with Utilization
To utilize your model, or evaluate an existing model, you can run the following command:
```bash
python inference.py -m gpt-3.5-turbo -d copa  # --num_shot 0 --model_type chat
```

By default, this runs the OpenAI GPT-3.5 Turbo model on the CoPA dataset in a zero-shot manner.
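The difference between zero-shot and few-shot evaluation is simply whether demonstrations are prepended to the prompt. A minimal sketch (illustrative only; this is not LLMBox's actual prompt builder, and the `build_prompt` helper is hypothetical):

```python
# Sketch of zero-shot vs. few-shot prompt construction. An empty
# demonstration list corresponds to --num_shot 0 (zero-shot).

def build_prompt(question: str, demos=()) -> str:
    """Prepend (question, answer) demonstrations before the test question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

zero_shot = build_prompt("What is the capital of France?")
few_shot = build_prompt("What is the capital of France?",
                        demos=[("What is 2+2?", "4")])
```

With `--num_shot 5`, five such demonstrations would be sampled from the training split and prepended the same way.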
## Training
LLMBox Training supports various training strategies and dataset construction strategies, along with some efficiency-improving modules. You can train your model with the following command:
```bash
python train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/ \
    --dataset alpaca_data_1k.json \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 2 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --save_strategy "epoch" \
    --save_steps 2 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --lr_scheduler_type "constant"
```
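Note that `per_device_train_batch_size` and `gradient_accumulation_steps` multiply: gradients from several micro-batches are accumulated before each optimizer step. A quick sketch of the resulting global batch size (a general rule for this style of trainer, not an LLMBox-specific formula):

```python
# Effective global batch size:
# per-device batch size x gradient accumulation steps x number of GPUs.

def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    return per_device * grad_accum * num_gpus

# With the flags above on a single GPU: 8 * 2 * 1 = 16 examples per
# optimizer step; on 4 GPUs the same flags give 64.
single_gpu = effective_batch_size(8, 2, 1)
four_gpus = effective_batch_size(8, 2, 4)
```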
Alternatively, you can use the following preset bash scripts to train your model:
### Merging Tokenizer

If you want to pre-train your model on corpora with languages or tokens not well supported by the original language model (e.g., LLaMA), we provide a tokenizer merging function to expand the vocabulary based on the corpora using sentencepiece. You can check merge_tokenizer.py for detailed information. Please follow the guide in Pre-train.
```bash
bash bash/run_7b_pt.sh
```
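The core idea behind vocabulary merging can be shown in plain Python: append tokens learned on the new corpora that the base vocabulary lacks, keeping existing token IDs stable. This is only a conceptual sketch; merge_tokenizer.py operates on real sentencepiece models, and `merge_vocab` is a hypothetical helper:

```python
# Conceptual sketch of tokenizer vocabulary merging (not the real
# sentencepiece-based implementation in merge_tokenizer.py).

def merge_vocab(base: list, extra: list) -> list:
    """Append tokens from `extra` missing in `base`; base IDs stay stable."""
    seen = set(base)
    merged = list(base)
    for tok in extra:
        if tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged

base_vocab = ["<s>", "</s>", "the", "ing"]   # toy base vocabulary
new_tokens = ["的", "the", "语言"]            # tokens learned on new corpora
merged = merge_vocab(base_vocab, new_tokens)
# IDs 0-3 are unchanged; only the two unseen tokens are appended.
```

After merging, the model's embedding matrix must be resized so the new rows exist before pre-training continues.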
### Merging Datasets

If you want to train your model on a mix of multiple datasets, you can pass a list of dataset files or names to LLMBox. LLMBox converts each file or name into a PTDataset or SFTDataset and merges them into a combined dataset. You can also set the merging ratio of each dataset by passing a list of floats to LLMBox. Please follow the guide in Merge Dataset.
```bash
bash bash/run_7b_hybrid.sh
```
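One way ratio-based merging can work is to subsample each dataset by its ratio and shuffle the union. This is a sketch of that idea under assumed semantics (see the Merge Dataset guide for LLMBox's actual behavior; `merge_datasets` is a hypothetical helper):

```python
import random

# Sketch of merging datasets with per-dataset ratios: sample
# floor(ratio * len) examples from each, then shuffle the union.

def merge_datasets(datasets, ratios, seed=0):
    rng = random.Random(seed)
    merged = []
    for data, ratio in zip(datasets, ratios):
        k = int(len(data) * ratio)
        merged.extend(rng.sample(data, k))
    rng.shuffle(merged)
    return merged

sft = [f"sft_{i}" for i in range(100)]
pt = [f"pt_{i}" for i in range(100)]
mixed = merge_datasets([sft, pt], ratios=[1.0, 0.5])
# 150 examples: all of `sft` plus half of `pt`.
```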
### Self-Instruct and Evol-Instruct

Since manually creating high-quality instruction data to train the model is very time-consuming and labor-intensive, Self-Instruct and Evol-Instruct were proposed to create large amounts of instruction data with varying levels of complexity using an LLM instead of humans. LLMBox supports both Self-Instruct and Evol-Instruct to augment or enhance the input data files. Please follow the guide in Self-Instruct and Evol-Instruct.
```bash
python self_instruct/self_instruct.py --seed_tasks_path=seed_tasks.jsonl
```
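At its core, Self-Instruct samples a few seed tasks and prompts an LLM to continue the list with new instructions. A minimal sketch of that prompt assembly (the LLM call itself is omitted, and `build_self_instruct_prompt` is a hypothetical helper, not LLMBox's implementation):

```python
import random

# Sketch of Self-Instruct prompt construction: a few seed instructions
# are listed, and the model is asked to continue the numbered list.

def build_self_instruct_prompt(seed_tasks, num_demos=3, seed=0):
    rng = random.Random(seed)
    demos = rng.sample(seed_tasks, num_demos)
    lines = ["Come up with a series of new tasks:"]
    for i, task in enumerate(demos, 1):
        lines.append(f"{i}. {task['instruction']}")
    lines.append(f"{num_demos + 1}.")  # the model completes from here
    return "\n".join(lines)

seeds = [{"instruction": f"toy seed task {i}"} for i in range(10)]
prompt = build_self_instruct_prompt(seeds)
```

Generated instructions are then filtered for quality and diversity before being added back to the task pool, which is what makes the loop self-sustaining.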
For more details, view the training documentation.
## Utilization

We provide broad support for Hugging Face models (e.g., LLaMA-3, Mistral, or the model you are building on), OpenAI, Anthropic, Qwen, and other OpenAI-compatible models for further utilization. Full list of model backends: here.
Currently a total of 59+ commonly used datasets are supported, including: HellaSwag, MMLU, GSM8K, GPQA, AGIEval, CEval, and CMMLU. Full list of datasets: here.
```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    -m llama-2-7b-hf \
    -d mmlu agieval:[English] \
    --model_type chat \
    --num_shot 5 \
    --ranking_type ppl_no_option
```
- 🔥 Recently supported datasets: `imbue_code`, `imbue_public`, and `imbue_private`.
- 🔥 See benchmarking LLaMA3 for more examples.
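Perplexity-based ranking, as selected by `--ranking_type ppl_no_option` above, scores each candidate answer by the perplexity of its continuation and picks the lowest. A toy sketch with made-up token log-probabilities (a real run obtains them from the model; these helper names are illustrative, not LLMBox's API):

```python
import math

# Rank multiple-choice candidates by continuation perplexity:
# ppl = exp(-mean(token log-probs)); lower is better.

def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rank_by_ppl(candidates):
    """candidates maps each answer to its per-token log-probs."""
    return min(candidates, key=lambda c: perplexity(candidates[c]))

scores = {
    "A": [-0.2, -0.1, -0.3],   # fluent continuation -> low perplexity
    "B": [-2.0, -1.5, -2.2],   # unlikely continuation -> high perplexity
}
best = rank_by_ppl(scores)
```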
### Efficient Evaluation

We enable prefix caching by default for efficient evaluation. vLLM is also supported.
<table> <tr> <td colspan=6 align="center"><b>Time</b></td> </tr> <tr> <td rowspan=2><b>Model</b></td> <td rowspan=2><b>Efficient Method</b></td> <td><code>get_ppl</code></td> <td><code>get_prob</code></td> <td><code>generation</code></td> </tr> <tr> <td><b>Hellaswag (0-shot)</b></td> <td><b>MMLU (5-shot)</b></td> <td><b>GSM (8-shot)</b></td> </tr> <tr> <td rowspan=3><b>LLaMA-2 (7B)</b></td> <td><b>Vanilla</b></td> <td>0:05:32</td> <td>0:18:30</td> <td>2:10:27</td> </tr> <tr> <td><b>vLLM</b></td> <td>0:06:37</td> <td>0:14:55</td> <td>0:03:36</td> </tr> <tr> <td><b>Prefix Caching</b></td> <td>0:05:48</td> <td>0:05:51</td> <td>0:17:13</td> </tr> </table>

You can also use the following command to enable vLLM:

```bash
python inference.py -m ../Llama-2-7b-hf -d mmlu:abstract_algebra,anatomy --vllm True  # --prefix_caching False --flash_attention False
```
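The intuition behind the prefix-caching speedup: in few-shot evaluation every test example shares the same demonstration prefix, so its KV states can be computed once and reused. A toy model of the saved work (purely illustrative; real caching stores attention key/value tensors, not a boolean):

```python
# Toy prefix cache that counts how many "tokens" actually get encoded.

class PrefixCache:
    def __init__(self):
        self.cache = {}
        self.computed_tokens = 0

    def encode(self, prefix, query):
        if prefix not in self.cache:
            self.cache[prefix] = True      # stand-in for cached KV states
            self.computed_tokens += len(prefix.split())
        self.computed_tokens += len(query.split())

cache = PrefixCache()
shared = "demo1 demo2 demo3 demo4 demo5"   # 5-shot prefix, 5 toy tokens
for q in ["q1", "q2", "q3"]:
    cache.encode(shared, q)
# The prefix is encoded once: 5 + 3 = 8 tokens instead of 3 * (5 + 1) = 18.
```

The longer the shared prefix relative to each query, the bigger the win, which is why 5-shot MMLU benefits so much in the table above.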
To evaluate with quantization, you can use the following command:
```bash
python inference.py -m model -d dataset --load_in_4bits  # --load_in_8_bits or --gptq
```
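In miniature, 4-bit quantization maps each weight to one of 16 integer levels plus a scale factor, trading a little precision for a 4-8x memory reduction. A bare-bones sketch (BitsAndBytes and GPTQ use far more sophisticated schemes, e.g. blockwise scales and error-compensating rounding):

```python
# Toy symmetric int4 quantization: round weights to the -8..7 range
# against a single scale, then reconstruct.

def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7   # int4 positive range: 0..7
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.7, -0.3, 0.1, -0.7]
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
# Every value lands in the int4 range; per-weight reconstruction error
# is bounded by scale / 2.
```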
### Evaluation Method
Various types of evaluation methods are supported:
</br> <table> <tr> <td><b>Dataset</b></td> <td><b>Evaluation Method</b></td> <td><b>Instruction</b></td> </tr> <tr> <td><p><b>Generation</b></p> <p><pre><code>{ "question": "when was ...", "answer": [