LLMBox

LLMBox is a comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation. LLMBox is designed to be a one-stop solution for training and utilizing LLMs. Through a practical library design, we achieve a high level of flexibility and efficiency in both the training and utilization stages.

<img style="display: block; margin: 25px auto;" src="docs/assets/llmbox.png" alt="" />

Key Features

Training

  • Diverse training strategies: We support multiple training strategies, including Supervised Fine-tuning (SFT), Pre-training (PT), PPO and DPO.
  • Comprehensive SFT datasets: We support 9 SFT datasets as the inputs for training.
  • Tokenizer Vocabulary Merging: We support the tokenizer merging function to expand the vocabulary.
  • Data Construction Strategies: We currently support merging multiple datasets for training. Self-Instruct and Evol-Instruct are also available to process the dataset.
  • Parameter Efficient Fine-Tuning: LoRA and QLoRA are supported in SFT or PT.
  • Efficient Training: We support Flash Attention and Deepspeed for efficient training.

Utilization

  • Blazingly Fast: By managing the KV Cache of prefixes or using vLLM, we can speed up local inference by up to 6x 🚀.
  • Comprehensive Evaluation: 59+ commonly used datasets and benchmarks in evaluating LLMs.
  • Evaluation Methods: 📏 Accurately reproduce results from original papers of OpenAI, LLaMA, Mistral, and other models.
  • In-Context Learning: We support various ICL strategies, including KATE, GlobalE, and APE.
  • Chain-of-Thought: For some datasets, we support three types of CoT evaluation: base, least-to-most, and pal.
  • Quantization: BitsAndBytes and GPTQ quantization are supported.
  • Easy To Use: Detailed results are provided for users to debug or integrate new models/datasets/cot.
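As an illustration of the in-context example selection idea behind KATE (not LLMBox's implementation), here is a minimal sketch that ranks candidate demonstrations by similarity to the test input; the bag-of-words "embedding" is a stand-in for a real sentence encoder:

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; a real setup would use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_demos(pool, query, k=2):
    # KATE idea: use the k demonstrations most similar to the query.
    ranked = sorted(pool, key=lambda ex: cosine(embed(ex), embed(query)), reverse=True)
    return ranked[:k]

pool = ["the cat sat on the mat", "stocks fell sharply today", "a cat chased a mouse"]
print(select_demos(pool, "my cat is on the mat", k=2))
```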

Documentation

See the documentation for more details.

Quick Start

Install

git clone https://github.com/RUCAIBox/LLMBox.git && cd LLMBox
pip install -r requirements.txt

If you are only evaluating OpenAI models (or OpenAI-compatible ones such as DeepSeek or Perplexity), you can install only the minimal requirements listed in requirements-openai.txt.

For installation problems, see troubleshooting.

<details> <summary><b>Update LLMBox</b></summary>

Currently, you can simply pull the latest repository from GitHub to update LLMBox.

git pull

If you are facing a merge conflict, please try to drop, stash, or commit your local changes first.

git checkout -b local_changes && git add -p && git commit -m "local changes"
git checkout main
git pull

The commands above show how to commit your local changes on a new branch and then update LLMBox.

</details>

Quick Start with Training

You can start by training an SFT model based on LLaMA-2 (7B) with DeepSpeed ZeRO stage 3:

cd training
bash download.sh
bash bash/run_ds3.sh

Quick Start with Utilization

To utilize your model, or evaluate an existing model, you can run the following command:

python inference.py -m gpt-3.5-turbo -d copa  # --num_shot 0 --model_type chat

By default, this runs the OpenAI GPT-3.5 Turbo model on the COPA dataset in a zero-shot manner.
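For intuition about what `--num_shot` controls, here is a minimal sketch of few-shot prompt assembly; the `Q:`/`A:` template is an assumed example, not LLMBox's actual prompt format:

```python
def build_prompt(demos, query, num_shot=0):
    # num_shot == 0 (zero-shot): the prompt contains only the query.
    # num_shot  > 0: that many demonstrations are prepended.
    blocks = [f"Q: {q}\nA: {a}" for q, a in demos[:num_shot]]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

demos = [("2+2?", "4"), ("3+3?", "6")]
print(build_prompt(demos, "5+5?", num_shot=0))  # zero-shot: just the query
print(build_prompt(demos, "5+5?", num_shot=2))  # two demonstrations first
```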

Training

LLMBox Training supports various training strategies and dataset construction strategies, along with some efficiency-improving modules. You can train your model with the following command:

python train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/ \
    --dataset alpaca_data_1k.json \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 2 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --save_strategy "epoch" \
    --save_steps 2 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --lr_scheduler_type "constant"
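One detail worth checking in this configuration: the effective global batch size is the per-device batch size times the gradient-accumulation steps times the number of devices. A quick sketch of that arithmetic (the 4-GPU count is an assumed example):

```python
def effective_batch_size(per_device, grad_accum, num_devices):
    # Each optimizer update sees this many training examples in total.
    return per_device * grad_accum * num_devices

# With the flags above (batch size 8, accumulation 2) on, say, 4 GPUs:
print(effective_batch_size(8, 2, 4))  # 64
```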

Alternatively, you can use one of the preset bash scripts below to train your model.

Merging Tokenizer

If you want to pre-train your models on corpora containing languages or tokens that are not well supported in the original language models (e.g., LLaMA), we provide a tokenizer merging function that expands the vocabulary based on the corpora using sentencepiece. See merge_tokenizer.py for detailed information, and follow the guide in Pre-train.

bash bash/run_7b_pt.sh
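merge_tokenizer.py is the authoritative implementation; conceptually, vocabulary merging boils down to appending the pieces of the corpus-specific tokenizer that the base vocabulary lacks, as this simplified sketch shows (the piece strings are made-up examples):

```python
def merge_vocab(base_pieces, new_pieces):
    # Append pieces from the corpus-specific tokenizer that the base
    # vocabulary lacks. Existing IDs stay stable, so the base model's
    # embeddings remain valid; new embedding rows are needed only for
    # the appended tokens.
    merged = list(base_pieces)
    seen = set(base_pieces)
    for piece in new_pieces:
        if piece not in seen:
            merged.append(piece)
            seen.add(piece)
    return merged

base = ["<s>", "</s>", "▁the", "▁cat"]
extra = ["▁the", "猫"]  # "▁the" already exists; "猫" is new
print(merge_vocab(base, extra))
```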

Merging Datasets

If you want to train your models on a mix of multiple datasets, you can pass a list of dataset files or names to LLMBox. LLMBox will convert each file or name into a PTDataset or SFTDataset and merge them into a combined dataset. You can also set the merging ratio of each dataset by passing a list of floats to LLMBox. Please follow the guide in Merge Dataset.

bash bash/run_7b_hybrid.sh
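As a rough sketch of what ratio-based merging means (not the actual PTDataset/SFTDataset machinery), the following samples each training example from a source dataset chosen with probability proportional to its ratio:

```python
import random

def mix_datasets(datasets, ratios, n, seed=0):
    # Draw n training examples; the source dataset for each draw is
    # picked with probability proportional to its ratio.
    rng = random.Random(seed)
    mixed = []
    for _ in range(n):
        i = rng.choices(range(len(datasets)), weights=ratios, k=1)[0]
        mixed.append(rng.choice(datasets[i]))
    return mixed

sft = [{"src": "alpaca", "instruction": "Summarize the text"}]
pt = [{"src": "pile", "text": "raw corpus line"}]
sample = mix_datasets([sft, pt], ratios=[0.75, 0.25], n=8)
print(sum(ex["src"] == "alpaca" for ex in sample), "of 8 draws from the SFT set")
```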

Self-Instruct and Evol-Instruct

Since manually creating high-quality instruction data to train a model is time-consuming and labor-intensive, Self-Instruct and Evol-Instruct were proposed to create large amounts of instruction data with varying levels of complexity using LLMs instead of humans. LLMBox supports both Self-Instruct and Evol-Instruct to augment or enhance the input data files. Please follow the guide in Self-Instruct and Evol-Instruct.

python self_instruct/self_instruct.py --seed_tasks_path=seed_tasks.jsonl
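Conceptually, one Self-Instruct iteration prompts an LLM with a few seed tasks, asks for a new instruction, and filters near-duplicates before adding it to the pool. The sketch below stubs out the model call and uses exact-match filtering where the real pipeline uses ROUGE-L overlap:

```python
def generate_instruction(seed_tasks, llm):
    # One Self-Instruct step (sketched): show the model a few seed tasks
    # and ask for a new, different instruction. `llm` is a placeholder
    # for a real model call.
    prompt = "Write a new task unlike these:\n" + "\n".join(
        f"- {t['instruction']}" for t in seed_tasks
    )
    candidate = llm(prompt)
    # Toy de-duplication by exact match; the real pipeline filters by
    # ROUGE-L overlap against the growing task pool.
    if candidate in {t["instruction"] for t in seed_tasks}:
        return None
    return candidate

seeds = [{"instruction": "Translate to French"}, {"instruction": "Sort a list"}]
fake_llm = lambda prompt: "Explain recursion to a child"  # stub, not a real model
print(generate_instruction(seeds, fake_llm))
```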

For more details, view the training documentation.

Utilization

We provide broad support for Hugging Face models (e.g., LLaMA-3, Mistral, or a model you are building on), as well as OpenAI, Anthropic, QWen, and other OpenAI-compatible models. Full list of model backends: here.

Currently a total of 59+ commonly used datasets are supported, including: HellaSwag, MMLU, GSM8K, GPQA, AGIEval, CEval, and CMMLU. Full list of datasets: here.

CUDA_VISIBLE_DEVICES=0 python inference.py \
  -m llama-2-7b-hf \
  -d mmlu agieval:[English] \
  --model_type chat \
  --num_shot 5 \
  --ranking_type ppl_no_option
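The `ppl_no_option` ranking above scores each candidate answer by perplexity and picks the lowest. A minimal sketch of that idea, with made-up per-token log-probabilities:

```python
import math

def ppl(logprobs):
    # Perplexity = exp(-mean token log-probability).
    return math.exp(-sum(logprobs) / len(logprobs))

def rank_by_ppl(candidates):
    # Lower perplexity means the model finds the continuation more
    # likely, so the best choice minimizes perplexity.
    return min(candidates, key=lambda c: ppl(candidates[c]))

# Toy per-token log-probs for two candidate answers (assumed values):
candidates = {"A": [-0.2, -0.3, -0.1], "B": [-1.5, -2.0, -1.0]}
print(rank_by_ppl(candidates))  # "A"
```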
  • 🔥 Recently supported datasets: imbue_code, imbue_public, and imbue_private.

  • 🔥 See benchmarking LLaMA3 for more examples.

<table> <tr> <td colspan=4 align="center"><b>Performance</b></td> </tr> <tr> <td rowspan=2><b>Model</b></td> <td><code>get_ppl</code></td> <td><code>get_prob</code></td> <td><code>generation</code></td> </tr> <tr> <td><b>Hellaswag (0-shot)</b></td> <td><b>MMLU (5-shot)</b></td> <td><b>GSM (8-shot)</b></td> </tr> <tr> <td><b>GPT-3.5 Turbo</b></td> <td>79.98</td> <td>69.25</td> <td>75.13</td> </tr> <tr> <td><b>LLaMA-2 (7B)</b></td> <td>76</td> <td>45.95</td> <td>14.63</td> </tr> </table>

Efficient Evaluation

By default, we enable prefix caching for efficient evaluation. vLLM is also supported.

<table> <tr> <td colspan=6 align="center"><b>Time</b></td> </tr> <tr> <td rowspan=2><b>Model</b></td> <td rowspan=2><b>Efficient Method</b></td> <td><code>get_ppl</code></td> <td><code>get_prob</code></td> <td><code>generation</code></td> </tr> <tr> <td><b>Hellaswag (0-shot)</b></td> <td><b>MMLU (5-shot)</b></td> <td><b>GSM (8-shot)</b></td> </tr> <tr> <td rowspan=3><b>LLaMA-2 (7B)</b></td> <td><b>Vanilla</b></td> <td>0:05:32</td> <td>0:18:30</td> <td>2:10:27</td> </tr> <tr> <td><b>vLLM</b></td> <td>0:06:37</td> <td>0:14:55</td> <td>0:03:36</td> </tr> <tr> <td><b>Prefix Caching</b></td> <td>0:05:48</td> <td>0:05:51</td> <td>0:17:13</td> </tr> </table>

You can also use the following command to run with vLLM:

python inference.py -m ../Llama-2-7b-hf -d mmlu:abstract_algebra,anatomy --vllm True  # --prefix_caching False --flash_attention False
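For intuition about why prefix caching helps: the few-shot prefix is identical across test questions, so its KV state can be computed once and reused. A toy sketch, where the `encode` stub stands in for a real forward pass:

```python
calls = {"encode": 0}
kv_cache = {}

def encode(tokens):
    # Stand-in for the forward pass that builds a KV cache for `tokens`.
    key = tuple(tokens)
    if key not in kv_cache:
        calls["encode"] += 1
        kv_cache[key] = f"kv({len(tokens)} tokens)"
    return kv_cache[key]

def score(prefix, question):
    kv = encode(prefix)          # cache hit after the first question
    return f"{kv} + {question}"  # only the new question needs fresh compute

prefix = ["few", "shot", "examples"]  # shared by every test question
for q in ["q1", "q2", "q3"]:
    score(prefix, q)
print(calls["encode"])  # the prefix is encoded once, not three times
```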

To evaluate with quantization, you can use the following command:

python inference.py -m model -d dataset --load_in_4bits  # --load_in_8_bits or --gptq
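For intuition, the sketch below shows a symmetric linear 4-bit quantization round trip with one scale per tensor; real schemes such as BitsAndBytes' NF4 or GPTQ are per-block and more sophisticated:

```python
def quantize_4bit(weights):
    # Symmetric linear 4-bit quantization: map floats to ints in [-8, 7]
    # with a single per-tensor scale (simplest possible illustration).
    scale = max(abs(w) for w in weights) / 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.42, -0.13, 0.7, -0.35]  # made-up weights
q, s = quantize_4bit(w)
approx = dequantize(q, s)
print(max(abs(a - b) for a, b in zip(w, approx)))  # small reconstruction error
```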

Evaluation Method

Various types of evaluation methods are supported:

<table> <tr> <td><b>Dataset</b></td> <td><b>Evaluation Method</b></td> <td><b>Instruction</b></td> </tr> </table>