InstructLLaMA

Implements pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), to train and fine-tune the LLaMA2 model to follow human instructions, similar to InstructGPT or ChatGPT, but on a much smaller scale.

Check the article post on the discussion of the project.

Disclaimer

Project Purpose: This project is dedicated to research and education, focusing on the study of individual algorithms rather than the creation of a standard library. If you're looking for a ready-to-use library for production applications, this project may not be suitable for your needs.

Bug Reporting and Contributions: Although regular tests have been conducted, we cannot guarantee the code is bug-free. Bug reports and pull requests are highly encouraged and welcomed.

Optimization: Due to limited GPU resources, we only focus on training on a single GPU. Additionally, the hyper-parameters for the different training scripts are not fine-tuned.

Environment and Requirements

Python 3.10.6
PyTorch 2.1.1
Tensorboard 2.13.0
Bitsandbytes 0.41.3

Code Structure

instruct_llama directory contains main source code for the project.
- configs directory contains all the training configurations like model type, data source, number of iterations, learning rate etc.
- cores directory contains core modules like custom datasets, RL PPO agent etc.
- models directory contains the LLaMA model class and LoRA layers.
- utils directory contains helper modules like logging, checkpointing etc.
- run_pretrain.py run (full-scale) pre-training starting from a random model (supports FSDP and multiple GPUs).
- run_sft.py run (full-scale) supervised fine-tuning starting from pre-trained model(only supports single GPU).
- run_sft_lora.py basically the same as run_sft.py, but support 4-bit QLoRA (only supports single GPU).
- run_rm.py train reward model (full-scale) starting from supervised fine-tuning model (only supports single GPU).
- run_rlhf.py run (full-scale) RL PPO to train policy and value models, starting from supervised fine-tuning model and reward model respectively (supports allocate different models on different GPUs on a single machine, but no DDP or FSDP).
scripts directory contains all source code for convert the model weights and build datasets for different phases.
- build_pretrain_datasets.py build pre-train datasets (save the dataset in Numpy memmap structure), the dataset is optional during training using RLHF with PPO phase.
- build_finetune_datasets.py build fine-tuning datasets (save the dataset to .pkl files).
- build_rm_comparison_datasets.py build comparison datasets(save the dataset to .pkl files), which is used to train the reward model .
- build_rlhf_prompt_datasets.py build prompt only datasets (save the dataset to .pkl files), which is used during RLHF PPO fine-tuning phase.
- convert_meta_checkpoint.py convert Meta's pre-trained LLaMA-2 weights to support our model in plain PyTorch code, so we can load it to start fine-tuning.
- convert_lora_checkpoint.py convert fine-tunned LoRA weights to a full state_dict checkpoint.
inference directory contains the code to do inference, which was adapted from the original LLaMA2 project.
logs directory contains training logs for the different phases.

Project Setup

python3 -m pip install --upgrade pip setuptools

python3 -m pip install -r requirements.txt

Project Overview

Preparation

Here are the steps required to utilize the project:

Download the pre-trained model weights please refer to https://github.com/facebookresearch/llama on how to download it.
Convert Meta's pre-trained model weights so it's compatible with our naming convention. Remember to change the file path before running it.

python3 scripts/convert_meta_checkpoint.py

Download and build training datasets For each training stage, we need to use different datasets, we're already prepared some of the common datasets. However, some datasets are too big to upload, so you may need to re-build it. Here's an example of build fine-tuning datasets, remember to change the file path before running it.

python3 scripts/build_finetune_datasets.py

Training Stages

Run the run_pretrain.py script to train a LLaMA model from scratch. We only use it to train a 1B model so we can later use it as the reward model. Most of the time we would want to use Meta's pre-trained weights and skip this stage altogether.
Run the run_sft.py or run_sft_lora.py script to fine-tune the model, this requires a pre-trained model, such as the one from Meta or from above pretrain stage. Check and maintain the configuration inside configs/sft.py and configs/sft_lora.py if necessary.
Run the run_rm.py script to train a reward model, this requires a fine-tuned model. Check and maintain the configuration inside configs/rm.py if necessary.
Run the run_rlhf.py script to train a policy model using RLHF and PPO, this requires a fine-tuned model and the reward model (frozen). Check and maintain the configuration inside configs/rlhf.py if necessary.

It's important to mention that the training prompts in RLHF stage must coming from the same data distribution as the samples used to train the reward model. Otherwise, the reward signal would be noise and the model will collapse. In addition, the fine-tuning stage should also contain the same sample distribution, since we need the SFT model as a reference policy to compute KL penalties.

Stage 1 - Pre-training (full-scale FSDP)

This stage is when we turn a randomly initialized model into a one that can predict the next token (often called language modeling), this is often the most time and resource consuming phase. This requires a large amount of tokens and GPU power. Most of the time we'd want to use ready-to-use pre-trained model weights, for example from Meta.

It's important to mention, for pret-raining we can't use LoRA or QLoRA to reduce GPU resource. This is also the only stage (in this project) where we can utilize Pytorch FSDP and multiple GPUs to speed up training.

torchrun --nproc_per_node 1 instruct_llama/run_pretrain.py

Stage 2 - Supervised Fine-Tuning (full-scale or 4-bit QLoRA)

This stage is when we turn a pre-trained language model from predicting next token to answer general questions, in a chat formation. This is also referred as the prompt completion, where the model is feed a prompt (user request), and it needs to generate the corresponding completion (answer).

Once we have a pre-trained model and the fine-tuning datasets are ready, we can start doing supervised fine-tuning using full-scale. We start with Meta's pre-trained 7B model, the model context window is limited to 512, and use the hh-rlhf helpful-base dataset which consists of 41k samples to train the model over 2 epochs.

python3 instruct_llama/run_sft.py

We can also use the following script for single node with 4-bit QLoRA.

python3 instruct_llama/run_sft_lora.py

QLoRA Options

The training settings are in configs/sft_lora. These files lets us choose which layers to train and the quantization methods. Note if using 4-bit quantized linear layers, the training speed will slow down 50%~70%, depending on the configurations.

Merge LoRA weights

Since we're using LoRA method, when the training is done, we need to merge the LoRA weights with the pre-trained or fine-tuned model weights. Which can be summarized into the following steps:

Construct a model with LoRA layers, matching the configuration used in fine-tuning but without quantized layers.
Load the pre-trained or fine-tuned weights.
Load the LoRA weights
Set the model to evaluation mode (model.eval()) to merge the weights. This triggers the LoRALinear.train() method, and making the merging process.
Remove any LoRA parameters from the state dict
Save the merged checkpoint

You can use the following script to do the conversion, remember to update the file path in the script accordingly.

python3 scripts/convert_lora_checkpoint.py

Stage 3 - Train Reward Model (full-scale)

After the RM comparison datasets are ready, we can start training a reward model.

Training the reward model involves using a comparison dataset, which has the following structure (a single sample):

a single prompt -> a list of 2 or more completions, where the best one is at index 0, and the worst one is the last in the list

And the objective is to train the model to assign a higher reward to the best (chosen) completion and a lower reward to the worser (rejected) ones.

This training phase demands more GPU RAM compared to fine-tuning, as it involves maintaining multiple computation graphs/gradients during loss computation. The reward model can be initialized from the fine-tuned 7B model, or the pretrained 7B model, where the LM head is replaced with a linear layer that outputs a scalar score. This new scalar head is then jointly trained with the rest of decoder layers. To save computation, we have the option to freeze first N decoder layers.

We don't use LoRA when train the reward model since the scalar head layer is initialized randomly, full-scale training tends to yields better results. We start with Meta's pre-trained 7B model, the model context window is limited to 512, and use the hh-rlhf helpful-base dataset which consists of 41k samples to train the model over 1 epoch.

python3 instruct_llama/run_rm.py

Although the above script only supports training on a single GPU, it can be extended to support multiple GPUs and FSDP, please reference run_pretrain.py on how to do FSDP.

Stage 4 - RLHF with PPO (full-scale)

The last stage is to train the model using the PPO algorithm. This is the most complex part of the project, where it involves lots of mov

InstructLLaMA

Install / Use

README

InstructLLaMA

Disclaimer

Environment and Requirements

Code Structure

Project Setup

Project Overview

Preparation

Training Stages

Stage 1 - Pre-training (full-scale FSDP)

Stage 2 - Supervised Fine-Tuning (full-scale or 4-bit QLoRA)

QLoRA Options

Merge LoRA weights

Stage 3 - Train Reward Model (full-scale)

Stage 4 - RLHF with PPO (full-scale)