Lion: Adversarial Distillation of Proprietary Large Language Models (EMNLP 2023)

<a ><img src="pics/Lion.jpg" alt="Lion" style="width: 20%; min-width: 200px; display: block; margin: auto;"></a> <a href="https://arxiv.org/abs/2305.12870">[📄 Paper]</a> | <a href="https://huggingface.co/YuxinJiang/lion-7b">[🤗 Lion Weights]</a>  <hr>

News

[October 8, 2023] Our paper has been accepted to EMNLP 2023.
[June 10, 2023] We released instructions for addressing OOM during fine-tuning; see Training Process.
[May 26, 2023] We released the model weights. Check out the 7B model!
[May 25, 2023] We released an online demo, try our model here!
[May 23, 2023] We released the code for training and inference.

Contents

Overview
Recovering Lion weights
Inference
Training Process
Evaluation
Citation
Disclaimer

Overview

Our adversarial distillation framework crafts a compact Student LLM from a superior closed-source LLM that serves three roles: the Teacher, the Referee, and the Generator. Each iteration consists of three stages:

an imitation stage to align the student’s response with the teacher’s response;
a discrimination stage to identify hard samples;
a generation stage to produce new hard samples for escalating the challenges presented to the student model.
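The three stages above can be sketched as a single loop body. This is a minimal illustration, not the repository's implementation: `run_iteration`, its callable arguments, and the rule that an instruction is "hard" when the teacher outscores the student by at least `threshold` are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Responder = Callable[[str], str]


def run_iteration(
    train_pool: List[str],
    cache_pool: List[str],
    teacher: Responder,
    fit_student: Callable[[List[Tuple[str, str]]], Responder],
    referee: Callable[[str, str, str], Tuple[float, float]],
    generator: Callable[[str], str],
    threshold: float,
) -> Tuple[Responder, List[str]]:
    """Run one imitation / discrimination / generation round (illustrative only)."""
    # Imitation: fit the student on the teacher's responses to the Train Pool.
    student = fit_student([(q, teacher(q)) for q in train_pool])

    # Discrimination: the referee scores both responses on the Cache Pool.
    # Assumption: an instruction is "hard" when the teacher beats the student
    # by at least `threshold`.
    hard = []
    for q in cache_pool:
        teacher_score, student_score = referee(q, teacher(q), student(q))
        if teacher_score - student_score >= threshold:
            hard.append(q)

    # Generation: produce a new instruction from each hard one.
    new_hard = [generator(q) for q in hard]
    return student, new_hard
```

In the actual pipeline each callable corresponds to a script call or a fine-tuning run; passing them as arguments here only keeps the control flow visible.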

Recovering Lion weights

We release Lion weights as delta weights to comply with the LLaMA model license.

Lion-7B (delta weights)

You can add our delta to the original LLaMA weights to obtain the Lion weights. Instructions:

Get the original LLaMA weights in the huggingface format by following the instructions here
Please download our delta model from Hugging Face
Use the following script to obtain the Lion weights by applying our delta:

python src/weight_diff.py recover --path_raw huggyllama/llama-7b --path_diff YuxinJiang/lion-7b --path_tuned <path_to_store_recovered_weights>
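Conceptually, recovering the weights means adding the delta to the original weights, parameter by parameter. The sketch below shows that idea on plain Python lists. It is not the code in `src/weight_diff.py`; the helper names `make_delta` and `apply_delta` and the dictionary-of-lists layout are hypothetical.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def make_delta(tuned: StateDict, raw: StateDict) -> StateDict:
    """Delta = tuned - raw, computed element-wise for every parameter."""
    return {
        name: [t - r for t, r in zip(tuned[name], raw[name])]
        for name in tuned
    }


def apply_delta(raw: StateDict, delta: StateDict) -> StateDict:
    """Recovered = raw + delta, computed element-wise for every parameter."""
    if raw.keys() != delta.keys():
        raise KeyError("raw and delta must contain the same parameter names")
    return {
        name: [r + d for r, d in zip(raw[name], delta[name])]
        for name in raw
    }


# Toy values, chosen so the floating-point arithmetic is exact.
raw = {"w": [1.0, 2.0]}
tuned = {"w": [1.5, 1.0]}
delta = make_delta(tuned, raw)
recovered = apply_delta(raw, delta)
```

Because only the difference is distributed, the recovered weights can be rebuilt only by someone who already has the original LLaMA weights.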

Inference

For inference and training of Lion, please first install the requirements:

pip install -r requirements.txt

We provide a decoding script for Lion that reads an input file, generates a response for each sample, and consolidates the responses into an output file. It can be run on a single machine with a 16GB GPU.

python src/lion_inference.py \
    --model_dir <path_to_hf_converted_lion_ckpt_and_tokenizer> \
    --data_dir <path_to_input_json_file> \
    --output_dir <path_to_output_json_file> \
    --num_gpus 1

Training Process

The steps below show one iteration of our adversarial distillation framework.

1. Imitation Stage

1.1 Acquire the teacher's response on the Train Pool

python src/chatgpt_inference.py \
    -q <path_to_json_file_for_the_Train_Pool> \
    -o <path_to_chatgpt_inference_for_the_Train_Pool> \
    --api_key <your_openai_api_key>

1.2 Instruction-tune the student on the teacher’s response on the Train Pool

Fine-tuning was conducted on a machine with 8 A100 80G GPUs.

torchrun --nproc_per_node=8 --master_port=<your_random_port> src/train.py \
    --model_name_or_path <path_to_hf_converted_ckpt_and_tokenizer> \
    --data_path <path_to_chatgpt_inference_for_the_Train_Pool> \
    --bf16 True \
    --output_dir result \
    --num_train_epochs 3 \
    --model_max_length 1024 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 600 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
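As a quick check on the flags above, the number of sequences consumed per optimizer step can be worked out from the GPU count, the per-device batch size, and the gradient accumulation steps. The helper `effective_batch_size` is a hypothetical name, and the sketch assumes plain data parallelism where every GPU processes a different micro-batch.

```python
def effective_batch_size(num_gpus: int, per_device_batch_size: int,
                         grad_accum_steps: int) -> int:
    """Sequences contributing to one optimizer step under data parallelism."""
    return num_gpus * per_device_batch_size * grad_accum_steps


# Flags from the torchrun command above: 8 GPUs, batch size 2, 8 accumulation steps.
fsdp_batch = effective_batch_size(8, 2, 8)        # 128

# Flags from the DeepSpeed example in "Addressing OOM": 8 GPUs, batch size 16, 1 step.
deepspeed_batch = effective_batch_size(8, 16, 1)  # 128
```

Under this assumption both commands train with the same effective batch size, so lowering `--per_device_train_batch_size` to save memory can be offset by raising `--gradient_accumulation_steps` by the same factor.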

Addressing OOM

Naively, fine-tuning a 7B model requires about 7 x 8 x 2 = 112 GB of VRAM. The commands given above enable parameter sharding, so no redundant model copy is stored on any GPU. If you'd like to further reduce the memory footprint, here are some options:

Turn on CPU offload for FSDP with --fsdp "full_shard auto_wrap offload". This saves VRAM at the cost of longer runtime.

In our experience, DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP with offload. Here's an example that uses DeepSpeed stage-3 on 8 GPUs with both parameter and optimizer offload:

deepspeed src/train_deepspeed.py \
    --model_name_or_path <path_to_hf_converted_ckpt_and_tokenizer> \
    --data_path <path_to_chatgpt_inference_for_the_Train_Pool> \
    --output_dir result \
    --num_train_epochs 3 \
    --model_max_length 1024 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 600 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.03 \
    --logging_steps 1 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed src/configs/deepspeed_config.json \
    --fp16 True

The DeepSpeed library also provides some helpful functions to estimate memory usage.

LoRA fine-tunes low-rank slices of the query, key, and value embedding heads. This can reduce the total memory footprint from 112 GB to about 7 x 4 = 28 GB. We may release our re-implementation of this in the future, but for now the peft codebase can be a useful resource.
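The 112 GB and 28 GB figures above can be reproduced with a back-of-the-envelope calculation. The text does not name the individual factors, so the sketch below makes an assumption: full fine-tuning costs 16 bytes per parameter (one common accounting is 4 bytes each for weights, gradients, and the two Adam moments), while the LoRA figure counts only 4 bytes per parameter for the frozen weights. `estimate_vram_gb` is a hypothetical helper, and 1 GB is taken as 10^9 bytes.

```python
def estimate_vram_gb(params_in_billions: float, bytes_per_param: float) -> float:
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes); ignores activations."""
    return params_in_billions * bytes_per_param


# Assumed accounting for full fine-tuning: 16 bytes per parameter.
full_finetune_gb = estimate_vram_gb(7, 16)  # 112

# Assumed accounting for LoRA: 4 bytes per parameter for the frozen weights.
lora_gb = estimate_vram_gb(7, 4)            # 28
```

These numbers exclude activations and framework overhead, so actual usage depends on sequence length, batch size, and whether gradient checkpointing is enabled.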

2. Discrimination Stage

2.1 Acquire the teacher's response on the Cache Pool

python src/chatgpt_inference.py \
    -q <path_to_json_file_for_the_Cache_Pool> \
    -o <path_to_chatgpt_inference_for_the_Cache_Pool> \
    --api_key <your_openai_api_key>

2.2 Acquire the student's response on the Cache Pool

python src/lion_inference.py \
    --model_dir <path_to_hf_converted_lion_ckpt_and_tokenizer> \
    --data_dir <path_to_json_file_for_the_Cache_Pool> \
    --output_dir <path_to_lion_inference_for_the_Cache_Pool> \
    --num_gpus 8

2.3 Ask the referee to output two scores according to the response quality of the teacher and the student

To mitigate the position bias of the LLM referee, we conduct two runs by exchanging the positions of the teacher's response and the student's response.

python src/chatgpt_referee.py \
    -a <path_to_chatgpt_inference_for_the_Cache_Pool> <path_to_lion_inference_for_the_Cache_Pool> \
    -o <path_to_output_review_chatgpt_lion_file> \
    --api_key <your_openai_api_key>

python src/chatgpt_referee.py \
    -a <path_to_lion_inference_for_the_Cache_Pool> <path_to_chatgpt_inference_for_the_Cache_Pool> \
    -o <path_to_output_review_lion_chatgpt_file> \
    --api_key <your_openai_api_key>
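One way to combine the two referee runs is to average each model's score across both orderings, so that any advantage from appearing first cancels out. The sketch below is an assumption about how the two review files could be merged, not the logic of `src/discrimination.py`; the function name and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def average_swapped_scores(
    review12: List[Tuple[float, float]],
    review21: List[Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Average referee scores over both response orderings.

    review12[i] is (teacher_score, student_score) with the teacher shown first.
    review21[i] is (student_score, teacher_score) with the student shown first.
    Returns one (teacher_avg, student_avg) pair per instruction.
    """
    if len(review12) != len(review21):
        raise ValueError("both runs must cover the same instructions")
    averaged = []
    for (t1, s1), (s2, t2) in zip(review12, review21):
        averaged.append(((t1 + t2) / 2, (s1 + s2) / 2))
    return averaged


# Toy scores: the teacher wins clearly when shown first and ties when shown second.
scores = average_swapped_scores([(9.0, 7.0)], [(8.0, 8.0)])
```

With these toy scores the averaged pair is (8.5, 7.5): the teacher still leads, but by less than the first run alone suggested.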

2.4 Discriminate hard instructions and easy instructions

python src/discrimination.py \
    --review12_path <path_to_output_review_chatgpt_lion_file> \
    --review21_path <path_to_output_review_lion_chatgpt_file> \
    --chatgpt_inference_path <path_to_chatgpt_inference_for_the_Cache_Pool> \
    --lion_inferenc
