# RLHF

Implementation of Chinese ChatGPT
## Features

This repo provides three main features:

- LLM pretraining: pretraining of common open-source models, covering decoder architectures (LLaMA, GPT) and encoder architectures (GLM)
- LLM evaluation: zero-shot and few-shot evaluation in the style of GPT-type models
- ChatGPT training pipeline: the three stages described in Learning to Summarize from Human Feedback: SFT, reward model, and RLHF
  - In the RLHF stage, supports (1) jointly optimizing the reward model and the policy, or (2) optimizing the policy alone with the reward model frozen
  - Supports DPO as an alternative to the Reward+RLHF pipeline, which significantly reduces GPU memory usage while achieving a similar effect to RL training
## Setup

### 1. Install deepspeed

```bash
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_OPS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 | tee build.log
```
If you want to build a binary wheel that can be installed on other machines, use the following commands instead; an installable file such as deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl will be generated in the dist directory:
```bash
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel 2>&1 | tee build.log
```
PS: adjust TORCH_CUDA_ARCH_LIST="7.0" to the compute capability of your own NVIDIA GPU (e.g. 7.0 for V100, 8.0 for A100), or run torch.cuda.get_device_capability() to query it directly.
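For example, a quick way to derive the value from the current device (a minimal sketch, assuming torch is installed and a CUDA device is visible):

```python
import torch

# Compute capability of GPU 0, e.g. (7, 0) for V100 or (8, 0) for A100.
major, minor = torch.cuda.get_device_capability(0)
print(f'TORCH_CUDA_ARCH_LIST="{major}.{minor}"')  # export this before building
```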
### 2. Install jieba

Pangu-type models use special tokens of the form <sep>, <pad>, etc., and the tokenize() function in tokenization_gptpangu.py segments text with jieba. However, with a stock pip install jieba, < and > are always split off from the rest of the token, and jieba.add_word("<sep>") does not help: jieba hardcodes the characters on which it always splits, and these include < and >.

You therefore need to run:
```bash
git clone https://github.com/fxsjy/jieba.git
cd jieba
```
With the code cloned locally, modify the value of re_han_default in jieba/__init__.py as follows:

- Before:

```python
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
```

- After:

```python
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-<>]+)", re.U)
```
After the change, run pip install . to build and install it locally, replacing the stock jieba. Once installed, calling jieba.add_word("<sep>") in your code (already included in tokenization_gptpangu.py) ensures that special tokens like <sep> are no longer split into multiple ids.
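As a quick sanity check that the patched jieba keeps the special token intact (a minimal sketch; the sample sentence is arbitrary):

```python
import jieba

jieba.add_word("<sep>")  # already done in tokenization_gptpangu.py
print(jieba.lcut("你好<sep>世界"))
# With the patched jieba: ['你好', '<sep>', '世界']
# With stock jieba, <sep> would be split into several pieces.
```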
### 3. Install apex (Optional)

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check . 2>&1 | tee build.log
```
As above, if you want to build a binary wheel for installation on other machines, use the following commands; an installable file such as apex-0.0.1+7150e20-cp38-cp38-linux_x86_64.whl will be generated in the dist directory:
```bash
git clone https://github.com/NVIDIA/apex
cd apex
python setup.py --cpp_ext --cuda_ext bdist_wheel 2>&1 | tee build.log
```
## Data & Model Download

### 1. Pretrained model download

| Model | Size | HuggingFace link | Baidu Netdisk link | Extraction code |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| Pangu-350M | 659MB | sunzeyeah/pangu-350M | Pangu-350M | c5jj |
| Pangu-2.6B | 9.8GB | sunzeyeah/pangu-2_6B | Pangu-2.6B | 2rad |
| Pangu-13B | 23.6GB | sunzeyeah/pangu-13B | Pangu-13B | u3dx |
| GLM-350M-chinese | 679MB | sunzeyeah/glm-350M-chinese | GLM-350M-chinese | ii8e |
| GLM-10B-chinese | 18.4GB | sunzeyeah/glm-10B-chinese | GLM-10B-chinese | fynj |
| ChatGLM-6B | 25.6GB | sunzeyeah/chatglm-6B | ChatGLM-6B | uq1k |
PS: Among the pretrained model downloads provided in this repo:

- For pytorch_model*.bin:
  - if the source release already includes it, it is left unchanged;
  - if not, it was converted to pytorch_model*.bin from the checkpoint the source provides.
- The remaining files may differ from the originals, including modeling_*.py, tokenization_*.py, configuration_*.py, config.json and tokenizer.config.
### 2. Data download

| Dataset | Size | HuggingFace link | Baidu Netdisk link | Extraction code |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| CLUE Benchmark | 500MB | | CLUE Benchmark | m6gt |
| SFT & Reward Data | 5GB | sunzeyeah/chinese_chatgpt_corpus | SFT & Reward Data | ecyc |
| Encyclopedia QA | 652MB | | baike_qa_2019 | 7jad |
| Zhidao QA | 847MB | | zhidao | neds |
| Couplets | 221MB | | couplets | 54ey |
| Classical Chinese | 125MB | | Classical & Modern | a4cr |
| Classical poetry | 87MB | | chinese poetry | 5zzj |
| Weibo news comments | 522MB | | weibo summary comments | w0g1 |
PS: The SFT & Reward Data was constructed from the encyclopedia QA, Zhidao QA, couplets, classical Chinese, classical poetry, and Weibo news comment datasets, and can be used directly for SFT and reward model training. See data_prepare.py for details.
## Usage

### 1. LLM pretraining

Incremental pretraining of open-source LLMs, implemented with deepspeed. Two classes of model architecture are currently supported:

- decoder architectures: LLaMA, Baichuan, Pangu
- encoder architectures: GLM, ChatGLM

```bash
cd examples
bash pretrain.sh
```
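For reference, the core of incremental pretraining is the standard causal language modeling objective, where the loss is computed over all tokens. A minimal sketch below; the checkpoint name is a placeholder, not the repo's actual configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the repo targets LLaMA/Baichuan/Pangu/GLM/ChatGLM weights.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer("今天天气真好", return_tensors="pt")
# Labels are the inputs themselves: every token contributes to the loss.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # an optimizer step would follow in a real training loop
```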
### 2. LLM evaluation

Zero-shot, one-shot, or few-shot evaluation of open-source Chinese LLMs. See eval_pretrain.py and data.py for details.

Currently supported evaluation tasks:

- C-Eval
- MMLU
- CLUEBenchmark: the evaluation method and prompt templates follow the Pangu-alpha paper

Currently supported open-source models:

- LLaMA and its derivatives
- ChatGLM (v1 and v2)
- Baichuan
- Qwen
- Pangu
- GLM

```bash
cd examples
bash eval_pretrain.sh
```
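As a rough illustration of how zero-shot evaluation of a multiple-choice task can be implemented (a generic sketch, not the exact logic of eval_pretrain.py): score each candidate answer by its log-likelihood under the model and pick the highest-scoring one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_logprob(prompt: str, option: str) -> float:
    """Log-likelihood of the option tokens conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Position i predicts token i+1; sum over the option's positions only.
    return sum(log_probs[i, full_ids[0, i + 1]].item()
               for i in range(prompt_len - 1, full_ids.shape[1] - 1))

question = "Question: 1 + 1 = ?\nAnswer: "
print(max(["1", "2", "3", "4"], key=lambda o: option_logprob(question, o)))
```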
### 3. SFT

SFT training using an open-source LLM and the SFT & Reward data:

```bash
cd examples
bash train_sft.sh
```
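The essential difference from pretraining is that SFT usually computes the loss only on the response tokens, masking the prompt positions with -100. A minimal sketch with a placeholder checkpoint and example text (not the repo's exact preprocessing):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt, response = "问:中国的首都是哪里?\n答:", "北京"
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, :prompt_len] = -100  # ignore prompt tokens in the loss
loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()
```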
### 4. Reward Model

Reward model training using the SFT model and the SFT & Reward data:

```bash
cd examples
bash train_reward.sh
```
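Following Learning to Summarize from Human Feedback, the reward model is trained with a pairwise ranking loss: for each comparison pair, the chosen response should receive a higher scalar reward than the rejected one. A minimal sketch of the objective (the reward values below are dummy numbers):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected) pushes chosen rewards above rejected ones.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scalar rewards for a batch of three comparison pairs.
print(pairwise_reward_loss(torch.tensor([1.2, 0.3, -0.5]),
                           torch.tensor([0.8, 0.9, -1.0])))
```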
### 5. RLHF

Use the PPO algorithm together with the reward model to further update the SFT model. Implemented on top of the open-source DeepSpeedChat framework:

```bash
cd examples
bash train_rlhf.sh
```
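At the heart of the PPO update is the clipped surrogate objective over the policy's token log-probabilities. A generic PPO sketch, not DeepSpeedChat's exact code; the tensors stand in for rollout values:

```python
import torch

def ppo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the rollout-time policy.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Maximize the clipped surrogate, i.e. minimize its negation.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

print(ppo_policy_loss(torch.tensor([-1.0, -0.5]),
                      torch.tensor([-1.1, -0.4]),
                      torch.tensor([0.7, -0.2])))
```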
### 6. DPO

Use the DPO algorithm to replace the Reward+RLHF pipeline: it removes the need to train a reward model while achieving a similar effect to RL training, and significantly reduces GPU memory usage. Implemented on top of the open-source trl framework:

```bash
cd examples
bash train_dpo.sh
```
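For reference, the DPO objective compares the policy's and a frozen reference model's sequence log-probabilities on chosen versus rejected responses. A generic sketch of the loss, not trl's exact implementation; beta and the numbers below are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Log-ratios of the policy against the frozen reference on each response.
    chosen_ratio = policy_chosen - ref_chosen
    rejected_ratio = policy_rejected - ref_rejected
    # Prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy sequence log-probabilities for two preference pairs.
print(dpo_loss(torch.tensor([-12.0, -8.0]), torch.tensor([-13.5, -7.5]),
               torch.tensor([-12.5, -8.2]), torch.tensor([-13.0, -7.8])))
```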
## Results

### 1. LLM evaluation

<details>
<summary><b>C-Eval 5-shot test set results</b></summary>
<table>
<tr> <td>Model</td> <td>Avg</td> <td>Avg(Hard)</td> <td>STEM</td> <td>Social Science</td> <td>Humanities</td> <td>Other</td> </tr>
<tr> <td>Baichuan2-13B-Chat</td> <td style="color:red"><b>56.30</b></td> <td>34.20</td> <td style="color:red"><b>48.20</b></td> <td style="color:red"><b>70.00</b></td> <td style="color:red"><b>60.50</b></td> <td>54.20</td> </tr>
<tr> <td>xverse-13B</td> <td>55.30</td> <td>32.50</td> <td>45.90</td> <td>66.70</td> <td>59.50</td> <td style="color:red"><b>57.60</b></td> </tr>
<tr> <td>Qwen-7B-Chat</td> <td>54.70</td> <td>35.40</td> <td>47.90</td> <td>68.30</td> <td>58.70</td> <td>50.00</td> </tr>
<tr> <td>Baichuan-13B-Base</td> <td>53.70</td> <td style="color:red"><b>35.60</b></td> <td>46.80</td> <td>65.80</td> <td>58.00</td> <td>50.80</td> </tr>
<tr> <td>Baichuan2-7B-Chat</td> <td>52.50</td> <td>33.80</td> <td>45.70</td> <td>64.20</td> <td>56.60</td> <td>50.20</td> </tr>
<tr> <td>ChatGLM2-6B</td> <td>51.20</td> <td>33.40</td> <td>46.90</td> <td>63.00</td> <td>51.60</td> <td>47.70</td> </tr>
<tr> <td>Baichuan-13B-Chat</td> <td>47.90</td> <td>31.50</td> <td>41.40</td> <td>56.80</td> <td>53.00</td> <td>46.50</td> </tr>
<tr> <td>Baichuan-7B</td> <td>44.20</td> <td>31.70</td> <td>39.20</td> <td>53.30</td> <td>47.30</td> <td>41.90</td> </tr>
<tr> <td>Ziya-LLaMA-13B-v1.1</td> <td>40.10</td> <td>30.30</td> <td>35.80</td> <td>47.30</td> <td>42.80</td> <td>38.50</td> </tr>
<tr> <td>ChatGLM1.1-6B</td> <td>38.10</td> <td>28.60</td> <td>33.60</td> <td>46.70</td> <td>40.90</td> <td>35.70</td> </tr>
<tr> <td>AtomGPT-13B-56k</td> <td>37.60</td> <td>25.30</td> <td>32.00</td> <td>44.70</td> <td>42.80</td> <td>36.10</td> </tr>
<tr> <td>LLaMA2-13B-chat</td> <td>37.10</td> <td>29.30</td> <td>34.60</td> <td>43.60</td> <td>35.90</td> <td>37.00</td> </tr>
<tr> <td>ChatGLM-6B</td> <td>36.30</td> <td>27.20</td> <td>32.90</td> <td>42.80</td> <td>38.10</td> <td>34.90</td> </tr>
<tr> <td>LLaMA-30B</td> <td>35.90</td> <td>29.90</td> <td>34.40</td> <td>42.40</td> <td>33.30</td> <td>35.60</td> </tr>
<tr> <td>LLaMA2-7B-chat</td> <td>33.50</td> <td>27.30</td> <td>31.60</td> <td>38.10</td> <td>33.80</td> <td>32.70</td> </tr>
<tr> <td>Ziya-LLaMA-13B-Pretrain-v1</td> <td>31.10</td> <td>22.20</td> <td>27.40</td> <td>36.50</td> <td>33.80</td> <td>30.40</td> </tr>
<tr> <td>LLaMA-13B</td> <td>29.80</td> <td>24.20</td> <td>28.40</td> <td>33.70</td> <td>29.60</td> <td>29.00</td> </tr>
<tr> <td>LLaMA-7B</td> <td>26.80</td> <td>26.70</td> <td>26.20</td> <td>27.60</td> <td>25.70</td> <td>28.10</td> </tr>
</table>
</details>

<details>
<summary><b>MMLU 5-shot test set results</b></summary>
<table>
<tr> <td>Model</td> <td>Avg</td> <td>STEM</td>