# RLHF

Implementation of Chinese ChatGPT
## Features

This repo provides three main features:

- LLM pretraining: pretraining of common open-source models, covering decoder architectures (LLaMA, GPT) and encoder architectures (GLM)
- LLM evaluation: zero-shot and few-shot evaluation in the style of GPT-type models
- ChatGPT training pipeline: the three stages described in Learning to Summarize from Human Feedback: SFT, reward model, and RLHF
  - In the RLHF stage, supports (1) jointly optimizing the reward model and the policy, or (2) optimizing the policy alone with the reward model frozen
  - Supports DPO as an alternative to the Reward+RLHF pipeline, which significantly reduces GPU memory usage while achieving a similar effect to RL training
## Setup

### 1. Install deepspeed

```bash
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_OPS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 | tee build.log
```
If you want to build a binary wheel that can be installed on other machines, use the following commands instead; an installable file such as deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl will be generated in the dist directory:
```bash
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel 2>&1 | tee build.log
```
PS: adjust TORCH_CUDA_ARCH_LIST="7.0" to the compute capability of your own NVIDIA GPU (e.g. 7.0 for V100, 8.0 for A100), or run torch.cuda.get_device_capability() to query it directly.
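For example, a quick way to derive the value from the current device (a minimal sketch, assuming torch is installed and a CUDA device is visible):

```python
import torch

# Compute capability of GPU 0, e.g. (7, 0) for V100 or (8, 0) for A100.
major, minor = torch.cuda.get_device_capability(0)
print(f'TORCH_CUDA_ARCH_LIST="{major}.{minor}"')  # export this before building
```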
### 2. Install jieba

Pangu-type models use special tokens of the form <sep>, <pad>, etc., and the tokenize() function in tokenization_gptpangu.py segments text with jieba. However, with a stock pip install jieba, < and > are always split off from the rest of the token, and jieba.add_word("<sep>") does not help: jieba hardcodes the characters on which it always splits, and these include < and >.

You therefore need to run:
```bash
git clone https://github.com/fxsjy/jieba.git
cd jieba
```
With the code cloned locally, modify the value of re_han_default in jieba/__init__.py as follows:

- Before:

```python
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
```

- After:

```python
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-<>]+)", re.U)
```
After the change, run pip install . to build and install it locally, replacing the stock jieba. Once installed, calling jieba.add_word("<sep>") in your code (already included in tokenization_gptpangu.py) ensures that special tokens like <sep> are no longer split into multiple ids.
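As a quick sanity check that the patched jieba keeps the special token intact (a minimal sketch; the sample sentence is arbitrary):

```python
import jieba

jieba.add_word("<sep>")  # already done in tokenization_gptpangu.py
print(jieba.lcut("你好<sep>世界"))
# With the patched jieba: ['你好', '<sep>', '世界']
# With stock jieba, <sep> would be split into several pieces.
```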
### 3. Install apex (Optional)

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check . 2>&1 | tee build.log
```
As above, if you want to build a binary wheel for installation on other machines, use the following commands; an installable file such as apex-0.0.1+7150e20-cp38-cp38-linux_x86_64.whl will be generated in the dist directory:
```bash
git clone https://github.com/NVIDIA/apex
cd apex
python setup.py --cpp_ext --cuda_ext bdist_wheel 2>&1 | tee build.log
```
## Data & Model Download

### 1. Pretrained model download

| Model | Size | HuggingFace link | Baidu Netdisk link | Extraction code |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| Pangu-350M | 659MB | sunzeyeah/pangu-350M | Pangu-350M | c5jj |
| Pangu-2.6B | 9.8GB | sunzeyeah/pangu-2_6B | Pangu-2.6B | 2rad |
| Pangu-13B | 23.6GB | sunzeyeah/pangu-13B | Pangu-13B | u3dx |
| GLM-350M-chinese | 679MB | sunzeyeah/glm-350M-chinese | GLM-350M-chinese | ii8e |
| GLM-10B-chinese | 18.4GB | sunzeyeah/glm-10B-chinese | GLM-10B-chinese | fynj |
| ChatGLM-6B | 25.6GB | sunzeyeah/chatglm-6B | ChatGLM-6B | uq1k |
PS: Among the pretrained model downloads provided in this repo:

- For pytorch_model*.bin:
  - if the source release already includes it, it is left unchanged;
  - if not, it was converted to pytorch_model*.bin from the checkpoint the source provides.
- The remaining files may differ from the originals, including modeling_*.py, tokenization_*.py, configuration_*.py, config.json and tokenizer.config.
### 2. Data download

| Dataset | Size | HuggingFace link | Baidu Netdisk link | Extraction code |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| CLUE Benchmark | 500MB | | CLUE Benchmark | m6gt |
| SFT & Reward Data | 5GB | sunzeyeah/chinese_chatgpt_corpus | SFT & Reward Data | ecyc |
| Encyclopedia QA | 652MB | | baike_qa_2019 | 7jad |
| Zhidao QA | 847MB | | zhidao | neds |
| Couplets | 221MB | | couplets | 54ey |
| Classical Chinese | 125MB | | Classical & Modern | a4cr |
| Classical poetry | 87MB | | chinese poetry | 5zzj |
| Weibo news comments | 522MB | | weibo summary comments | w0g1 |
PS: The SFT & Reward Data was constructed from the encyclopedia QA, Zhidao QA, couplets, classical Chinese, classical poetry, and Weibo news comment datasets, and can be used directly for SFT and reward model training. See data_prepare.py for details.
## Usage

### 1. LLM pretraining

Incremental pretraining of open-source LLMs, implemented with deepspeed. Two classes of model architecture are currently supported:

- decoder architectures: LLaMA, Baichuan, Pangu
- encoder architectures: GLM, ChatGLM

```bash
cd examples
bash pretrain.sh
```
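For reference, the core of incremental pretraining is the standard causal language modeling objective, where the loss is computed over all tokens. A minimal sketch below; the checkpoint name is a placeholder, not the repo's actual configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the repo targets LLaMA/Baichuan/Pangu/GLM/ChatGLM weights.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer("今天天气真好", return_tensors="pt")
# Labels are the inputs themselves: every token contributes to the loss.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # an optimizer step would follow in a real training loop
```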
### 2. LLM evaluation

Zero-shot, one-shot, or few-shot evaluation of open-source Chinese LLMs. See eval_pretrain.py and data.py for details.

Currently supported evaluation tasks:

- C-Eval
- MMLU
- CLUEBenchmark: the evaluation method and prompt templates follow the Pangu-alpha paper

Currently supported open-source models:

- LLaMA and its derivatives
- ChatGLM (v1 and v2)
- Baichuan
- Qwen
- Pangu
- GLM

```bash
cd examples
bash eval_pretrain.sh
```
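As a rough illustration of how zero-shot evaluation of a multiple-choice task can be implemented (a generic sketch, not the exact logic of eval_pretrain.py): score each candidate answer by its log-likelihood under the model and pick the highest-scoring one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_logprob(prompt: str, option: str) -> float:
    """Log-likelihood of the option tokens conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Position i predicts token i+1; sum over the option's positions only.
    return sum(log_probs[i, full_ids[0, i + 1]].item()
               for i in range(prompt_len - 1, full_ids.shape[1] - 1))

question = "Question: 1 + 1 = ?\nAnswer: "
print(max(["1", "2", "3", "4"], key=lambda o: option_logprob(question, o)))
```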
### 3. SFT

SFT training using an open-source LLM and the SFT & Reward data:

```bash
cd examples
bash train_sft.sh
```
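The essential difference from pretraining is that SFT usually computes the loss only on the response tokens, masking the prompt positions with -100. A minimal sketch with a placeholder checkpoint and example text (not the repo's exact preprocessing):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt, response = "问:中国的首都是哪里?\n答:", "北京"
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, :prompt_len] = -100  # ignore prompt tokens in the loss
loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()
```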
### 4. Reward Model

Reward model training using the SFT model and the SFT & Reward data:

```bash
cd examples
bash train_reward.sh
```
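Following Learning to Summarize from Human Feedback, the reward model is trained with a pairwise ranking loss: for each comparison pair, the chosen response should receive a higher scalar reward than the rejected one. A minimal sketch of the objective (the reward values below are dummy numbers):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected) pushes chosen rewards above rejected ones.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scalar rewards for a batch of three comparison pairs.
print(pairwise_reward_loss(torch.tensor([1.2, 0.3, -0.5]),
                           torch.tensor([0.8, 0.9, -1.0])))
```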
### 5. RLHF

Use the PPO algorithm together with the reward model to further update the SFT model. Implemented on top of the open-source DeepSpeedChat framework:

```bash
cd examples
bash train_rlhf.sh
```
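At the heart of the PPO update is the clipped surrogate objective over the policy's token log-probabilities. A generic PPO sketch, not DeepSpeedChat's exact code; the tensors stand in for rollout values:

```python
import torch

def ppo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the rollout-time policy.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Maximize the clipped surrogate, i.e. minimize its negation.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

print(ppo_policy_loss(torch.tensor([-1.0, -0.5]),
                      torch.tensor([-1.1, -0.4]),
                      torch.tensor([0.7, -0.2])))
```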
### 6. DPO

Use the DPO algorithm to replace the Reward+RLHF pipeline: it removes the need to train a reward model while achieving a similar effect to RL training, and significantly reduces GPU memory usage. Implemented on top of the open-source trl framework:

```bash
cd examples
bash train_dpo.sh
```
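For reference, the DPO objective compares the policy's and a frozen reference model's sequence log-probabilities on chosen versus rejected responses. A generic sketch of the loss, not trl's exact implementation; beta and the numbers below are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Log-ratios of the policy against the frozen reference on each response.
    chosen_ratio = policy_chosen - ref_chosen
    rejected_ratio = policy_rejected - ref_rejected
    # Prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy sequence log-probabilities for two preference pairs.
print(dpo_loss(torch.tensor([-12.0, -8.0]), torch.tensor([-13.5, -7.5]),
               torch.tensor([-12.5, -8.2]), torch.tensor([-13.0, -7.8])))
```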
## Results

### 1. LLM evaluation

<details>
<summary><b>C-Eval 5-shot test set results</b></summary>
<table>
<tr> <td>Model</td> <td>Avg</td> <td>Avg(Hard)</td> <td>STEM</td> <td>Social Science</td> <td>Humanities</td> <td>Other</td> </tr>
<tr> <td>Baichuan2-13B-Chat</td> <td style="color:red"><b>56.30</b></td> <td>34.20</td> <td style="color:red"><b>48.20</b></td> <td style="color:red"><b>70.00</b></td> <td style="color:red"><b>60.50</b></td> <td>54.20</td> </tr>
<tr> <td>xverse-13B</td> <td>55.30</td> <td>32.50</td> <td>45.90</td> <td>66.70</td> <td>59.50</td> <td style="color:red"><b>57.60</b></td> </tr>
<tr> <td>Qwen-7B-Chat</td> <td>54.70</td> <td>35.40</td> <td>47.90</td> <td>68.30</td> <td>58.70</td> <td>50.00</td> </tr>
<tr> <td>Baichuan-13B-Base</td> <td>53.70</td> <td style="color:red"><b>35.60</b></td> <td>46.80</td> <td>65.80</td> <td>58.00</td> <td>50.80</td> </tr>
<tr> <td>Baichuan2-7B-Chat</td> <td>52.50</td> <td>33.80</td> <td>45.70</td> <td>64.20</td> <td>56.60</td> <td>50.20</td> </tr>
<tr> <td>ChatGLM2-6B</td> <td>51.20</td> <td>33.40</td> <td>46.90</td> <td>63.00</td> <td>51.60</td> <td>47.70</td> </tr>
<tr> <td>Baichuan-13B-Chat</td> <td>47.90</td> <td>31.50</td> <td>41.40</td> <td>56.80</td> <td>53.00</td> <td>46.50</td> </tr>
<tr> <td>Baichuan-7B</td> <td>44.20</td> <td>31.70</td> <td>39.20</td> <td>53.30</td> <td>47.30</td> <td>41.90</td> </tr>
<tr> <td>Ziya-LLaMA-13B-v1.1</td> <td>40.10</td> <td>30.30</td> <td>35.80</td> <td>47.30</td> <td>42.80</td> <td>38.50</td> </tr>
<tr> <td>ChatGLM1.1-6B</td> <td>38.10</td> <td>28.60</td> <td>33.60</td> <td>46.70</td> <td>40.90</td> <td>35.70</td> </tr>
<tr> <td>AtomGPT-13B-56k</td> <td>37.60</td> <td>25.30</td> <td>32.00</td> <td>44.70</td> <td>42.80</td> <td>36.10</td> </tr>
<tr> <td>LLaMA2-13B-chat</td> <td>37.10</td> <td>29.30</td> <td>34.60</td> <td>43.60</td> <td>35.90</td> <td>37.00</td> </tr>
<tr> <td>ChatGLM-6B</td> <td>36.30</td> <td>27.20</td> <td>32.90</td> <td>42.80</td> <td>38.10</td> <td>34.90</td> </tr>
<tr> <td>LLaMA-30B</td> <td>35.90</td> <td>29.90</td> <td>34.40</td> <td>42.40</td> <td>33.30</td> <td>35.60</td> </tr>
<tr> <td>LLaMA2-7B-chat</td> <td>33.50</td> <td>27.30</td> <td>31.60</td> <td>38.10</td> <td>33.80</td> <td>32.70</td> </tr>
<tr> <td>Ziya-LLaMA-13B-Pretrain-v1</td> <td>31.10</td> <td>22.20</td> <td>27.40</td> <td>36.50</td> <td>33.80</td> <td>30.40</td> </tr>
<tr> <td>LLaMA-13B</td> <td>29.80</td> <td>24.20</td> <td>28.40</td> <td>33.70</td> <td>29.60</td> <td>29.00</td> </tr>
<tr> <td>LLaMA-7B</td> <td>26.80</td> <td>26.70</td> <td>26.20</td> <td>27.60</td> <td>25.70</td> <td>28.10</td> </tr>
</table>
</details>

<details>
<summary><b>MMLU 5-shot test set results</b></summary>
<table>
<tr> <td>Model</td> <td>Avg</td> <td>STEM</td>