
GdGPT

Train LLMs (bloom, llama, baichuan2-7b, chatglm3-6b) with deepspeed in pipeline mode. Faster than zero/zero++/fsdp.


Train LLMs with deepspeed in pipeline mode

This repo provides a codebase based on deepspeed pipeline mode, with which you can pretrain or fine-tune LLMs faster and more memory-efficiently than in zero mode.

Currently supported models are: bloom, llama, baichuan2-7b, chatglm3-6b, mixtral-8x7b.

The following benchmark was run on 8 A100 (SXM-40G) GPUs with llamaV1-7b, using settings of micro_batch_size=1, global_batch_size=128, fp16=True. Speed is measured in samples/s over 20 global steps.

If your GPU memory is sufficient, you can try micro_batch_size=2; this sometimes speeds up training further, provided your global_batch_size is large enough.
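Under standard DeepSpeed batch semantics, global batch size = micro batch size × gradient-accumulation steps × data-parallel degree. A small hypothetical helper (not part of the repo) makes the arithmetic behind the settings above explicit; note that when all 8 GPUs serve as pipeline stages, the data-parallel degree is 1:

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp_world_size: int) -> int:
    """DeepSpeed semantics: global = micro * grad_accum * data-parallel degree."""
    per_step = micro_batch_size * dp_world_size
    assert global_batch_size % per_step == 0, "global batch must divide evenly"
    return global_batch_size // per_step

# Benchmark settings above: micro_batch_size=1, global_batch_size=128,
# all 8 GPUs used as pipeline stages, so data-parallel degree is 1:
print(grad_accum_steps(128, 1, 1))  # 128 micro-batches accumulated per global step
```

Raising micro_batch_size to 2 halves the number of accumulated micro-batches per global step, which is why a large global_batch_size is needed for it to pay off.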

| max_seq_len | 256 | 384 | 512 | 768 | 1024 | 1280 | 1536 | 2048 | 3072 | 4096 |
|---|---|---|---|---|---|---|---|---|---|---|
| zero3 (aka fsdp) | 15.76 | 13.37 | 13.34 | 12.67 | oom | oom | oom | oom | oom | oom |
| zero3++ | 13.10 | 12.88 | 12.30 | oom | oom | oom | oom | oom | oom | oom |
| pipeline | 56.85 | 49.43 | 43.16 | 32.84 | 24.47 | 19.77 | 16.18 | oom | oom | oom |
| pipeline (flash-attn) | 45.79 | 45.06 | 41.09 | 34.14 | 26.29 | 23.38 | 19.48 | 15.00 | 12.54 | 7.75 |

We can see that zero++ is slower than zero on my platform. That is likely because I train on a single node, which cannot make good use of zero++'s cross-node communication optimizations. Besides, the speed of zero/zero++ decreases only slowly as the training sequence length grows. This suggests that zero/zero++ is limited by its communication bottleneck even as longer sequences add more computation; in other words, the compute capability of the GPUs is not fully utilized due to the limitation of communication.

If you would like to try zero/zero++ yourself, you can run this script (not recommended, since pipeline is faster):

    $ deepspeed train_ds_zero.py --config configs/ds_config_zero.yml
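The schema of the repo's configs/ds_config_zero.yml is its own, but the DeepSpeed options it ultimately drives look roughly like the following in DeepSpeed's documented JSON config format. This is a hypothetical minimal sketch matching the benchmark settings, not the repo's actual file; the zero_quantized_weights/zero_quantized_gradients flags are the documented ZeRO++ toggles:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "train_batch_size": 128,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "zero_quantized_weights": true,
    "zero_quantized_gradients": true
  }
}
```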

Environment

  • AMD EPYC 7742 64-Core Processor
  • 512G cpu memory
  • A100 (SXM-40G) x 8
  • ubuntu 18.04
  • python 3.8.12
  • driver 520.61.05
  • cuda11.8 + cudnn8
  • deepspeed==0.11.1
  • torch==2.1.0
  • sentencepiece
  • transformers==4.36.2
  • protobuf==3.20.0 (python pip install)
  • flash_attn==2.0.2

Pipeline Training

1. Prepare dataset

The training samples should be in json format as follows:

[
    // samples used for pretraining  
    { 
        "type": "pretrain",
        "text": "Cai Xukun (born August 2, 1998), better known by the mononym Kun (stylized as KUN), is a Chinese singer-songwriter, dancer and rapper. He debuted as a member of SWIN and its sub-unit SWIN-S on October 18, 2016, after participating in the first and second seasons of the Chinese reality show Super Idol.[1] After leaving the group and its company Yihai Entertainment, he participated in iQiyi's reality survival show Idol Producer, finishing first and debuting as the leader/center of temporary Chinese boy group Nine Percent, on April 6, 2018.[2][3] He was a cast member of variety show Keep Running from 2020 to 2022."
    },

    // samples used for instruct tuning; the "input" field must not be empty
    {
        "type": "instruct",
        "instruct": "Fill out the blank in the following sentence",
        "input": "Cai Xukun loves singing, dancing, rapping and ______",
        "output": "playing basketball"
    },
    // if you do not have an "input" field, you can remove it
    {
        "type": "instruct",
        "instruct": "Write a peom associated with rain.",
        "output": "Rain is falling all around, \nIt falls on field and tree,  \nIt rains on the umbrella here, \nAnd on the ships at sea. "
    },

    // samples used for multi-round conversation
    {
        "type": "conversation",
        "rounds": [
            ["ask", "Hello"],
            ["ans", "Hello, what can I do for you ?"],
            ["ask", "Tell me what day it is today."],
            ["ans", "Today is Wednesday."],
            ["ask", "Who is caixukun?"],
            ["ans", "caixukun is a Chinese idol, who loves singing, dancing, rapping and playing basketball"],
            ["ask", "When was caixukun born?"],
            ["ans", "In the year of 1998."]
        ]
    },

    // samples used for multi-round conversation with api ability
    {
        "type": "conver_has_api",

        // this field gives a brief doc of api
        "api_desc": "getVerse: Retrieve the text of a specific verse from the XiaoHuangShu.\nParameters: {\"book\": \"Required. string. The name of the book.\", \"chapter\": \"Required. integer. The chapter number.\", \"verse\": \"Required. integer. The verse number.\"}\nOutput: Returns a JSON object containing the text of the requested verse.\n - Format: application/json\n - Structure: Object{text}\nsearch: Search the XiaoHuangShu for specific keywords or phrases.\nParameters: {\"query\": \"Required. string. The keyword or phrase to search for.\", \"version\": \"string. The XiaoHuangShu version to search in.\"}\nOutput: Returns a JSON object containing an array of search results, each containing the book, chapter, and verse where the keyword or phrase was found, as well as the text of the verse.\n - Format: application/json\n - Structure: Array[Object{book, chapter, verse, text}]\ngetVersions: Retrieve metadata for specific XiaoHuangShu versions.\nParameters: {\"language\": \"string. The language of the XiaoHuangShu version.\", \"publisher\": \"string. The publisher of the XiaoHuangShu version.\"}\nOutput: Returns a JSON object containing an array of XiaoHuangShu versions that match the specified criteria, each containing the name of the version, the language used, the publication date, and the publisher.\n - Format: application/json\n - Structure: Array[Object{name, language, publication_date, publisher}]\n",

        "rounds": [
            ["ask", "Hello"],
            ["ans", "Hi, what can I do for you ?"],
            ["ask", "can you search the Bible for the phrase \"love your neighbor\"? Please include the book, chapter, and verse where it's found."],
            ["ans-api", {
                "actions": [
                    {
                        "inner_thought": "I need to use the search tool to find the phrase in the Bible.",
                        "api_name": "search",
                        "api_param": "{\"query\": \"XiaoHuangShu\", \"version\": \"King James Version\"}",
                        "api_res": "Status Code: 200. Response: {\"search_results\":[{\"book\":\"Mark\",\"chapter\":12,\"verse\":31,\"text\":\"And the second is like, namely this, Thou shalt love thy neighbour as thyself. There is none other commandment greater than these.\"},{\"book\":\"Matthew\",\"chapter\":22,\"verse\":39,\"text\":\"And the second is like unto it, Thou shalt love thy neighbour as thyself.\"},{\"book\":\"Luke\",\"chapter\":10,\"verse\":27,\"text\":\"And he answering said, Thou shalt love the Lord thy God with all thy heart, and with all thy soul, and with all thy strength, and with all thy mind; and thy neighbour as thyself.\"}]}"
                    },
                    {
                        "inner_thought": "Let me search again with another key word",
                        "api_name": "search",
                        "api_param": "{\"query\": \"GuoChanQu\", \"version\": \"King James Version\"}",
                        "api_res": "Status Code: 200. Response: {\"search
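The dataset example above is truncated, but the four sample types it illustrates can be schema-checked with a short sketch. Field names are taken from the examples; the repo's actual loader may differ:

```python
# Minimal schema check for the four sample types shown above
# ("pretrain", "instruct", "conversation", "conver_has_api").

def validate_sample(sample: dict) -> bool:
    t = sample.get("type")
    if t == "pretrain":
        return isinstance(sample.get("text"), str) and bool(sample["text"])
    if t == "instruct":
        if not (sample.get("instruct") and sample.get("output")):
            return False
        # "input" is optional, but must not be empty when present
        return "input" not in sample or bool(sample["input"])
    if t in ("conversation", "conver_has_api"):
        if t == "conver_has_api" and not sample.get("api_desc"):
            return False
        rounds = sample.get("rounds")
        if not isinstance(rounds, list) or not rounds:
            return False
        roles = {"ask", "ans"} | ({"ans-api"} if t == "conver_has_api" else set())
        return all(len(r) == 2 and r[0] in roles for r in rounds)
    return False

print(validate_sample({"type": "pretrain", "text": "hello"}))       # True
print(validate_sample({"type": "instruct", "instruct": "x",
                       "input": "", "output": "y"}))                # False
```

Note that the explanatory // comments in the example above are not valid JSON, so a real dataset file should omit them before being parsed with json.load.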