Occult
[ICML'25] Official code for paper "Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference"
Overview
We present Occult, an algorithm-system co-design solution for communication-efficient expert parallelism.
- We merge the token replicas transmitted to the same GPU into a single copy to reduce all-to-all communication volume, together with a refactored matrix-multiplication kernel tailored to this communication strategy that removes the unnecessary memory footprint.
- We reschedule the expert placement in expert parallelism using a profiling dataset, clustering frequently co-activated experts onto the same GPU to further improve the efficiency of all-to-all communication.
- Occult can be integrated into both training and inference of MoE-based LLMs to achieve wall-clock speedup under heavy workloads.
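The first point above, merging duplicated token replicas, can be illustrated with a small counting sketch. The function name and the toy routing below are our own illustration, not code from this repository: with top-k routing, a token that activates several experts hosted on the same GPU only needs to be transmitted to that GPU once.

```python
from collections import defaultdict

def merge_token_replicas(token_ids, expert_ids, expert_to_gpu):
    """Count dispatched replicas before and after merging: each token is
    sent to a destination GPU at most once, even if it activates
    several experts hosted on that GPU."""
    # naive dispatch: one replica per (token, expert) routing pair
    naive = [(t, expert_to_gpu[e]) for t, e in zip(token_ids, expert_ids)]
    # merged dispatch: deduplicate (token, destination GPU) pairs
    per_gpu = defaultdict(set)
    for t, g in naive:
        per_gpu[g].add(t)
    merged = sum(len(tokens) for tokens in per_gpu.values())
    return len(naive), merged

# 3 tokens with top-2 routing over 4 experts placed on 2 GPUs (2 per GPU)
expert_to_gpu = {0: 0, 1: 0, 2: 1, 3: 1}
token_ids = [0, 0, 1, 1, 2, 2]    # each token appears top_k = 2 times
expert_ids = [0, 1, 2, 3, 0, 2]   # tokens 0 and 1 hit two experts on one GPU
naive, merged = merge_token_replicas(token_ids, expert_ids, expert_to_gpu)
# naive dispatch sends 6 replicas; merging collapses the duplicates to 4
```

The larger top-k is relative to the number of GPUs, the more duplicates exist to merge, which is why the rescheduled expert placement (second point) compounds the saving.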
Experiments
We examine expert-parallelized training with 8- and 16-way expert parallelism using Occult, along with evaluations on downstream tasks to validate the effectiveness of collaboration pruning.
8-way expert parallelism (1 node)
Devices: 8 x NVIDIA A6000 Ada
Latency Analysis
<img src="https://github.com/UNITES-Lab/Occult/blob/main/figures/train_deepseek_8_way.svg" alt="Training Latency Analysis for DeepSeek-MoE with 8-way Expert Parallelism" width="600"/>Caption: We demonstrate the training efficiency of Occult on DeepSeek-MoE (16B) with 8-way expert parallelism on a single node, compared with conventional expert parallelism in MegaBlocks. The label "Occult (Pruning, $m$ GPUs)" denotes the $N_d$ value, i.e., pruning the expert collaboration for each token so that an individual token only activates experts within $m$ GPUs. We examine the training efficiency of MegaBlocks with both block-sparse matrix multiplication and grouped GEMM.
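As a rough illustration of the pruning described in the caption, the router-based variant can be viewed as a constrained top-k selection: experts are picked in descending router-score order, but a token may touch at most $N_d$ distinct GPUs. The function, scores, and placement below are hypothetical illustrations, not the paper's implementation:

```python
def prune_collaboration(scores, expert_to_gpu, top_k, max_gpus):
    """Router-based pruning sketch: select experts in descending
    router-score order, skipping any expert whose GPU would raise the
    token's distinct-GPU count above max_gpus (the N_d value)."""
    order = sorted(range(len(scores)), key=lambda e: -scores[e])
    chosen, gpus = [], set()
    for e in order:
        if len(chosen) == top_k:
            break
        g = expert_to_gpu[e]
        if g in gpus or len(gpus) < max_gpus:
            chosen.append(e)
            gpus.add(g)
    return sorted(chosen)

# 6 experts spread over 3 GPUs, top-3 routing for one token
expert_to_gpu = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
scores = [0.30, 0.05, 0.25, 0.10, 0.20, 0.10]
# unpruned top-3 is experts {0, 2, 4}, touching all 3 GPUs; with
# max_gpus = 2 the third pick falls back to expert 3 on an already-used GPU
```

Capping the distinct-GPU count per token is what bounds each token's contribution to all-to-all traffic, at the cost of occasionally substituting a lower-scored expert.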
Performance Analysis
<table border="0">
  <tr> <th colspan="9" style="text-align: center;">DeepSeek-MoE Evaluation</th> </tr>
  <tr> <td>Task</td> <td>Strategy</td> <td>No Tuning</td> <td>Pruning within 1 GPU</td> <td>Pruning within 2 GPUs</td> <td>Pruning within 3 GPUs</td> <td>Pruning within 4 GPUs</td> <td>Pruning within 5 GPUs</td> <td>No Pruning</td> </tr>
  <tr> <td rowspan="2">MMLU</td> <td>Router-based</td> <td rowspan="2">37.95</td> <td>35.04</td> <td>40.41</td> <td>41.34</td> <td>41.43</td> <td>41.19</td> <td rowspan="2">38.66</td> </tr>
  <tr> <td>Similarity-based</td> <td>33.68</td> <td>39.80</td> <td><b>41.74</b></td> <td>41.40</td> <td>41.48</td> </tr>
  <tr> <td rowspan="2">OpenBookQA</td> <td>Router-based</td> <td rowspan="2">32.20</td> <td>33.8</td> <td>36.2</td> <td>37.2</td> <td><b>37.8</b></td> <td>37.2</td> <td rowspan="2">34.20</td> </tr>
  <tr> <td>Similarity-based</td> <td>33.4</td> <td>36.4</td> <td>36.8</td> <td><b>37.8</b></td> <td>37.2</td> </tr>
  <tr> <td rowspan="2">MathQA</td> <td>Router-based</td> <td rowspan="2">31.19</td> <td>32.93</td> <td>35.08</td> <td>34.97</td> <td>35.95</td> <td><b>36.08</b></td> <td rowspan="2">33.77</td> </tr>
  <tr> <td>Similarity-based</td> <td>33.17</td> <td>34.94</td> <td>35.51</td> <td>35.24</td> <td>35.61</td> </tr>
  <tr> <td rowspan="2">RACE</td> <td>Router-based</td> <td rowspan="2">38.85</td> <td>38.66</td> <td><b>40.38</b></td> <td>39.71</td> <td>39.71</td> <td>39.14</td> <td rowspan="2">40.10</td> </tr>
  <tr> <td>Similarity-based</td> <td>37.8</td> <td>38.85</td> <td>39.23</td> <td>39.71</td> <td>39.9</td> </tr>
  <tr> <td rowspan="2">SST-2</td> <td>Router-based</td> <td rowspan="2">64.68</td> <td>58.72</td> <td>64.22</td> <td>68.12</td> <td>72.36</td> <td>70.76</td> <td rowspan="2"><b>78.33</b></td> </tr>
  <tr> <td>Similarity-based</td> <td>61.7</td> <td>59.75</td> <td>70.64</td> <td>71.56</td> <td>70.53</td> </tr>
</table>Caption: We validate the effectiveness of the proposed collaboration pruning algorithms in Occult for 8-way expert parallelism with DeepSeek-MoE by evaluating on popular benchmarks including MMLU, OpenBookQA, MathQA, RACE, and SST-2. The conclusions are similar to 4-way expert parallelism, i.e., pruning within 2 GPUs obtains performance comparable to standard SFT with greatly improved training efficiency.
16-way expert parallelism (2 nodes)
Devices: 2 x 8 x NVIDIA A6000 Ada
Latency Analysis
<img src="https://github.com/UNITES-Lab/Occult/blob/main/figures/train_deepseek_16_way.svg" alt="Training Latency Analysis for DeepSeek-MoE with 16-way Expert Parallelism" width="600"/>Caption: We demonstrate the training efficiency of Occult on DeepSeek-MoE (16B) with 16-way expert parallelism on 2 nodes, compared with conventional expert parallelism in MegaBlocks. In this case, the 64 dynamically-routed experts are scattered across 16 GPUs, i.e., 4 experts on each GPU, which is consistent with DeepSeek-V3 (256 dynamically-routed experts and 64-way expert parallelism).
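With experts scattered across many GPUs, the placement itself matters: the overview notes that Occult reschedules expert placement from a profiling dataset so that frequently co-activated experts land on the same GPU. A greedy sketch of that idea is shown below; the co-activation matrix and the helper are our own illustrative assumptions, not the paper's actual rescheduling algorithm:

```python
def place_experts(coact, num_gpus, experts_per_gpu):
    """Greedy placement sketch: seed each GPU with the unplaced expert
    that co-activates the most overall, then fill the GPU with the
    experts most co-activated with those already placed on it."""
    unplaced = set(range(len(coact)))
    placement = []
    for _ in range(num_gpus):
        # seed with the unplaced expert with the largest total co-activation
        seed = max(unplaced, key=lambda e: sum(coact[e]))
        group = [seed]
        unplaced.discard(seed)
        while len(group) < experts_per_gpu:
            # add the unplaced expert most co-activated with the group
            nxt = max(unplaced, key=lambda e: sum(coact[e][g] for g in group))
            group.append(nxt)
            unplaced.discard(nxt)
        placement.append(sorted(group))
    return placement

# toy profiled co-activation counts for 4 experts on 2 GPUs (2 per GPU):
# experts 0 and 3 frequently fire together, as do experts 1 and 2
coact = [
    [0, 1, 2, 9],
    [1, 0, 8, 1],
    [2, 8, 0, 1],
    [9, 1, 1, 0],
]
# greedy placement groups 0 with 3 and 1 with 2, so that tokens activating
# a co-activated pair reach both experts with a single transmitted replica
```

Co-locating co-activated experts increases the chance that a token's pruned expert set fits within few GPUs, which is what makes the replica merging above effective.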
Performance Analysis
<table border="0">
  <tr> <th colspan="8" style="text-align: center;">DeepSeek-MoE Evaluation</th> </tr>
  <tr> <td>Task</td> <td>Strategy</td> <td>No Tuning</td> <td>Pruning within 2 GPUs</td> <td>Pruning within 3 GPUs</td> <td>Pruning within 4 GPUs</td> <td>Pruning within 5 GPUs</td> <td>No Pruning</td> </tr>
  <tr> <td rowspan="2">MMLU</td> <td>Router-based</td> <td rowspan="2">37.95</td> <td>39.69</td> <td>40.37</td> <td>41.23</td> <td><b>41.62</b></td> <td rowspan="2">38.66</td> </tr>
  <tr> <td>Similarity-based</td> <td>39.23</td> <td>40.25</td> <td>41.31</td> <td>41.61</td> </tr>
  <tr> <td rowspan="2">OpenBookQA</td> <td>Router-based</td> <td rowspan="2">32.20</td> <td>36.2</td> <td>36.8</td> <td>37.6</td> <td>37.2</td> <td rowspan="2">34.20</td> </tr>
  <tr> <td>Similarity-based</td> <td>36.2</td> <td>36.4</td> <td>37.8</td> <td><b>38.6</b></td> </tr>
  <tr> <td rowspan="2">MathQA</td> <td>Router-based</td> <td rowspan="2">31.19</td> <td>35.61</td> <td>35.14</td> <td>35.21</td> <td><b>35.78</b></td> <td rowspan="2">33.77</td> </tr>
  <tr> <td>Similarity-based</td> <td>34.84</td> <td>35.21</td> <td>35.68</td> <td>35.71</td> </tr>
  <tr> <td rowspan="2">RACE</td> <td>Router-based</td> <td rowspan="2">38.85</td> <td>38.66</td> <td>39.04</td> <td>39.9</td> <td>38.95</td> <td rowspan="2"><b>40.10</b></td> </tr>
  <tr> <td>Similarity-based</td> <td>38.85</td> <td>39.04</td> <td>39.43</td> <td>39.43</td> </tr>
  <tr> <td rowspan="2">SST-2</td> <td>Router-based</td> <td rowspan="2">64.68</td> <td>71.9</td> <td>66.74</td> <td>75</td> <td>70.64</td> <td rowspan="2"><b>78.33</b></td> </tr>
  <tr> <td>Similarity-based</td> <td>56.77</td> <td>75.23</td> <td>73.97</td> <td>70.64</td> </tr>
</table>Caption: We validate the effectiveness of the proposed collaboration pruning algorithms in Occult for 16-way expert parallelism with DeepSeek-MoE by evaluating on popular benchmarks including MMLU, OpenBookQA, MathQA, RACE, and SST-2. The conclusions are similar to 4-way expert parallelism, i.e., pruning within 2 GPUs obtains performance comparable to standard SFT with greatly improved training efficiency.
