Mist Artifact for EuroSys 25

In this repository, we provide the artifact for the paper Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization 🚀.

Mist is an automatic system for configuring distributed large language model (LLM) training. It co-optimizes parallelism techniques (data, tensor, and pipeline parallelism) together with memory-footprint reduction strategies (activation checkpointing, redundancy elimination, and offloading).

Key Features:

  • 🚀 Optimized Performance: Achieves up to 2.04× speedup compared to state-of-the-art automatic systems.
  • ⚡ Smart Parallelism & Memory Optimization: Dynamically balances memory usage and compute efficiency.
  • 🔍 Symbolic Performance Analysis: Rapidly explores optimization configurations using symbolic expressions.
  • 🔄 Overlap-Centric Scheduling: Maximizes computation-communication overlap for efficient GPU utilization.

Non Goals ⚠️:

  • Production: Mist is a research prototype built on PyTorch to explore distributed training optimizations. Certain production features, such as dynamic gradient scaling, gradient clipping, and training monitoring, are intentionally omitted. For production use, we recommend applying Mist's optimized strategies in Megatron-LM or DeepSpeed. For a fair performance comparison, we also disabled these features in the baselines.
  • Numerical Stability: Although we have done our best to ensure correct execution and have tested correctness on several base cases, numerical instabilities may still arise from complex overlap scheduling and data races in complicated configurations. We are happy to fix such cases as they are reported.

Prerequisite (Skip for AE Reviewers)

We recommend using Docker Engine to build the artifact so that all software dependencies are fully controlled. Please follow the instructions to install Docker Engine and the NVIDIA Container Toolkit first. Note that if the current user is not in the docker group, all of the following docker commands require root privileges (i.e., sudo).

For convenience, we also provide the installation script below (extracted from the official guides):

curl https://get.docker.com | sh && sudo systemctl --now enable docker

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Step 1: Set-up

Clone the repository and build the Docker container. NOTE: if your GPUs are not L4 GPUs (sm_89), you may have to change the TORCH_CUDA_ARCH_LIST environment variable in the Dockerfile. You can find details here.

git clone https://github.com/Dazz993/Mist.git

cd Mist
docker build -t mist -f Dockerfile .
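If you are unsure which arch value to put in TORCH_CUDA_ARCH_LIST, the following is a minimal, illustrative lookup helper (not part of Mist; the arch strings follow NVIDIA's published compute capabilities, e.g. L4 is sm_89):

```python
# Hypothetical helper (not part of the artifact): map common data-center GPUs
# to the TORCH_CUDA_ARCH_LIST value expected by the Dockerfile.
GPU_ARCH = {
    "V100": "7.0",
    "T4": "7.5",
    "A100": "8.0",
    "L4": "8.9",   # the GPUs the artifact was tuned on
    "H100": "9.0",
}

def torch_cuda_arch_list(gpu_name: str) -> str:
    """Return the TORCH_CUDA_ARCH_LIST string for a known GPU name."""
    try:
        return GPU_ARCH[gpu_name]
    except KeyError:
        raise ValueError(
            f"Unknown GPU {gpu_name!r}; check nvidia-smi and NVIDIA's "
            "compute-capability table"
        ) from None

print(torch_cuda_arch_list("L4"))  # -> 8.9
```

Check the name reported by nvidia-smi against this table before editing the Dockerfile.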

Step 2: Kick-the-Tires (Functionality Test, Est. Time: 10 mins)

Step 2.1: Run the docker container

docker run --gpus all -it --rm --privileged --ipc=host --shm-size=20G --ulimit memlock=-1 --name "mist" -v $(pwd):/workspace/ mist

Step 2.2: Set up GPU frequencies

To get consistent and stable results, especially on machines such as L4, fix the GPU frequency:

# Check supported frequencies
nvidia-smi -q -d SUPPORTED_CLOCKS
# Set the frequencies (e.g. for L4 GPUs)
nvidia-smi -ac 6251,1050

Step 2.3: Analyze the small case

Mist can efficiently analyze model execution time (including the breakdown for pipeline parallelism) and memory usage. We use the test-small-base config as an example: a GPT-1.3B model running on 2 GPUs with BSZ=8, DP=2, and FlashAttention disabled. This is the best configuration that Megatron-LM can achieve. The corresponding YAML file is /benchmark/mist/configs/test-small-base.yaml; it contains network and memory parameters for GCP L4 GPUs. On other GPUs, this setup can still serve as a functionality test.

cd /workspace/benchmark/mist/analysis/
python run.py --config-name test-small-base

Expected results:

# ... Breakdowns ...
# ..................
Total latency: 10.659405381925026
Peak fwd memory: [19255.25]
Peak bwd memory: [19503.125005722046]
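As a quick sanity check (a sketch, not part of the artifact), the analyzed peak memory should fit within an L4's 24 GB, using the numbers reported above:

```python
# Sanity-check sketch: verify the analyzed peak backward memory fits on a
# 24 GB L4 GPU. Values copied from the expected output above.
peak_bwd_mb = 19503.125005722046
l4_capacity_mb = 24 * 1024  # 24 GB expressed in MiB

headroom_mb = l4_capacity_mb - peak_bwd_mb
print(f"headroom: {headroom_mb:.1f} MB")  # roughly 5 GB of headroom
assert peak_bwd_mb < l4_capacity_mb
```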

Step 2.4: Execute the small case

Mist can also execute a given configuration directly and efficiently.

cd /workspace/benchmark/mist/exec/
torchrun --nproc-per-node 2 benchmark_one_case.py --config-name test-small-base

Expected results:

[Total]   Median: 11.3290 s, Mean: 11.3290 s, Std: 0.05885070
Total Latency: 11.3290
[Stage Peak]      Allocated memories: [19273.83] MB
[Stage Peak]      Reserved  memories: [20862.00, 21096.00] MB

Mist provides highly accurate memory estimation, enabling reliable resource planning. Performance estimates may deviate slightly, however, since Mist focuses on comparing performance across configurations rather than predicting absolute runtime. Constant overheads, such as optimizer step time, are omitted because they are identical across configurations. We will cover that later.
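To put a number on that deviation, here is a small sketch (not part of the artifact) comparing the analyzed latency from Step 2.3 with the measured latency from Step 2.4:

```python
# Sketch: quantify the gap between Mist's analyzed latency (Step 2.3) and the
# measured latency (Step 2.4), using the numbers reported above.
analyzed_s = 10.659405381925026  # from the analysis output
measured_s = 11.3290             # median measured latency

rel_error = (measured_s - analyzed_s) / measured_s
print(f"relative latency deviation: {rel_error:.1%}")  # about 6%
```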

Step 2.5: Tune the small case

Then let's use Mist to tune the best configuration:

cd /workspace/benchmark/mist/tune/
python tune_one_case.py --config-name test-small-base +output_path=/workspace/benchmark/mist/tune/results/test-small-mist

Expected results:

Best cost: 9.26892465
Best solution: [16,
 [(((0, 11), (1, 1), 5, 0), (2, 1, 1, 0, 0, 1, 0.0, 0.0, 0.0, 0.0)),
  (((12, 23), (1, 1), 0, 0), (2, 1, 1, 0, 0, 1, 0.0, 0.0, 0.0, 0.0))]
]
Saved the best solution to /workspace/benchmark/mist/tune/results/test-small-mist.yaml

The outputs can be interpreted as:

Gradient Accumulation Steps: 16. Two pipeline stages.
- ----------------------------------------------------
- (0, 11): (layer_idx_start, layer_idx_end)
- (1, 1) : (nnodes, nprocs_per_node)
- 5: number of checkpointed layers in a single stage
- ----------------------------------------------------
- (2, 1, 1): (Batch size, DP, TP)
- (0, 0, 1): (WeightsSharding, GradsSharding, OptSharding)
- (0.0, 0.0, 0.0, 0.0): (W, G, O, A). where they map to weights, grads, 
                        optimizer states, and activation offloading ratio.
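The legend above can be turned into a small illustrative parser (not part of Mist's codebase; the field names are assumptions drawn from the legend, and the fourth element of each layout tuple is left uninterpreted):

```python
# Illustrative parser for the tuned solution printed in Step 2.5.
# Field names follow the interpretation legend above.
best_solution = [16,
 [(((0, 11), (1, 1), 5, 0), (2, 1, 1, 0, 0, 1, 0.0, 0.0, 0.0, 0.0)),
  (((12, 23), (1, 1), 0, 0), (2, 1, 1, 0, 0, 1, 0.0, 0.0, 0.0, 0.0))]]

grad_accum_steps, stages = best_solution
parsed = []
for layout, strategy in stages:
    layers = layout[0]                       # (layer_idx_start, layer_idx_end)
    nnodes, nprocs_per_node = layout[1]      # (nnodes, nprocs_per_node)
    n_ckpt_layers = layout[2]                # checkpointed layers in the stage
    bsz, dp, tp = strategy[0:3]              # (Batch size, DP, TP)
    w_shard, g_shard, o_shard = strategy[3:6]
    offload = strategy[6:10]                 # (W, G, O, A) offloading ratios
    parsed.append({"layers": layers, "procs": nnodes * nprocs_per_node,
                   "ckpt_layers": n_ckpt_layers, "bsz": bsz, "dp": dp,
                   "tp": tp, "offload": offload})
    print(parsed[-1])
```

Running this prints one dict per pipeline stage, e.g. layers 0-11 with 5 checkpointed layers for the first stage.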

Execute the tuned configurations:

cd /workspace/benchmark/mist/exec/
torchrun --nproc-per-node 2 \
    benchmark_one_case.py \
    --config-path /workspace/benchmark/mist/tune/results/ \
    --config-name test-small-mist

Expected results:

Total Latency: 9.9345

The speedup is therefore roughly 14%. This is the data point in Figure 11 (a)-1.
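The speedup figure follows directly from the two measured latencies (a sketch of the arithmetic, using the numbers above):

```python
# Sketch: the ~14% speedup is the ratio of the baseline latency (Step 2.4)
# to the tuned latency above.
baseline_s = 11.3290  # Megatron-LM's best config, measured in Step 2.4
tuned_s = 9.9345      # Mist's tuned config, measured above

speedup = baseline_s / tuned_s
print(f"speedup: {speedup:.3f}x ({speedup - 1:.0%} faster)")  # ~1.14x
```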

Step 3: Run Single-Node Performance Evaluation [Specifically for GCP L4 GPUs] (Est. Time: 3.5 hours)

For L4 GPUs, we directly provide pre-tuned configurations that can be used to quickly test Mist's speedup over the baselines. For evaluating on a brand-new cluster, see the general process in the next section.

Step 3.1: Evaluate Mist

We provide the best configurations found on our L4 clusters under /workspace/benchmark/mist/tuned_configs/.

cd /workspace/benchmark/mist/tuned_configs/
bash run_single_node.sh

Results are summarized in /workspace/benchmark/mist/tuned_configs/l4-24gb/gpt/summary.json and in the corresponding llama file.

Step 3.2: Evaluate Megatron-LM

Next, we evaluate the performance of Megatron-LM. Its best configurations were found manually by us and largely match the results of searching our baseline search space.

cd /workspace/benchmark/megatron/
bash scripts/tops/l4/gpt2/1_8xl4_node_1_pcie.sh
bash scripts/tops/l4/llama/1_8xl4_node_1_pcie.sh

Results are under /workspace/benchmark/megatron/results.

Step 3.3: Evaluate DeepSpeed

Similarly, we evaluate the performance of DeepSpeed.

cd /workspace/benchmark/deepspeed/
bash scripts/tops/l4/gpt2/1_8xl4_node_1_pcie.sh
bash scripts/tops/l4/llama/1_8xl4_node_1_pcie.sh

Results are under /workspace/benchmark/deepspeed/results.

Step 3.4: Collect Results

We provide a Python script to collect the results for easy comparison.

cd /workspace/benchmark/
python scripts/collect_single_node_results_v1.py

Expected results (for clarity, we omit absolute numbers):

+----------------------+-----------------------+------------------------+
| SpeedUp              | SpeedUp vs Megatron   | SpeedUp vs DeepSpeed   |
+======================+=======================+========================+
| gpt2-1.3b-flash_True | 1.279X                | 1.473X                 |
+----------------------+-----------------------+------------------------+
| gpt2-2.7b-flash_True | 1.193X                | 1.488X                 |
+----------------------+-----------------------+------------------------+
| gpt2-7b-flash_True   | 1.191X                | 1.709X                 |
+----------------------+-----------------------+------------------------+
+-----------------------+-----------------------+------------------------+
| SpeedUp               | SpeedUp vs Megatron   | SpeedUp vs DeepSpeed   |
+=======================+=======================+========================+
| llama-1.3b-flash_True | 1.557X                | 1.498X                 |
+-----------------------+----------------
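To summarize the per-model speedups in a single number per baseline, a geometric mean is a common choice. The sketch below (not part of the collection script; the aggregation choice is our assumption) uses the GPT-2 speedups from the table above:

```python
# Sketch: aggregate the per-model speedups vs Megatron-LM from the table
# above into a single geometric-mean figure.
import math

speedups_vs_megatron = [1.279, 1.193, 1.191]  # gpt2 1.3b / 2.7b / 7b, flash=True

def geomean(xs):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(f"geomean speedup vs Megatron-LM: {geomean(speedups_vs_megatron):.3f}x")
```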
