# MegaDLMs
GPU-optimized framework for training diffusion language models at any scale. The backend of Quokka, Super Data Learners, and OpenMoE 2 training.
<p align="center" width="100%">
  <img src="images/vs_other_backend.png" width="80%" height="100%">
</p>
<p align="center" width="100%">
  <img src="images/weak_scaling.png" width="80%" height="100%">
</p>

## Highlights
- **Comprehensive Training Pipelines**: Full support for Diffusion Language Models (DLMs) and Autoregressive LMs, from pre-training and SFT to RL, on both dense and MoE architectures.
- **Ultra Speed and Scalability**: MegaDLMs reaches up to 47% Model FLOP Utilization (MFU) and up to 3× faster training compared with other frameworks (see here for benchmarking).
- **Cutting-edge Backend**: Leverages flexible parallelism from Megatron-LM and GPU-optimized Transformer layers with fused kernels and mixed-precision (FP8, FP16, BF16) support from Transformer Engine.
- **HuggingFace Integration**: Seamlessly work with HuggingFace checkpoints.
## Latest News
- **[2025-11-02]** We release MegaDLMs, the training backend for Quokka, Super Data Learners, and OpenMoE 2: an ultra-fast, scalable framework for DLM training at any scale. The MoE components will be merged once OpenMoE 2 training completes.
## Quick Start

### 1. Installation
We strongly recommend using the PyTorch NGC Container for optimal compatibility. The 24.11-py3 version (`nvcr.io/nvidia/pytorch:24.11-py3`) is recommended:

```shell
docker pull nvcr.io/nvidia/pytorch:24.11-py3
```

Or build your own image from a Dockerfile starting with:

```dockerfile
FROM nvcr.io/nvidia/pytorch:24.11-py3
# the remaining Dockerfile content
```
If external images are not supported on your cluster, follow the Complete Installation Guide, which covers Docker, pip variants (dev, lts, etc.), source installation, and system requirements. The specific dependencies are listed in `requirements.txt`.
### 2. Set Up Environment Variables

Set the environment variables as instructed in `envs/.env`.
## Project Structure
We also built an interactive doc with DeepWiki to help you better understand this repo.
```
mega-dlms/
├── megatron/
│   ├── core/                  # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/            # Transformer models
│   │   ├── transformer/       # Transformer building blocks
│   │   ├── tensor_parallel/   # Tensor parallelism
│   │   ├── pipeline_parallel/ # Pipeline parallelism
│   │   ├── distributed/       # Distributed training (FSDP, DDP)
│   │   ├── optimizer/         # Optimizers
│   │   ├── datasets/          # Dataset loaders
│   │   ├── inference/         # Inference engines
│   │   └── export/            # Model export (e.g. TensorRT-LLM)
│   ├── training/              # Training scripts
│   ├── inference/             # Inference server
│   ├── legacy/                # Legacy components
│   └── post_training/         # Post-training (RLHF, etc.)
├── examples/                  # Ready-to-use training examples
├── tools/                     # Utility tools
├── tests/                     # Comprehensive test suite
└── docs/                      # Documentation
```
<br>
## System Requirements

### Hardware Requirements

- FP8 support: NVIDIA Hopper, Ada, or Blackwell GPUs
- Recommended: NVIDIA Turing architecture or later

### Software Requirements

- CUDA/cuDNN/NCCL: latest stable versions
- PyTorch: latest stable version
- Transformer Engine: latest stable version
- Python: 3.12 recommended
## Training

### Data Preparation
MegaDLMs consumes a tokenized corpus, so you must tokenize your training/validation sets in advance and store them on disk. A tokenization script, `tools/preprocess_data.py`, is provided; it takes `.jsonl` files as input, as shown below.

For example, if you gather all your data into a single `.jsonl` file and run `tools/preprocess_data.py` once with `--output-prefix path/to/processed_data`, you will get two tokenized files: `path/to/processed_data.bin` and `path/to/processed_data.idx`, where the `.bin` file stores the token ids and the `.idx` file stores the positions of the sequences.

You can also split your corpus into N `.jsonl` files and tokenize them into N file pairs by running `tools/preprocess_data.py` N times. Search for `--data-path` or `--per-split-data-args-path` in `megatron/training/arguments.py` to learn how to use them in training.
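To make the `.bin`/`.idx` pairing concrete, here is a minimal sketch of the indexed-dataset idea: token ids in a flat binary file plus an offsets file, so any sequence can be sliced out without loading the whole corpus. This illustrates the concept only; the exact Megatron binary layout differs.

```python
# Illustrative sketch (NOT the exact Megatron .bin/.idx format):
# uint32 token ids in <prefix>.bin, cumulative uint64 offsets in <prefix>.idx.
import struct

def write_indexed(sequences, prefix):
    """Write token-id sequences to <prefix>.bin and offsets to <prefix>.idx."""
    offsets = [0]
    with open(prefix + ".bin", "wb") as f:
        for seq in sequences:
            f.write(struct.pack(f"<{len(seq)}I", *seq))
            offsets.append(offsets[-1] + len(seq))
    with open(prefix + ".idx", "wb") as f:
        f.write(struct.pack(f"<{len(offsets)}Q", *offsets))

def read_sequence(prefix, i):
    """Read back the i-th sequence by seeking directly to its offset."""
    with open(prefix + ".idx", "rb") as f:
        raw = f.read()
    offsets = struct.unpack(f"<{len(raw) // 8}Q", raw)
    start, end = offsets[i], offsets[i + 1]
    with open(prefix + ".bin", "rb") as f:
        f.seek(start * 4)  # 4 bytes per uint32 token id
        chunk = f.read((end - start) * 4)
    return list(struct.unpack(f"<{end - start}I", chunk))
```

This offset-based design is what makes tokenizing once up front worthwhile: training workers can random-access any sample in O(1) without re-tokenizing.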
#### JSONL Data Format

```json
{"text": "Your training text here..."}
{"text": "Another training sample..."}
```
#### Basic Preprocessing

```shell
python tools/preprocess_data.py \
    --input data.jsonl \
    --output-prefix path/to/processed_data \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model /path/to/tokenizer.model \
    --workers 8 \
    --append-eod
```
#### Key Arguments

- `--input`: Path to the input JSON/JSONL file
- `--output-prefix`: Prefix for the output binary files (`.bin` and `.idx`)
- `--tokenizer-type`: Tokenizer type (`HuggingFaceTokenizer`, `GPT2BPETokenizer`, etc.)
- `--tokenizer-model`: Path to the tokenizer model file
- `--workers`: Number of parallel workers for processing
- `--append-eod`: Add an end-of-document token
### Train from Scratch

We provide examples of the whole pipeline for training DLMs from scratch. You can find them under `examples/dlm_training`.

#### Pre-train

```shell
source envs/.env; bash examples/dlm_training/dlm_pretrain_1.7b.sh
```
Find all training arguments in `custom_args/difflm.py` and `megatron/training/arguments.py`.
#### Checkpoint Conversion

```shell
source envs/.env; bash examples/dlm_training/ckpt_conversion.sh
source envs/.env; bash examples/dlm_training/ckpt_conversion_validation.sh
```

#### Generate with Your Trained DLM

```shell
source envs/.env; python examples/dlm_generation/dlm_inference.py
```
### Train from a Pre-trained HuggingFace Checkpoint

We will provide an example soon. In the meantime, you can set it up with the changes below; the process is quite similar.

Convert the HuggingFace checkpoint to Megatron format with `tools/weights_conversion/hf_to_megatron_te.py`. Similarly, verify the precision with `tools/weights_conversion/utils/verify_correctness_dlm.py`.

For training, you additionally need to specify the `--load`, `--finetune`, and `--use-checkpoint-args` flags, as detailed in `megatron/training/arguments.py`.
## Parallelism Strategies

### Data Parallelism (DP)

#### Standard Data Parallel

```shell
# Standard DDP: replicate the model on each GPU
torchrun --nproc_per_node=8 pretrain_difflm.py \
    --data-parallel-sharding-strategy no_shard
```
#### Fully Sharded Data Parallel (FSDP)

```shell
# Megatron's optimized FSDP (~15% faster than PyTorch FSDP2)
--use-custom-fsdp

# PyTorch FSDP2
--use-torch-fsdp2

# Sharding strategies
--data-parallel-sharding-strategy optim              # Shard optimizer states (ZeRO-1)
--data-parallel-sharding-strategy optim_grads        # Shard gradients + optimizer (ZeRO-2)
--data-parallel-sharding-strategy optim_grads_params # Shard parameters + gradients + optimizer (ZeRO-3)
```
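The memory effect of each sharding strategy can be estimated with back-of-envelope arithmetic. The sketch below assumes BF16 parameters and gradients (2 bytes each) and FP32 Adam states (a master copy plus two moments, 12 bytes per parameter), and ignores activations; actual numbers depend on your optimizer and precision settings.

```python
# Rough per-GPU memory for each --data-parallel-sharding-strategy, assuming
# BF16 params/grads (2 B each) and FP32 Adam states (12 B per parameter).
def per_gpu_bytes(num_params, dp_size, strategy):
    p, g, o = 2 * num_params, 2 * num_params, 12 * num_params
    if strategy == "no_shard":            # plain DDP: replicate everything
        return p + g + o
    if strategy == "optim":               # ZeRO-1: shard optimizer states
        return p + g + o / dp_size
    if strategy == "optim_grads":         # ZeRO-2: shard grads + optimizer
        return p + g / dp_size + o / dp_size
    if strategy == "optim_grads_params":  # ZeRO-3: shard everything
        return (p + g + o) / dp_size
    raise ValueError(strategy)

gib = 1024 ** 3
for s in ["no_shard", "optim", "optim_grads", "optim_grads_params"]:
    print(f"{s:22s} {per_gpu_bytes(1.7e9, 8, s) / gib:6.1f} GiB")
```

For a 1.7B model on 8 data-parallel ranks, the spread between `no_shard` and `optim_grads_params` is roughly 8×, which is why the sharded strategies matter well before model parallelism is needed.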
### Tensor Parallelism (TP)

Split individual model layers across GPUs:

```shell
--tensor-model-parallel-size 4  # 4-way tensor parallelism
--sequence-parallel             # Enable sequence parallelism (recommended with TP)
```
### Pipeline Parallelism (PP)

Split model depth across GPUs:

```shell
--pipeline-model-parallel-size 8         # 8 pipeline stages
--virtual-pipeline-model-parallel-size 4 # Virtual pipeline for better load balancing
```
### Context Parallelism (CP)

Split long sequences across GPUs for handling long contexts:

```shell
--context-parallel-size 2                  # 2-way context parallelism
--cp-comm-type p2p                         # Communication: p2p, a2a, allgather, a2a+p2p
--hierarchical-context-parallel-sizes 2 4  # Hierarchical context parallelism
```
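A common load-balancing trick for causal attention under the p2p (ring) scheme is to cut the sequence into 2×CP chunks and give rank r both chunk r and chunk 2·CP−1−r, so early and late positions are paired. The sketch below illustrates that split in spirit; the actual Megatron implementation may differ in detail.

```python
# Toy illustration of a load-balanced context-parallel split: the sequence
# is cut into 2*cp_size chunks, and rank r keeps chunks r and 2*cp_size-1-r
# so causal-attention work is roughly even across ranks.
def cp_shard(tokens, cp_size, rank):
    chunks = 2 * cp_size
    assert len(tokens) % chunks == 0, "sequence must divide into 2*CP chunks"
    n = len(tokens) // chunks
    head = tokens[rank * n:(rank + 1) * n]
    tail_idx = chunks - 1 - rank
    tail = tokens[tail_idx * n:(tail_idx + 1) * n]
    return head + tail

seq = list(range(8))        # toy 8-token sequence with CP=2 -> 4 chunks
print(cp_shard(seq, 2, 0))  # rank 0 gets chunks 0 and 3
print(cp_shard(seq, 2, 1))  # rank 1 gets chunks 1 and 2
```

Without the pairing, the rank holding the last chunk would attend to the entire prefix while the rank holding the first chunk attends to almost nothing.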
### Expert Parallelism (EP)

For Mixture of Experts (MoE) models:

```shell
--expert-model-parallel-size 4  # 4-way expert parallelism
--num-experts 8                 # 8 experts per MoE layer
--moe-grouped-gemm              # Optimize expert computation
```
### Combining Parallelism Strategies

#### Parallelism Selection Guide

Based on NVIDIA NeMo production configurations:
| Model | Size | GPUs | TP | PP | CP | EP | Notes |
|-------|------|------|----|----|----|----|-------|
| Llama-3 | 8B | 8 | 1 | 1 | 2 | 1 | CP for long seqlen (8K) |
| Llama-3 | 70B | 64 | 4 | 4 | 2 | 1 | TP+PP |
| Llama-3.1 | 405B | 1024 | 8 | 8 | 2 | 1 | 3D parallelism for scale |
| GPT-3 | 175B | 128-512 | 4 | 8 | 1 | 1 | Large model config |
| Mixtral | 8x7B | 64 | 1 | 4 | 1 | 8 | EP for MoE |
| Mixtral | 8x22B | 256 | 4 | 4 | 8 | 8 | Combined TP+EP for large MoE |
| DeepSeek-V3 | 671B | 1024 | 2 | 16 | 1 | 64 | Large MoE config |
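When combining strategies, the data-parallel degree is implied by the other dimensions: for dense models, world_size = TP × PP × CP × DP (expert parallelism typically partitions within the data-parallel dimension rather than adding a factor). A quick sanity check against the Llama-3 70B row above:

```python
# Derive the implied data-parallel size from the model-parallel dimensions.
# world_size must be divisible by tp * pp * cp, or the config is invalid.
def dp_size(world_size, tp, pp, cp=1):
    assert world_size % (tp * pp * cp) == 0, "GPUs must divide evenly"
    return world_size // (tp * pp * cp)

print(dp_size(64, tp=4, pp=4, cp=2))      # Llama-3 70B on 64 GPUs -> DP = 2
print(dp_size(1024, tp=8, pp=8, cp=2))    # Llama-3.1 405B on 1024 GPUs -> DP = 8
```

Running this check before launching a job catches the most common misconfiguration: a GPU count that does not divide evenly into the chosen parallel sizes.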
### MoE-Specific Requirements

**Important:** When combining Expert Parallelism (EP) with Tensor Parallelism (TP), Sequence Parallelism (`--sequence-parallel`) must be enabled.
## Performance Optimizations

| Feature | Flag | Benefit |
|---------|------|---------|
