# MegaDLMs
GPU-optimized framework for training diffusion language models at any scale. The backend of Quokka, Super Data Learners, and OpenMoE 2 training.
<p align="center" width="100%">
  <img src="images/vs_other_backend.png" width="80%" height="100%">
</p>
<p align="center" width="100%">
  <img src="images/weak_scaling.png" width="80%" height="100%">
</p>

## Highlights
- **Comprehensive Training Pipelines**: Full support for Diffusion Language Models (DLMs) and Autoregressive LMs, from pre-training and SFT to RL, on both dense and MoE architectures.
- **Ultra Speed and Scalability**: MegaDLMs reaches up to 47% Model FLOP Utilization (MFU) and up to 3× faster training compared with other frameworks (see here for benchmarking).
- **Cutting-edge Backend**: Leverages flexible parallelism from Megatron-LM and GPU-optimized Transformer layers with fused kernels and mixed-precision (FP8, FP16, BF16) support from Transformer Engine.
- **HuggingFace Integration**: Seamlessly work with HuggingFace checkpoints.
## Latest News
- **[2025-11-02]** We release MegaDLMs, the training backend for Quokka, Super Data Learners, and OpenMoE 2: an ultra-fast, scalable framework for DLM training at any scale. The MoE components will be merged once OpenMoE 2 training completes.
## Quick Start

### 1. Installation
We strongly recommend using the PyTorch NGC Container for optimal compatibility. The 24.11-py3 version (`nvcr.io/nvidia/pytorch:24.11-py3`) is recommended:

```shell
docker pull nvcr.io/nvidia/pytorch:24.11-py3
```

Or build your own image from a Dockerfile starting with:

```dockerfile
FROM nvcr.io/nvidia/pytorch:24.11-py3
# the remaining Dockerfile content
```
If external images are not supported on your cluster, follow the Complete Installation Guide, which covers Docker, pip variants (dev, lts, etc.), source installation, and system requirements. The specific dependencies are listed in `requirements.txt`.
### 2. Set Up Environment Variables

Set the environment variables as instructed in `envs/.env`.
## Project Structure
We also built an interactive doc with DeepWiki to help you better understand this repo.
```
mega-dlms/
├── megatron/
│   ├── core/                  # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/            # Transformer models
│   │   ├── transformer/       # Transformer building blocks
│   │   ├── tensor_parallel/   # Tensor parallelism
│   │   ├── pipeline_parallel/ # Pipeline parallelism
│   │   ├── distributed/       # Distributed training (FSDP, DDP)
│   │   ├── optimizer/         # Optimizers
│   │   ├── datasets/          # Dataset loaders
│   │   ├── inference/         # Inference engines
│   │   └── export/            # Model export (e.g. TensorRT-LLM)
│   ├── training/              # Training scripts
│   ├── inference/             # Inference server
│   ├── legacy/                # Legacy components
│   └── post_training/         # Post-training (RLHF, etc.)
├── examples/                  # Ready-to-use training examples
├── tools/                     # Utility tools
├── tests/                     # Comprehensive test suite
└── docs/                      # Documentation
```
<br>
## System Requirements

### Hardware Requirements

- FP8 support: NVIDIA Hopper, Ada, or Blackwell GPUs
- Recommended: NVIDIA Turing architecture or later

### Software Requirements

- CUDA/cuDNN/NCCL: latest stable versions
- PyTorch: latest stable version
- Transformer Engine: latest stable version
- Python: 3.12 recommended
## Training

### Data Preparation
MegaDLMs consumes a tokenized corpus, so you must tokenize your training/validation sets in advance and store them on disk. A tokenization script, `tools/preprocess_data.py`, is provided; it takes `.jsonl` files as input, as shown below.

For example, if you gather all your data into a single `.jsonl` file and run `tools/preprocess_data.py` once with `--output-prefix path/to/processed_data`, you will get two tokenized files: `path/to/processed_data.bin` and `path/to/processed_data.idx`, where the `.bin` file stores the token ids and the `.idx` file stores the positions of the sequences.

You can also split your corpus into N `.jsonl` files and tokenize them into N file pairs by running `tools/preprocess_data.py` N times. Search for `--data-path` or `--per-split-data-args-path` in `megatron/training/arguments.py` to learn how to use them in training.
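To make the `.bin`/`.idx` pairing concrete, here is a minimal sketch of the indexed-dataset idea: token ids in a flat binary file plus an offsets file, so any sequence can be sliced out without loading the whole corpus. This illustrates the concept only; the exact Megatron binary layout differs.

```python
# Illustrative sketch (NOT the exact Megatron .bin/.idx format):
# uint32 token ids in <prefix>.bin, cumulative uint64 offsets in <prefix>.idx.
import struct

def write_indexed(sequences, prefix):
    """Write token-id sequences to <prefix>.bin and offsets to <prefix>.idx."""
    offsets = [0]
    with open(prefix + ".bin", "wb") as f:
        for seq in sequences:
            f.write(struct.pack(f"<{len(seq)}I", *seq))
            offsets.append(offsets[-1] + len(seq))
    with open(prefix + ".idx", "wb") as f:
        f.write(struct.pack(f"<{len(offsets)}Q", *offsets))

def read_sequence(prefix, i):
    """Read back the i-th sequence by seeking directly to its offset."""
    with open(prefix + ".idx", "rb") as f:
        raw = f.read()
    offsets = struct.unpack(f"<{len(raw) // 8}Q", raw)
    start, end = offsets[i], offsets[i + 1]
    with open(prefix + ".bin", "rb") as f:
        f.seek(start * 4)  # 4 bytes per uint32 token id
        chunk = f.read((end - start) * 4)
    return list(struct.unpack(f"<{end - start}I", chunk))
```

This offset-based design is what makes tokenizing once up front worthwhile: training workers can random-access any sample in O(1) without re-tokenizing.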
#### JSONL Data Format

```json
{"text": "Your training text here..."}
{"text": "Another training sample..."}
```
#### Basic Preprocessing

```shell
python tools/preprocess_data.py \
    --input data.jsonl \
    --output-prefix path/to/processed_data \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model /path/to/tokenizer.model \
    --workers 8 \
    --append-eod
```
#### Key Arguments

- `--input`: Path to the input JSON/JSONL file
- `--output-prefix`: Prefix for the output binary files (`.bin` and `.idx`)
- `--tokenizer-type`: Tokenizer type (`HuggingFaceTokenizer`, `GPT2BPETokenizer`, etc.)
- `--tokenizer-model`: Path to the tokenizer model file
- `--workers`: Number of parallel workers for processing
- `--append-eod`: Add an end-of-document token
### Train from Scratch

We provide examples of the whole pipeline for training DLMs from scratch. You can find them under `examples/dlm_training`.

#### Pre-train

```shell
source envs/.env; bash examples/dlm_training/dlm_pretrain_1.7b.sh
```
Find all training arguments in `custom_args/difflm.py` and `megatron/training/arguments.py`.
#### Checkpoint Conversion

```shell
source envs/.env; bash examples/dlm_training/ckpt_conversion.sh
source envs/.env; bash examples/dlm_training/ckpt_conversion_validation.sh
```

#### Generate with Your Trained DLM

```shell
source envs/.env; python examples/dlm_generation/dlm_inference.py
```
### Train from a Pre-trained HuggingFace Checkpoint

We will provide an example soon. In the meantime, you can set it up with the changes below; the process is quite similar.

Convert the HuggingFace checkpoint to Megatron format with `tools/weights_conversion/hf_to_megatron_te.py`. Similarly, verify the precision with `tools/weights_conversion/utils/verify_correctness_dlm.py`.

For training, you additionally need to specify the `--load`, `--finetune`, and `--use-checkpoint-args` flags, as detailed in `megatron/training/arguments.py`.
## Parallelism Strategies

### Data Parallelism (DP)

#### Standard Data Parallel

```shell
# Standard DDP: replicate the model on each GPU
torchrun --nproc_per_node=8 pretrain_difflm.py \
    --data-parallel-sharding-strategy no_shard
```
#### Fully Sharded Data Parallel (FSDP)

```shell
# Megatron's optimized FSDP (~15% faster than PyTorch FSDP2)
--use-custom-fsdp

# PyTorch FSDP2
--use-torch-fsdp2

# Sharding strategies
--data-parallel-sharding-strategy optim              # Shard optimizer states (ZeRO-1)
--data-parallel-sharding-strategy optim_grads        # Shard gradients + optimizer (ZeRO-2)
--data-parallel-sharding-strategy optim_grads_params # Shard parameters + gradients + optimizer (ZeRO-3)
```
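The memory effect of each sharding strategy can be estimated with back-of-envelope arithmetic. The sketch below assumes BF16 parameters and gradients (2 bytes each) and FP32 Adam states (a master copy plus two moments, 12 bytes per parameter), and ignores activations; actual numbers depend on your optimizer and precision settings.

```python
# Rough per-GPU memory for each --data-parallel-sharding-strategy, assuming
# BF16 params/grads (2 B each) and FP32 Adam states (12 B per parameter).
def per_gpu_bytes(num_params, dp_size, strategy):
    p, g, o = 2 * num_params, 2 * num_params, 12 * num_params
    if strategy == "no_shard":            # plain DDP: replicate everything
        return p + g + o
    if strategy == "optim":               # ZeRO-1: shard optimizer states
        return p + g + o / dp_size
    if strategy == "optim_grads":         # ZeRO-2: shard grads + optimizer
        return p + g / dp_size + o / dp_size
    if strategy == "optim_grads_params":  # ZeRO-3: shard everything
        return (p + g + o) / dp_size
    raise ValueError(strategy)

gib = 1024 ** 3
for s in ["no_shard", "optim", "optim_grads", "optim_grads_params"]:
    print(f"{s:22s} {per_gpu_bytes(1.7e9, 8, s) / gib:6.1f} GiB")
```

For a 1.7B model on 8 data-parallel ranks, the spread between `no_shard` and `optim_grads_params` is roughly 8×, which is why the sharded strategies matter well before model parallelism is needed.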
### Tensor Parallelism (TP)

Split individual model layers across GPUs:

```shell
--tensor-model-parallel-size 4  # 4-way tensor parallelism
--sequence-parallel             # Enable sequence parallelism (recommended with TP)
```
### Pipeline Parallelism (PP)

Split model depth across GPUs:

```shell
--pipeline-model-parallel-size 8         # 8 pipeline stages
--virtual-pipeline-model-parallel-size 4 # Virtual pipeline for better load balancing
```
### Context Parallelism (CP)

Split long sequences across GPUs for handling long contexts:

```shell
--context-parallel-size 2                  # 2-way context parallelism
--cp-comm-type p2p                         # Communication: p2p, a2a, allgather, a2a+p2p
--hierarchical-context-parallel-sizes 2 4  # Hierarchical context parallelism
```
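A common load-balancing trick for causal attention under the p2p (ring) scheme is to cut the sequence into 2×CP chunks and give rank r both chunk r and chunk 2·CP−1−r, so early and late positions are paired. The sketch below illustrates that split in spirit; the actual Megatron implementation may differ in detail.

```python
# Toy illustration of a load-balanced context-parallel split: the sequence
# is cut into 2*cp_size chunks, and rank r keeps chunks r and 2*cp_size-1-r
# so causal-attention work is roughly even across ranks.
def cp_shard(tokens, cp_size, rank):
    chunks = 2 * cp_size
    assert len(tokens) % chunks == 0, "sequence must divide into 2*CP chunks"
    n = len(tokens) // chunks
    head = tokens[rank * n:(rank + 1) * n]
    tail_idx = chunks - 1 - rank
    tail = tokens[tail_idx * n:(tail_idx + 1) * n]
    return head + tail

seq = list(range(8))        # toy 8-token sequence with CP=2 -> 4 chunks
print(cp_shard(seq, 2, 0))  # rank 0 gets chunks 0 and 3
print(cp_shard(seq, 2, 1))  # rank 1 gets chunks 1 and 2
```

Without the pairing, the rank holding the last chunk would attend to the entire prefix while the rank holding the first chunk attends to almost nothing.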
### Expert Parallelism (EP)

For Mixture of Experts (MoE) models:

```shell
--expert-model-parallel-size 4  # 4-way expert parallelism
--num-experts 8                 # 8 experts per MoE layer
--moe-grouped-gemm              # Optimize expert computation
```
### Combining Parallelism Strategies

#### Parallelism Selection Guide

Based on NVIDIA NeMo production configurations:
| Model | Size | GPUs | TP | PP | CP | EP | Notes |
|-------|------|------|----|----|----|----|-------|
| Llama-3 | 8B | 8 | 1 | 1 | 2 | 1 | CP for long seqlen (8K) |
| Llama-3 | 70B | 64 | 4 | 4 | 2 | 1 | TP+PP |
| Llama-3.1 | 405B | 1024 | 8 | 8 | 2 | 1 | 3D parallelism for scale |
| GPT-3 | 175B | 128-512 | 4 | 8 | 1 | 1 | Large model config |
| Mixtral | 8x7B | 64 | 1 | 4 | 1 | 8 | EP for MoE |
| Mixtral | 8x22B | 256 | 4 | 4 | 8 | 8 | Combined TP+EP for large MoE |
| DeepSeek-V3 | 671B | 1024 | 2 | 16 | 1 | 64 | Large MoE config |
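When combining strategies, the data-parallel degree is implied by the other dimensions: for dense models, world_size = TP × PP × CP × DP (expert parallelism typically partitions within the data-parallel dimension rather than adding a factor). A quick sanity check against the Llama-3 70B row above:

```python
# Derive the implied data-parallel size from the model-parallel dimensions.
# world_size must be divisible by tp * pp * cp, or the config is invalid.
def dp_size(world_size, tp, pp, cp=1):
    assert world_size % (tp * pp * cp) == 0, "GPUs must divide evenly"
    return world_size // (tp * pp * cp)

print(dp_size(64, tp=4, pp=4, cp=2))      # Llama-3 70B on 64 GPUs -> DP = 2
print(dp_size(1024, tp=8, pp=8, cp=2))    # Llama-3.1 405B on 1024 GPUs -> DP = 8
```

Running this check before launching a job catches the most common misconfiguration: a GPU count that does not divide evenly into the chosen parallel sizes.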
### MoE-Specific Requirements

**Important:** When combining Expert Parallelism (EP) with Tensor Parallelism (TP), Sequence Parallelism (`--sequence-parallel`) must be enabled.
## Performance Optimizations

| Feature | Flag | Benefit |
|---------|------|---------|
