Megalodon
Reference implementation of Megalodon 7B model
Install / Use
/learn @XuezheMax/MegalodonREADME
Megalodon
Reference implementation of Megalodon 7B model.
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Xuezhe Ma*, Xiaomeng Yang*, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou*
Discord: https://discord.gg/Unf8Fa7kWt
Updates
- [April 15th 2024] Release Repo to public.
Installation
First install PyTorch 2.0.1 with cuda 11.7
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
Then, install the apex package.
# clone the repo
git clone https://github.com/NVIDIA/apex.git
cd apex
# checkout to the correct verison
git checkout 23.08
# complie & install
pip install packaging
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
Install the fairscale package.
# clone the repo
git clone https://github.com/facebookresearch/fairscale.git
cd fairscale
# checkout to the correct branch for bf16
git checkout ngoyal_bf16_changes
# install
pip install .
Finally, install megalodon
https://github.com/XuezheMax/megalodon.git
cd megalodon
pip install -r requirements.txt
pip install -e .
Evaluating Pretrained LLMs
To launch an evaluation job, we recommend to use torchrun or slurm. We provide an example script of torchrun
export NGPU=<NUM_GPUS>; torchrun --nproc_per_node=$NGPU eval.py \
--model_parallel_size 1 \
--checkpoint_dir <Checkpoint Dir> \
--tokenizer_path <Tokenizer Path> \ # remove it if using origianl model tokenizer
--dump_dir <Dump Dir> \ # diretory where results are dumped
--dtype "bf16" \ # default is bf16
--master_logging_only "true" \ # only print logs from master
--ppl_files_str <PPL Files> \ # comma separated list of files to eval PPL
--prompt_path <Prompt File Path> \
--batch_size <Batch Size>
All the data should be prepared in jsonl format where the content texts are in the text field of each json item.
Launching a job for LLM pretraining
We provide a pseudo code for LLM pretraining
from logging import getLogger
from megalodon.logger import initialize_logger
from megalodon.distributed import (
init_signal_handler,
init_torch_distributed,
initialize_model_parallel,
)
from megalodon.model.mega import build_model
from megalodon.optim import build_optimizer
from megalodon.modules.losses import cross_entropy
from megalodon.utils import (
setup_env,
log_host,
clip_grad_norm_,
set_random_seed,
)
logger = getLogger()
initialize_logger()
setup_env()
log_host()
cfg = TrainerConf() # training config
init_signal_handler()
# initialize distributed mode / model parallel
logger.info("Starting init of torch.distributed...")
slurm_cfg = init_torch_distributed()
logger.info("Done init of torch.distributed.")
logger.info("Starting init of model parallel...")
initialize_model_parallel(cfg.model_parallel_size, cfg.chunk_parallel_size)
logger.info("Done init of model parallel.")
logger.info(
f"Global rank: {slurm_cfg.global_rank} -- "
f"model parallel rank: {cfg.model_parallel_rank}/{cfg.model_parallel_size} -- "
f"chunk parallel rank: {cfg.chunk_parallel_rank}/{cfg.chunk_parallel_size} -- "
f"data parallel rank: {cfg.data_parallel_rank}/{cfg.data_parallel_size}"
)
dataloader = DataLoader()
logger.info("Start building of model...")
model = build_model(cfg.model, dtype=cfg.dtype,
fp32_reduce_scatter=cfg.fp32_reduce_scatter,
reshard_after_forward=cfg.reshard_after_forward)
logger.info(model)
model.train()
# build optimizer / scheduler
optimizer, scheduler = build_optimizer(model, cfg.optim, cfg.steps, cfg.dtype)
for batch in dataloader:
x, y = batch
pred, _ = model(x) # forward pass
tok_loss = cross_entropy(pred, y)
loss = tok_loss.mean()
loss.backward() # backward pass
model.grad_all_reduce() # sync grad across each chunk parallel group
clip_grad_norm_(fsdp_module=model, max_norm=cfg.optim.clip) # grad clip
optimizer.step() # optimizer step
scheduler.step() # scheduler step
optimizer.zero_grad()
References
@misc{ma2024megalodon,
title={Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length},
author={Xuezhe Ma and Xiaomeng Yang and Wenhan Xiong and Beidi Chen and Lili Yu and Hao Zhang and Jonathan May and Luke Zettlemoyer and Omer Levy and Chunting Zhou},
year={2024},
eprint={2404.08801},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@inproceedings{
ma2023mega,
title={Mega: Moving Average Equipped Gated Attention},
author={Xuezhe Ma and Chunting Zhou and Xiang Kong and Junxian He and Liangke Gui and Graham Neubig and Jonathan May and Luke Zettlemoyer},
booktitle={The Eleventh International Conference on Learning Representations },
year={2023},
}
Related Skills
node-connect
338.7kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
83.6kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
338.7kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
83.6kCommit, push, and open a PR
