Makani
Massively parallel training of machine-learning based weather and climate models
Overview | Getting started | More information | Contributing | Further reading | References
Makani (the Hawaiian word for wind 🍃🌺) is a library designed to enable the research and development of the next generation of machine-learning (ML) based weather and climate models in PyTorch. Makani was used to train FourCastNet3 [1], Spherical Fourier Neural Operators (SFNO) [2] for weather (FourCastNet2), Huge ensemble of SFNO (HENS-SFNO) [3,4], and FourCastNet1 [5].
Makani is aimed at researchers working on ML-based weather prediction. Stable features are frequently ported to earth2studio and the NVIDIA PhysicsNeMo framework. For commercial and production purposes, we recommend checking out these packages.
<div align="center"> <img src="https://raw.githubusercontent.com/NVIDIA/makani/main/images/fcn3_ens15.gif" height="405px"> </div>

Overview
Makani is a research code developed by engineers and researchers at NVIDIA and NERSC for massively parallel training of weather and climate prediction models on 100+ GPUs, and for enabling the development of the next generation of weather and climate models. Makani is written in PyTorch and supports various forms of model- and data-parallelism, asynchronous loading of data, unpredicted channels, autoregressive training, and much more. Makani is fully configurable through .yaml configuration files and supports flexible development of novel models. Metrics, losses, and other components are designed in a modular fashion to support configurable, custom training and inference recipes at scale. Makani also supports scalable, fully online scoring modes, which are compatible with WeatherBench2. Among others, Makani was used to train the FourCastNet models on the ERA5 dataset.
Getting started
Makani can be installed by running
```
git clone git@github.com:NVIDIA/makani.git
cd makani
pip install -e .
```
Training:
Makani supports ensemble and deterministic training. Ensemble training is launched by calling ensemble.py, whereas deterministic training is launched by calling train.py. Both scripts expect CLI arguments specifying the configuration file --yaml_config and the configuration target --config, which is contained in the configuration file:
```
mpirun -np 8 --allow-run-as-root python -u train.py --yaml_config="config/fourcastnet3.yaml" --config="fcn3_sc2_edim45_layers10_pretrain1"
```
Makani supports various optimizations to fit large models into GPU memory and to enable computationally efficient training. An overview of these features and the corresponding CLI arguments is provided in the following table:
| Feature | CLI argument | options |
|---------------------------|-----------------------------------------------|------------------------------|
| Batch size | --batch_size | 1,2,3,... |
| Ensemble size | --ensemble_size | 1,2,3,... |
| Automatic Mixed Precision | --amp_mode | none, fp16, bf16 |
| Just-in-time compilation | --jit_mode | none, script, inductor |
| Activation checkpointing | --checkpointing_level | 0,1,2,3 |
| Channel parallelism | --fin_parallel_size, --fout_parallel_size | 1,2,3,... |
| Spatial model parallelism | --h_parallel_size, --w_parallel_size | 1,2,3,... |
| Ensemble parallelism | --ensemble_parallel_size | 1,2,3,... |
| Multistep training | --multistep_count | 1,2,3,... |
| Skip training | --skip_training | |
| Skip validation | --skip_validation | |
Larger models in particular are enabled by combining these techniques. Spatial model parallelism splits both the model and the data across multiple GPUs, reducing both the memory footprint of the model and the I/O load, as each rank only needs to read a fraction of the data. A typical "large" training run of SFNO can be launched by running
```
mpirun -np 256 --allow-run-as-root python -u makani.train --amp_mode=bf16 --multistep_count=1 --run_num="ngpu256_sp4" --yaml_config="config/sfnonet.yaml" --config="sfno_linear_73chq_sc3_layers8_edim384_asgl2" --h_parallel_size=4 --w_parallel_size=1 --batch_size=64
```
Here we train the model on 256 GPUs, with each sample split horizontally across 4 ranks and a global batch size of 64, which amounts to a local batch size of 1/4 sample per GPU. Memory requirements are further reduced by the use of bf16 automatic mixed precision.
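The bookkeeping behind those numbers can be sketched as follows. This is the usual model-/data-parallel arithmetic with the values taken from the flags above, not Makani internals:

```python
# Decomposition of the 256-GPU example run above.
world_size = 256      # mpirun -np 256
h_parallel = 4        # --h_parallel_size=4
w_parallel = 1        # --w_parallel_size=1
global_batch = 64     # --batch_size=64

# GPUs cooperating on a single sample (spatial model parallelism)
model_parallel = h_parallel * w_parallel
# Independent groups that each process their own samples
data_parallel = world_size // model_parallel

samples_per_group = global_batch // data_parallel
sample_fraction_per_gpu = samples_per_group / model_parallel

print(data_parallel)            # 64 data-parallel groups
print(samples_per_group)        # 1 sample per group
print(sample_fraction_per_gpu)  # 0.25, i.e. a local batch size of 1/4
```

Increasing --h_parallel_size or --w_parallel_size shrinks the per-GPU memory footprint further, at the cost of fewer data-parallel groups for a fixed GPU count.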
Inference:
Makani supports scalable and flexible online inference aimed at minimizing data movement and disk I/O, which is well suited to the low inference costs of ML weather models and modern HPC infrastructure. In a similar fashion to training, inference is launched from the CLI via inference.py and handled by inferencer.py. To launch inference on the out-of-sample dataset, we can call:
```
mpirun -np 256 --allow-run-as-root python -u makani.inference --run_num="ngpu256_sp4" --yaml_config="config/sfnonet.yaml" --config="sfno_linear_73chq_sc3_layers8_edim384_asgl2" --batch_size=64
```
By default, the inference script performs inference on the out-of-sample dataset and computes the metrics. It supports model, data, and ensemble parallelism out of the box, enabling efficient and scalable scoring. The inference script supports additional CLI arguments which enable validation on a subset of the dataset, as well as writing out inferred states:
| Feature | CLI argument | options |
|---------------------------|-----------------------------------------------|------------------------------|
| Start date | --start_date | 2018-01-01+UTC00:00:00 |
| End date | --end_date | 2018-12-31+UTC24:00:00 |
| Date step (in hours) | --date_step | 1,2,... |
| Output file | --output_file | file path for field outputs |
| Output channels | --output_channels | channels to write out |
| Metrics file | --metrics_file | file path for metrics output |
| Bias file | --bias_file | file path for bias output |
| Spectrum file | --spectrum_file | file path for spectra output |
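Combining these flags, a restricted scoring run that also writes out fields might look like the following sketch. The flag names come from the table above; the date range, step, file paths, and channel names are illustrative placeholders, not values prescribed by Makani:

```
mpirun -np 8 --allow-run-as-root python -u makani.inference \
    --yaml_config="config/sfnonet.yaml" \
    --config="sfno_linear_73chq_sc3_layers8_edim384_asgl2" \
    --run_num="ngpu256_sp4" \
    --start_date="2018-01-01+UTC00:00:00" \
    --end_date="2018-01-31+UTC00:00:00" \
    --date_step=24 \
    --output_file="/path/to/outputs.h5" \
    --output_channels="u10m,v10m" \
    --metrics_file="/path/to/metrics.h5"
```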
More about Makani
Project structure
The project is structured as follows:
```
makani
├── ...
├── config                   # configuration files, also known as recipes
├── data_process             # data pre-processing such as computation of statistics
├── datasets                 # dataset utility scripts
├── docker                   # scripts for building a docker image for training
├── makani                   # main directory containing the package
│   ├── inference            # contains the inferencer
│   ├── mpu                  # utilities for model parallelism
│   ├── networks             # networks, contains definitions of various ML models
│   ├── third_party/climt    # third party modules
│   │   └── zenith_angle.py  # computation of zenith angle
│   ├── utils                # utilities
│   │   ├── dataloaders      # contains various dataloaders
│   │   ├── metrics          # routines for scoring and benchmarking
│   │   ├── ...
│   │   ├── comm.py          # comms module for orthogonal communicator infrastructure
│   │   ├── dataloader.py    # dataloader interface
│   │   ├── metric.py        # centralized metrics handler
│   │   ├── trainer_profile.py # copy of trainer.py used for profiling
│   │   └── trainer.py       # main file for handling training
│   ├── ...
│   ├── inference.py         # CLI script for launching inference
│   ├── train.py             # CLI script for launching training
├── tests                    # test files
└── README.md                # this file
```
Model and Training configuration
Model training in Makani is specified through .yaml files located in the config folder. The corresponding models are located in makani/models and registered in the model registry in models/model_registry.py. The following table lists the most important configuration options.
| Configuration Key | Description | Options |
|---------------------------|---------------------------------------------------------|---------------------------------------------------------|
| nettype                   | Network architecture                                    |                                                         |
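As an orientation, a configuration target is a named YAML block inside one of the files in config. The sketch below is hypothetical: only the nettype key is documented above, and the target name, nettype value, and remaining keys are illustrative placeholders rather than Makani's actual schema:

```yaml
# Hypothetical config target; key names other than nettype are illustrative.
my_sfno_experiment:
  nettype: "SFNO"   # selects the model from the model registry
  batch_size: 64
  lr: 1.0e-3
```

A target like this would then be selected on the command line via --yaml_config pointing at the file and --config="my_sfno_experiment".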