
Makani: Massively parallel training of machine-learning based weather and climate models

Overview | Getting started | More information | Contributing | Further reading | References


Makani (the Hawaiian word for wind 🍃🌺) is a library designed to enable the research and development of the next generation of machine-learning (ML) based weather and climate models in PyTorch. Makani was used to train FourCastNet3 [1], Spherical Fourier Neural Operators (SFNO) [2] for weather (FourCastNet2), Huge ensemble of SFNO (HENS-SFNO) [3,4], and FourCastNet1 [5].

Makani is aimed at researchers working on ML-based weather prediction. Stable features are frequently ported to earth2studio and the NVIDIA PhysicsNeMo framework. For commercial and production purposes, we recommend checking out these packages.

<div align="center"> <img src="https://raw.githubusercontent.com/NVIDIA/makani/main/images/fcn3_ens15.gif" height="405px"> </div>

Overview

Makani is research code developed by engineers and researchers at NVIDIA and NERSC for massively parallel training of weather and climate prediction models on 100+ GPUs, and for developing the next generation of such models. Makani is written in PyTorch and supports various forms of model and data parallelism, asynchronous data loading, unpredicted channels, autoregressive training and much more. Makani is fully configurable through .yaml configuration files and supports flexible development of novel models. Metrics, losses and other components are designed in a modular fashion to support configurable, custom training and inference recipes at scale. Makani also supports scalable, fully online scoring modes that are compatible with WeatherBench2. Among others, Makani was used to train the FourCastNet models on the ERA5 dataset.

Getting started

Makani can be installed by running

git clone git@github.com:NVIDIA/makani.git
cd makani
pip install -e .

Training:

Makani supports ensemble and deterministic training. Ensemble training is launched by calling ensemble.py, whereas deterministic training is launched by calling train.py. Both scripts expect CLI arguments that specify the configuration file (--yaml_config) and the configuration target (--config), which is contained in the configuration file:

mpirun -np 8 --allow-run-as-root python -u train.py --yaml_config="config/fourcastnet3.yaml" --config="fcn3_sc2_edim45_layers10_pretrain1"

Makani supports various optimizations to fit large models into GPU memory and enable computationally efficient training. An overview of these features and the corresponding CLI arguments is provided in the following table:

| Feature | CLI argument | Options |
|---------------------------|-----------------------------------------------|------------------------------|
| Batch size | --batch_size | 1,2,3,... |
| Ensemble size | --ensemble_size | 1,2,3,... |
| Automatic Mixed Precision | --amp_mode | none, fp16, bf16 |
| Just-in-time compilation | --jit_mode | none, script, inductor |
| Activation checkpointing | --checkpointing_level | 0,1,2,3 |
| Channel parallelism | --fin_parallel_size, --fout_parallel_size | 1,2,3,... |
| Spatial model parallelism | --h_parallel_size, --w_parallel_size | 1,2,3,... |
| Ensemble parallelism | --ensemble_parallel_size | 1,2,3,... |
| Multistep training | --multistep_count | 1,2,3,... |
| Skip training | --skip_training | |
| Skip validation | --skip_validation | |
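The flags in the table can be combined freely on one command line. As a minimal sketch, the helper below assembles such a launch command in Python. `build_train_command` is a hypothetical name used for illustration and is not part of Makani; it only strings together the `mpirun` invocation and the flags shown in this README.

```python
import shlex


def build_train_command(num_ranks, yaml_config, config, **options):
    """Assemble an mpirun launch command for train.py as a list of arguments.

    Hypothetical helper for illustration only; it is not part of Makani.
    Keyword arguments become --name=value flags; a value of True becomes
    a bare flag such as --skip_validation.
    """
    cmd = [
        "mpirun", "-np", str(num_ranks), "--allow-run-as-root",
        "python", "-u", "train.py",
        f"--yaml_config={yaml_config}",
        f"--config={config}",
    ]
    for name, value in options.items():
        if value is True:
            cmd.append(f"--{name}")
        else:
            cmd.append(f"--{name}={value}")
    return cmd


cmd = build_train_command(
    8,
    "config/fourcastnet3.yaml",
    "fcn3_sc2_edim45_layers10_pretrain1",
    amp_mode="bf16",
    batch_size=8,
    skip_validation=True,
)
# Print a shell-ready command line; pass `cmd` to subprocess.run to launch it.
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run` without shell quoting issues.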

Larger models in particular are made possible by combining these techniques. Spatial model parallelism splits both the model and the data across multiple GPUs, reducing both the memory footprint of the model and the I/O load, as each rank only needs to read a fraction of the data. A typical "large" training run of SFNO can be launched by running

mpirun -np 256 --allow-run-as-root python -u -m makani.train --amp_mode=bf16 --multistep_count=1 --run_num="ngpu256_sp4" --yaml_config="config/sfnonet.yaml" --config="sfno_linear_73chq_sc3_layers8_edim384_asgl2" --h_parallel_size=4 --w_parallel_size=1 --batch_size=64

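Here we train the model on 256 GPUs with a global batch size of 64. Each model instance is split spatially across 4 ranks (--h_parallel_size=4), which leaves 256 / 4 = 64 data-parallel groups that each process one sample. Every GPU therefore holds a quarter of a sample, which amounts to a local batch size of 1/4. Memory requirements are further reduced by the use of bf16 automatic mixed precision.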
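The arithmetic behind this launch can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Makani's implementation: `parallel_layout` and `split_rows` are hypothetical helpers, `--batch_size` is treated as the global batch size (as the example above implies), channel and ensemble parallelism are ignored, and the 721-row grid (the latitude count of the 0.25° ERA5 grid) and the way a remainder row is assigned are illustrative choices.

```python
from fractions import Fraction


def parallel_layout(num_ranks, h_parallel_size, w_parallel_size, batch_size):
    """Derive the data-parallel layout implied by the launch parameters.

    Hypothetical helper: assumes batch_size is global and ignores
    channel and ensemble parallelism.
    """
    model_parallel_size = h_parallel_size * w_parallel_size
    if num_ranks % model_parallel_size != 0:
        raise ValueError("number of ranks must be divisible by the model-parallel size")
    data_parallel_size = num_ranks // model_parallel_size
    if batch_size % data_parallel_size != 0:
        raise ValueError("batch size must be divisible by the number of data-parallel groups")
    samples_per_group = batch_size // data_parallel_size
    # Each sample is spread over all ranks of one model-parallel group.
    samples_per_gpu = Fraction(samples_per_group, model_parallel_size)
    return data_parallel_size, samples_per_group, samples_per_gpu


def split_rows(num_rows, num_parts):
    """Split row indices into contiguous, near-equal [start, stop) ranges.

    Assumed convention: leading parts receive one extra row when the
    split is uneven. Makani's own convention may differ.
    """
    base, extra = divmod(num_rows, num_parts)
    bounds, start = [], 0
    for part in range(num_parts):
        size = base + (1 if part < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


groups, per_group, per_gpu = parallel_layout(256, 4, 1, 64)
print(groups, per_group, per_gpu)  # 64 1 1/4

# Rows of a 721-row latitude grid that each of the 4 spatial ranks would read.
print(split_rows(721, 4))
```

Each rank reads only its own row range, which is why spatial parallelism lowers the I/O load per rank as well as the memory footprint.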

Inference:

Makani supports scalable and flexible online inference aimed at minimizing data movement and disk I/O, which is well suited to the low inference costs of ML weather models and to modern HPC infrastructure. As with training, inference is launched from the CLI through inference.py and handled by inferencer.py. To launch inference on the out-of-sample dataset, we can call:

mpirun -np 256 --allow-run-as-root python -u -m makani.inference --run_num="ngpu256_sp4" --yaml_config="config/sfnonet.yaml" --config="sfno_linear_73chq_sc3_layers8_edim384_asgl2" --batch_size=64

By default, the inference script will perform inference on the out-of-sample dataset and compute the metrics. The inference script supports model, data and ensemble parallelism out of the box, enabling efficient and scalable scoring. It also supports additional CLI arguments that enable validation on a subset of the dataset, as well as writing out inferred states:

| Feature | CLI argument | Options |
|---------------------------|-----------------------------------------------|------------------------------|
| Start date | --start_date | 2018-01-01+UTC00:00:00 |
| End date | --end_date | 2018-12-31+UTC24:00:00 |
| Date step (in hours) | --date_step | 1,2,... |
| Output file | --output_file | file path for field outputs |
| Output channels | --output_channels | channels to write out |
| Metrics file | --metrics_file | file path for metrics output |
| Bias file | --bias_file | file path for bias output |
| Spectrum file | --spectrum_file | file path for spectra output |
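To get a feel for how a start date, end date and date step select a subset of the dataset, the sketch below enumerates timestamps in the format shown in the table. It is one plausible reading, not Makani's implementation: `parse_date` and `initial_times` are hypothetical helpers, and treating the end date as exclusive and `UTC24:00:00` as midnight of the following day are assumptions.

```python
from datetime import datetime, timedelta, timezone

DATE_FORMAT = "%Y-%m-%d+UTC%H:%M:%S"
HOUR_24_SUFFIX = "+UTC24:00:00"


def parse_date(text):
    """Parse a timestamp such as 2018-01-01+UTC00:00:00 into an aware UTC datetime.

    Hypothetical helper. strptime rejects hour 24, so UTC24:00:00 is
    assumed to mean midnight of the following day.
    """
    rollover = text.endswith(HOUR_24_SUFFIX)
    if rollover:
        text = text[: -len(HOUR_24_SUFFIX)] + "+UTC00:00:00"
    parsed = datetime.strptime(text, DATE_FORMAT).replace(tzinfo=timezone.utc)
    return parsed + timedelta(days=1) if rollover else parsed


def initial_times(start_date, end_date, date_step):
    """List timestamps from start (inclusive) to end (exclusive), date_step hours apart."""
    current, end = parse_date(start_date), parse_date(end_date)
    step = timedelta(hours=date_step)
    times = []
    while current < end:
        times.append(current)
        current += step
    return times


# One timestamp per day over the example range from the table.
times = initial_times("2018-01-01+UTC00:00:00", "2018-12-31+UTC24:00:00", 24)
print(len(times), times[0].isoformat(), times[-1].isoformat())
```

Under these assumptions, a date step of 24 over the example range yields one timestamp per day of 2018.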

More about Makani

Project structure

The project is structured as follows:

makani
├── ...
├── config                      # configuration files, also known as recipes
├── data_process                # data pre-processing such as computation of statistics
├── datasets                    # dataset utility scripts
├── docker                      # scripts for building a docker image for training
├── makani                      # Main directory containing the package
│   ├── inference               # contains the inferencer
│   ├── mpu                     # utilities for model parallelism
│   ├── networks                # networks, contains definitions of various ML models
│   ├── third_party/climt       # third party modules
│   │   └── zenith_angle.py     # computation of zenith angle
│   ├── utils                   # utilities
│   │   ├── dataloaders         # contains various dataloaders
│   │   ├── metrics             # metrics folder contains routines for scoring and benchmarking.
│   │   ├── ...
│   │   ├── comm.py             # comms module for orthogonal communicator infrastructure
│   │   ├── dataloader.py       # dataloader interface
│   │   ├── metric.py           # centralized metrics handler
│   │   ├── trainer_profile.py  # copy of trainer.py used for profiling
│   │   └── trainer.py          # main file for handling training
│   ├── ...
│   ├── inference.py            # CLI script for launching inference
│   ├── train.py                # CLI script for launching training
├── tests                       # test files
└── README.md                   # this file

Model and Training configuration

Model training in Makani is specified through .yaml files located in the config folder. The corresponding models are located in models and registered in the model registry in models/model_registry.py. The following table lists the most important configuration options.

| Configuration Key | Description | Options |
|---------------------------|---------------------------------------------------------|---------------------------------------------------------|
| nettype | Network
