ClimateSet
A Large-Scale Climate Model Dataset for Machine Learning
ClimateSet Emulation
This repository contains the code for running the climate model emulation benchmark experiments on the core ClimateSet data. Here we provide documentation on installation, setup, and a quickstart guide to reproduce our experiments and run your own experiments.
Important information:
- Website
- Publication on ArXiv
- Core dataset on HuggingFace
- Pre-trained models on HuggingFace
- Readthedocs (Contains some basic intro and explanations regarding climate modeling. More technical information will follow.)
- Dataset Extension Pipeline: Currently under active development. Python package expected to be released in January 2025.
This repository is currently under active development and you may encounter bugs with some functionality. Any feedback, extensions & suggestions are welcome!
Getting started
Downloading the core dataset
The preprocessed dataset is available on HuggingFace. You can download the entire dataset or pick only specific climate models for the targets. Please note that the core dataset entails 1) two variables (precipitation (pr) & temperature (tas)), 2) 250 km nominal resolution, and 3) monthly data. This is the data that was used for the benchmarking. We will release code to preprocess other variables and other resolutions in a separate Python package and will update the HuggingFace data periodically.
HuggingFace
To download the entire dataset, you can make use of the provided Python script:
python scripts/download_climateset_huggingface.py
If you wish to download only specific climate model data, please refer to the instructions on HuggingFace.
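As a rough sketch of what a selective download could look like in Python, assuming the huggingface_hub package is installed: snapshot_download accepts allow_patterns to restrict which files are fetched. The repository id, the "*inputs*" pattern, and the assumption that target file paths contain the climate model name are placeholders and not taken from the dataset page, so check the instructions on HuggingFace for the actual layout.

```python
from typing import List, Sequence


def patterns_for_models(
    models: Sequence[str], always: Sequence[str] = ("*inputs*",)
) -> List[str]:
    """Build glob patterns: shared files plus one pattern per climate model.

    Assumption: target file paths contain the climate model name and the
    shared input files live under a path containing "inputs".
    """
    return list(always) + [f"*{name}*" for name in models]


def download_subset(repo_id: str, models: Sequence[str], local_dir: str) -> str:
    """Download only the files matching the requested climate models."""
    # Imported lazily so patterns_for_models works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,  # hypothetical: take the id from the dataset page
        repo_type="dataset",
        allow_patterns=patterns_for_models(models),
        local_dir=local_dir,
    )
```

A call such as download_subset("<org>/<dataset>", ["NorESM2-LM"], "Climateset_DATA") would then fetch only the matching files, with "<org>/<dataset>" replaced by the real repository id.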
Arbutus / DRAC
If you happen to be in Canada, you can also download the dataset via Arbutus (DRAC - Digital Research Alliance of Canada). Please note that this option is very slow for users located outside of Canada. We recommend this option mostly for users who are working directly on DRAC anyway.
1. Setting your dataset path
Set the path where you want your dataset to be downloaded in:
constants.py AND scripts/download_climateset_arbutus.sh
2. Download the data via bash script
bash scripts/download_climateset_arbutus.sh
Please note that this by default only downloads NorESM2-LM data. To download data for all climate models, please uncomment the line with the for loop.
You should now see a newly created directory called "Climateset_DATA" containing inputs and targets. This folder will be referenced within the emulator pipeline.
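A quick way to confirm the download produced the expected layout is a small check like the one below. It is only a sketch: the subdirectory names "inputs" and "targets" are assumed from the description above, and the data path is whatever you configured in constants.py.

```python
from pathlib import Path
from typing import List, Sequence


def missing_subdirs(
    data_root: str, expected: Sequence[str] = ("inputs", "targets")
) -> List[str]:
    """Return the expected subdirectories that are absent under data_root."""
    root = Path(data_root)
    return [name for name in expected if not (root / name).is_dir()]


# Example: an empty list means both folders were found.
# print(missing_subdirs("Climateset_DATA"))
```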
Setting up the environment
To set up the environment for causalpaca, we use python>=3.10. There are two separate requirements files for creating environments.
To create the environment used for training UNet & ConvLSTM models, use requirements.txt. For ClimaX related experiments, use requirements_climax.txt.
Non-Windows users can create the environment with the following steps:
Not Climax:
python -m venv env
source env/bin/activate
pip install -r requirements.txt
cd emulator
pip install -e .
Climax:
python -m venv env_climax
source env_climax/bin/activate
pip install -r requirements_climax.txt
cd emulator
pip install -e .
Windows users can create the environment as follows:
python -m venv env_emulator
env_emulator/Scripts/activate
pip install -r requirements.txt
cd emulator
pip install -e .
For ClimaX: Download pre-trained checkpoints
To work with ClimaX, you will need to download the pre-trained checkpoints from the original release and place them in the correct folder. To do so, execute the following command:
bash scripts/download_climax_checkpoints.sh
Pythonpath
You may have to modify the PYTHONPATH environment variable so that it contains the root folder of the ClimateSet project for the emulator to work.
Non-Windows users:
#input the path to the Climateset folder
export PYTHONPATH=/home/user/myproject # Or export PYTHONPATH=$PWD if you are in the right directory
#Check for success with:
echo $PYTHONPATH
Windows users:
#input the path to the Climateset folder
$env:PYTHONPATH = "home/user/myproject"
#Check for success with:
echo $env:PYTHONPATH
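To check from inside Python that the interpreter actually picked up the project root, a small helper along these lines can be used. It is a sketch; the path passed in is your own ClimateSet root, and the function name is made up for illustration.

```python
import sys
from pathlib import Path


def root_on_path(project_root: str) -> bool:
    """Return True if project_root is among the interpreter's import paths."""
    target = Path(project_root).resolve()
    # An empty sys.path entry stands for the current working directory.
    return any(Path(entry or ".").resolve() == target for entry in sys.path)


# Example: should print True after PYTHONPATH has been set correctly.
# print(root_on_path("/home/user/myproject"))
```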
Running a model
Please note that you either have to run everything without a logger, by adding logger=none to every command, or configure logging to your own wandb project or another logging method. For the latter, please see the section on logging.
Train from scratch
To run the model, edit the main config to fit what you want to run. Running the run.py script without any overrides uses the main config.
The configs folder serves as a blueprint, listing all the modules available. To get a better understanding of our codebase's structure, please refer to the section on Structure.
# starting inside the emulator folder:
python emulator/run.py logger=none # will run with configs/main_config.yml
If you get an error such as "No supported gpu backend found!", install CUDA and then install a CUDA-enabled build of torch that matches the CUDA version you installed (the exact command differs between Linux and Windows), for example:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
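After reinstalling, you can confirm that torch sees a GPU with a short check such as the following sketch. It only uses torch.cuda.is_available() and torch.cuda.get_device_name(); the function name is made up for illustration.

```python
def gpu_status() -> str:
    """Report whether PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if torch.cuda.is_available():
        # Device 0 is the first visible GPU.
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return f"torch {torch.__version__} is installed, but no CUDA device is visible"


# Example:
# print(gpu_status())
```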
To execute one of the preset experiments or to run your own experiments you can create and pass on experiment configs:
python emulator/run.py experiment=test # will run whatever is specified by the configs/experiment/test.yml file
You can make use of the experiment template.
Reproducing experiments
We provide some experiment configs in emulator/configs/experiment to recreate some of our models.
We ran 3 different configurations of experiments:
- single emulator: A single ML-model - climate-model pairing.
- finetuning emulator: A single ML-model that was pre-trained on one climate model and fine-tuned on another.
- super emulator: A single ML-model that was trained on multiple climate models.
Single Emulator
Here are some examples to recreate single emulator experiments for NorESM2-LM.
python emulator/run.py experiment=single_emulator/unet/NorESM2-LM_unet_tas+pr_run-01.yaml logger=none seed=3423
This will train the U-Net model on the NorESM2-LM dataset. To change some of the parameters of the experiment, you can use hydra to override them. For example, to run with a different experiment seed:
python emulator/run.py experiment=single_emulator/unet/NorESM2-LM_unet_tas+pr_run-01.yaml logger=none seed=22201
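If you want to repeat the same experiment over several seeds, the override can be scripted. The following is a minimal sketch that simply shells out to the command shown above once per seed; the helper names are hypothetical and it assumes it is started from the same directory as the manual command.

```python
import subprocess
import sys
from typing import List, Sequence

EXPERIMENT = "single_emulator/unet/NorESM2-LM_unet_tas+pr_run-01.yaml"


def build_command(experiment: str, seed: int, logger: str = "none") -> List[str]:
    """Assemble the run.py invocation with key=value overrides."""
    return [
        sys.executable,
        "emulator/run.py",
        f"experiment={experiment}",
        f"logger={logger}",
        f"seed={seed}",
    ]


def run_seeds(experiment: str, seeds: Sequence[int]) -> None:
    """Run the experiment once per seed, stopping at the first failure."""
    for seed in seeds:
        subprocess.run(build_command(experiment, seed), check=True)


# Example:
# run_seeds(EXPERIMENT, [3423, 22201])
```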
For running experiments with other models, here are some example commands:
python emulator/run.py experiment=single_emulator/climax/NorESM2-LM_climax_tas+pr_run-01.yaml logger=none seed=3423
python emulator/run.py experiment=single_emulator/climax_frozen/NorESM2-LM_climax_tas+pr_run-01.yaml logger=none seed=3423
python emulator/run.py experiment=single_emulator/convlstm/NorESM2-LM_convlstm_tas+pr_run-01.yaml logger=none seed=3423
For the climax and climax_frozen models, use the separate environment created from requirements_climax.txt.
If you run into a RuntimeError('Numpy is not available') with ClimaX, you must downgrade numpy to a version below 2, since the ClimaX module is compiled with numpy < 2. You can do this in your ClimaX emulator environment via:
pip install --upgrade numpy==1.26.4
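To verify which numpy version the ClimaX environment ends up with, a check like the one below can help. It is a sketch with made-up helper names and only inspects the major version number.

```python
def is_numpy_below_2(version: str) -> bool:
    """Return True if a NumPy version string has a major version below 2."""
    major = int(version.split(".")[0])
    return major < 2


def installed_numpy_ok() -> bool:
    """Check the NumPy installed in the current environment, if any."""
    try:
        import numpy
    except ImportError:
        return False
    return is_numpy_below_2(numpy.__version__)


# Example: should print True inside a correctly set up ClimaX environment.
# print(installed_numpy_ok())
```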
Finetuning
For the single-emulator experiments, we provide configs for each ML model in emulator/configs/experiment/single_emulator. For the fine-tuning experiments, the configs can be found in emulator/configs/experiment/finetuning_emulator.
For fine-tuning, fill in pretrained_run_id and pretrained_ckpt_dir in the config files so that the experiment resumes from the pre-trained run.
An example command for finetuning would look like this:
python emulator/run.py experiment=finetuning_emulator/climax/NorESM2-LM_FGOALS-f3-L_climax_tas+pr_run-01.yaml seed=3423 logger=none
Superemulator
For the superemulation experiments, we provide the configs of our experiments in emulator/configs/experiment/superemulator. Note that here, data loading is slightly adapted to the super emulator infrastructure and a decoder must be set.
An example command to run a superemulator experiment is shown below.
To run a different experiment, replace the config name with the respective model name (for example superemulator_climax_frozen_tas+pr_run-02); see emulator/configs/experiment/superemulator.
python emulator/run.py experiment=superemulator/superemulator_climax_tas+pr_run-02.yaml seed=3423 logger=none
Reloading our trained models
We provide some of our trained models from the experiments, currently only from the superemulator experiments and the single_emulator experiments performed on NorESM2-LM data. Checkpoints from all our models add up to a large amount of data and we are still working on making these available in a practical fashion. Please reach out to us if you wish to obtain pre-trained checkpoints from any other experiment not included in this subset.
Downloading pre-trained checkpoints
All our pre-trained models for the paper are hosted on HuggingFace. Please refer to the documentation there to download either all pre-trained models or only pick checkpoints for a spe
