Clusterduck
clusterduck is a hydra launcher plugin for running jobs in batches on a SLURM cluster. It is intended for small tasks on clusters where jobs have exclusive access to a node, such that submitting a single task to a node would be wasteful.
Installation
Install clusterduck with:
pip install .
Developers should note that Hydra plugins are not compatible with the new PEP 660-style editable installs. In order to perform an editable install, either use compatibility mode:
pip install -e . --config-settings editable_mode=compat
or use strict editable mode:
pip install -e . --config-settings editable_mode=strict
Be aware that strict mode installs do not expose new files created in the project until the installation is performed again.
Examples
The example script requires a few additional dependencies. Install with:
pip install ".[examples]"
To run the example script locally, e.g. looping over both model types twice each, use:
python example/train.py --multirun model=convnet,transformer +iteration="range(2)"
To run the example script with the submitit backend but locally without a cluster, specify the platform like this:
python example/train.py --multirun model=convnet,transformer +iteration="range(2)" +platform=slurm_debug
To run the example script on the HoreKa cluster, use:
python example/train.py --multirun model=convnet,transformer +iteration="range(2)" +platform=horeka
Configuration Options
This plugin is heavily inspired by the hydra-submitit-launcher plugin, and provides all parameters of that original plugin. See their documentation for details about those parameters.
Both plugins rely on submitit for the real heavy lifting. See their documentation for more information.
Additional Parameters
The following parameters are added by this plugin:
We refer to a hydra job, i.e. one execution of the hydra main function with a set of overrides, as a run, to differentiate it from both jobs and tasks as defined by SLURM.
- parallel_runs_per_node: The number of parallel executions per node, i.e. the number of experiments that will run simultaneously in a single SLURM job. This will depend on the resources available in a node.
- total_runs_per_node: The total number of executions per node, i.e. the number of experiments that will run in a single SLURM job. This will depend on the duration of a run, the parallel_runs_per_node setting, and the time limit you set for the job in SLURM. If not specified, all executions will be run in a single job, although only parallel_runs_per_node of them will be running at any given time.
- wait_for_completion: If set to true, the launcher will keep running on your login node until all SLURM jobs have completed before exiting. Otherwise it will submit the SLURM jobs into the queue and then exit.
- resources_config: Any resources that must be divided up among the parallel runs within a SLURM job. The following resources are currently configurable:
  - cpu: Allocates CPUs evenly across parallel runs. The optional argument cpus specifies the CPU ids available to the job; leave it blank to auto-detect.
  - cuda: Allocates GPUs for CUDA (e.g. PyTorch, TensorFlow, JAX) evenly across parallel runs. The optional argument gpus specifies the GPU ids available to the job; leave it blank to auto-detect.
  - rendering: Allocates GPUs for headless rendering with EGL evenly across parallel runs. This is useful e.g. for image-based training in MuJoCo environments, SOFA environments, or headless rendering with pyglet. The optional argument gpus specifies the GPU ids available to the job; leave it blank to auto-detect.
  - stagger: Delays the start of each run by the specified number of seconds. This can be useful if you want to avoid starting all runs at the same time, e.g. to avoid overloading the file system. The argument delay specifies the delay in seconds.
- verbose: If set to true, additional debug information will be printed to the SLURM job log (related to scheduling runs within a job and allocating resources) and to each hydra run log (related to setting up the resources for that run). If you are having difficulties with the plugin, setting this to true might help you understand what is going on.
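Conceptually, the cuda and rendering options slice the available GPU ids evenly among the parallel run slots. The following is a minimal sketch of that idea (a hypothetical helper for illustration, not the plugin's actual implementation):

```python
# Hypothetical sketch (NOT clusterduck's actual code): divide GPU ids
# evenly among parallel run slots, as the `cuda`/`rendering` options do.
def split_gpus(gpus, parallel_runs):
    """Round-robin assignment of GPU ids to parallel run slots."""
    return [gpus[i::parallel_runs] for i in range(parallel_runs)]

# With 4 GPUs and 2 parallel runs, each run slot gets 2 GPUs:
assignment = split_gpus([0, 1, 2, 3], 2)
print(assignment)  # [[0, 2], [1, 3]]

# Each run would then see only its slice, e.g. by setting
# CUDA_VISIBLE_DEVICES to ",".join(map(str, assignment[slot_index])).
```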
Here is an example of a hydra/launcher config for HoreKa that uses some of the above options:
hydra:
  launcher:
    # launcher/cluster specific options
    timeout_min: 5
    partition: accelerated
    gres: gpu:4
    setup:
      # Create wandb folder in fast, job-local storage: https://www.nhr.kit.edu/userdocs/horeka/filesystems/#tmpdir
      # NOTE: wandb folder will be deleted after job completion, but by then it will have synced with server
      - export WANDB_DIR=$TMPDIR/wandb
      - mkdir -pv $WANDB_DIR
      - export WANDB_CONSOLE=off
    # clusterduck specific options
    parallel_runs_per_node: 4
    total_runs_per_node: 8
    resources_config:
      cpu:
      cuda:
      rendering:
      stagger:
        delay: 5
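With the stagger settings in the config above (delay: 5 and 4 parallel runs), run starts are spaced out as in this small sketch (hypothetical helper name, not the plugin's API):

```python
# Hypothetical sketch of the stagger option (not clusterduck's actual code):
# run slot i starts i * delay seconds after the first one.
def staggered_start_offsets(n_runs, delay):
    """Start offset in seconds for each parallel run slot."""
    return [i * delay for i in range(n_runs)]

# With 4 parallel runs and delay: 5, starts are spread over 15 seconds:
print(staggered_start_offsets(4, 5))  # [0, 5, 10, 15]
```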
Take a further look into the example folder for a working example with multiple configurations.
Development
PyCUDA is a helpful tool for working with CUDA devices outside the context of a machine learning library like PyTorch. We recommend installing it with conda:
conda install pycuda
Install additional requirements for development using:
pip install ".[all]"
Other Sweepers
clusterduck plays nicely with other Hydra sweeper plugins, for example Optuna.
You can find a small example of how to use clusterduck with Optuna in example/conf/optim/optuna.yaml.
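As a rough orientation, a Hydra sweeper config of this kind typically looks like the following sketch. Field names follow the hydra-optuna-sweeper documentation; the actual example/conf/optim/optuna.yaml in this repository may differ:

```yaml
# Hypothetical sketch of an Optuna sweeper config, not the repository's file.
defaults:
  - override /hydra/sweeper: optuna

hydra:
  sweeper:
    direction: minimize   # optimize the value returned by the hydra main function
    n_trials: 20          # total number of runs the sweeper launches
    n_jobs: 4             # runs evaluated in parallel
```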
To run the example, install the additional dependencies with:
pip install hydra-optuna-sweeper
To run the example with the default Hydra launcher, run:
python example/train.py +optim=optuna
To run the example with clusterduck, run:
python example/train.py +optim=optuna_clusterduck +platform=slurm_debug