<h1 align="center"><img src="docs/images/minimax_logo.png" width="60%" /></h1> <h3 align="center"><i>Efficient baselines for autocurricula in JAX</i></h3> <p align="center"> <a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/license-Apache2.0-blue.svg"/></a> <a href="https://pypi.python.org/pypi/minimax-lib"><img src="https://badge.fury.io/py/minimax-lib.svg"/></a> <a href= "https://drive.google.com/drive/folders/15Vi7OsY6OrVaM5ZnY3Bt7J-0s5o_KV9b?usp=drive_link"><img src="https://colab.research.google.com/assets/colab-badge.svg"/></a> <a href="https://arxiv.org/abs/2311.12716"><img src="https://img.shields.io/badge/arXiv-2311.12716-b31b1b.svg"/></a> </p>

Why minimax?
- Hardware-accelerated baselines
Install
Quick start
Dive deeper
Environments
- Supported environments
- Adding environments
Agents
Roadmap
License
Citation

🐢 Why `minimax`?

Unsupervised Environment Design (UED) is a promising approach to generating autocurricula for training robust deep reinforcement learning (RL) agents. However, existing implementations of common baselines require excessive amounts of compute. In some cases, experiments can require more than a week to complete using V100 GPUs. This long turn-around slows the rate of research progress in autocuriculum methods. minimax provides fast, JAX-based implementations of key UED baselines, which are based on the concept of minimax regret. By making use of fully-tensorized environment implementations, minimax baselines are fully-jittable and thus take full advantage of the hardware acceleration offered by JAX. In timing studies done on V100 GPUs and Xeon E5-2698 v4 CPUs, we find minimax baselines can run over 100x faster than previous reference implementations, like those in facebookresearch/dcd.

All autocurriculum algorithms implemented in minimax support multi-device training, which can be activated through a single command line flag. Using multiple devices for training can lead to further speed ups and allows scaling these autocurriculum methods to much larger batch sizes.

🐇 Hardware-accelerated baselines

minimax includes JAX-based implementations of

Additionally, minimax includes two new variants of PLR and ACCEL that further reduce wall time by better leveraging the massive degree of environment parallelism enabled by JAX:

Parallel PLR (PLR$^{||}$)
Parallel ACCEL (ACCEL$^{||}$)

In brief, these two new algorithms collect rollouts for new level evaluation, level replay, and, in the case of Parallel ACCEL, mutation evaluation, all in parallel (i.e. rather than sequentially, as done by Robust PLR and ACCEL). As a simple example for why this parallelization improves wall time, consider how Robust PLR with replay probability of 0.5 would require approximately 2x as many rollouts in order to reach the same number of RL updates as a method like DR, because updates are only performed on rollouts based on level replay. Parallelizing level replay rollouts alongside new level evaluation rollouts by using 2x the environment parallelism reduces the total number of parallel rollouts to equal the total number of updates desired, thereby matching the 1:1 rollout to update ratio of DR. The diagram below summarizes this difference.

Parallel DCD overview

minimax includes a fully-tensorized implementation of a maze environment that we call AMaze. This environment exactly reproduces the MiniGrid-based mazes used in previous UED studies in terms of dynamics, reward function, observation space, and action space, while running many orders of magnitude faster in wall time, with increasing environment parallelism.

🛠️ Install

Use a virtual environment manager like conda or mamba to create a new environment for your project:

conda create -n minimax
conda activate minimax

Install minimax via either pip install minimax-lib or pip install ued.
That's it!

⚠️ Note that to enable hardware acceleration on GPU, you will need to make sure to install the latest version of jax>=0.4.19 and jaxlib>=0.4.19 that is compatible with your CUDA driver (requires minimum CUDA version of 11.8). See the official JAX installation guide for detailed instructions.

🏁 Quick start

The easiest way to get started is to play with the Python notebooks in the examples folder of this repository. We also host Colab versions of these notebooks:

DR [IPython, Colab]
PAIRED [IPython, Colab]
PLR and ACCEL*: [IPython, Colab]

*Depending on how the top-level flags are set, this notebook runs PLR, Robust PLR, Parallel PLR, ACCEL, or Parallel ACCEL.

minimax comes with high-performing hyperparameter configurations for several algorithms, including domain randomization (DR), PAIRED, PLR, and ACCEL for 60-block mazes. You can train using these settings by first creating the training command for executing minimax.train using the convenience script minimax.config.make_cmd:

python -m minimax.config.make_cmd --config maze/[dr,paired,plr,accel] | pbcopy,

followed by pasting and executing the resulting command into the command line.

See the docs for minimax.config.make_cmd to learn more about how to use this script to generate training commands from JSON configurations. You can browse the available JSON configurations for various autocurriculum methods in the configs folder.

Note that when logging and checkpointing are enabled, the main minimax.train script outputs this data as logs.csv and checkpoint.pkl respectively in an experiment directory located at <log_dir>/<xpid>, where log_dir and xpid are arguments specified in the command. You can then evaluate the checkpoint by using minimax.evaluate:

python -m minimax.evaluate \
--seed 1 \
--log_dir <absolute path log directory> \
--xpid_prefix <select checkpoints with xpids matching this prefix> \
--env_names <csv string of test environment names> \
--n_episodes <number of trials per test environment> \
--results_path <path to results folder> \
--results_fname <filename of output results csv>

🪸 Dive deeper

minimax system diagram

Training

The main entry for training is minimax.train. This script configures the training run based on command line arguments. It constructs an instance of ExperimentRunner to manage the training process on an update-cycle basis: These duties include constructing and delegating updates to an appropriate training runner for the specified autocurriculum algorithm and conducting logging and checkpointing. The training runner used by ExperimentRunner executes all autocurriculum-related logic. The system diagram above describes how these pieces fit together, as well as how minimax manages various, hierarchical batch dimensions.

Currently, minimax includes training runners for the following classes of autocurricula:

| Runner | Algorithm class | --train_runner | | -------------- | ------------------------------------------------------------- | -------------------- | | DRRunner | Domain randomization | dr | | PLRRunner | Replay-based curricula, including ACCEL | plr | | PAIREDRunner | Curricula via a co-adapting teacher environment design policy | paired |

The below table summarizes how various autocurriculum methods map to these runners and the key arguments that must be set differently from the default settings in order to switch the runner's behavior to each method.

| Algorithm | Reference | Runner | Key args | | ----------------- | --------------------------------------------------------------- | -------------- | ------------------------------------------------------------------------------ | | DR | Tobin et al, 2019 | DRRunner | – | | Minimax adversary | [Dennis et

Minimax

Install / Use

README

Contents

🐢 Why `minimax`?

🐇 Hardware-accelerated baselines

🛠️ Install

🏁 Quick start

🪸 Dive deeper

Training

Minimax

Install / Use

README

Contents

🐢 Why minimax?

🐇 Hardware-accelerated baselines

🛠️ Install

🏁 Quick start

🪸 Dive deeper

Training

🐢 Why `minimax`?