RewardConditionedUDRL
Open source code combining implementations of Upside Down Reinforcement Learning and Reward Conditioned Policies
Reward Conditioned Policies / Upside Down Reinforcement Learning
This is an open source library that replicates the results (except for RCP-A) from two papers, Reward Conditioned Policies (RCP) and Training Agents using Upside-Down Reinforcement Learning (UDRL), neither of which released an implementation.
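At the core of both papers is a policy conditioned on a command such as a desired return and desired horizon. A minimal PyTorch sketch of that idea, with hypothetical class and argument names (not this repo's actual API):

```python
import torch
import torch.nn as nn

class BehaviorFunction(nn.Module):
    """Sketch of a UDRL-style command-conditioned policy.

    Maps (state, desired_return, desired_horizon) to action logits.
    Names and sizes are illustrative only.
    """

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        # The command (desired return, desired horizon) is concatenated
        # onto the state before the policy network.
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, desired_return, desired_horizon):
        command = torch.stack([desired_return, desired_horizon], dim=-1)
        return self.net(torch.cat([state, command], dim=-1))

policy = BehaviorFunction(state_dim=8, n_actions=4)  # LunarLander-like sizes
logits = policy(torch.zeros(1, 8), torch.tensor([200.0]), torch.tensor([300.0]))
```

Training then reduces to supervised learning: replayed (state, command) pairs are labeled with the actions actually taken.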
State of the Codebase:

Example rollout of agent trained using UDRL for 540 epochs (54000 gradient update steps).

This codebase works for LunarLander in that the agent learns to achieve a high score across different random seeds. For UDRL, thanks to Rupesh Srivastava, first author of Training Agents using Upside-Down Reinforcement Learning, for his helpful correspondence in reproducing the paper's results robustly. For RCP, I believe the RCP-R results have been replicated, but the RCP-A results show higher variance across seeds than in the paper. After brief correspondence with the authors of RCP, I remain unable to identify the bug or discrepancy in my code that leads to these differences in performance.
Performance Comparisons:
UDRL in LunarLander environment in this codebase:

The x-axis is the number of gradient updates (100 per epoch), with 20 rollouts per epoch. If each rollout is approximately 300 environment steps, multiply the x-axis by (20*300)/100 = 60 to get environment steps.
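As a sanity check, the conversion works out as follows (300 steps per rollout is the approximation stated above):

```python
# Convert gradient updates on the x-axis into environment steps.
grad_updates_per_epoch = 100
rollouts_per_epoch = 20
steps_per_rollout = 300  # approximate, per the note above

env_steps_per_grad_update = (rollouts_per_epoch * steps_per_rollout) / grad_updates_per_epoch
print(env_steps_per_grad_update)  # 60.0
```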
NB: I had other code running during this job, so it should actually take closer to 10 hours.
Figure from paper:
The paper averages over 20 random seeds (5 in my case). My results appear to replicate it.
UDRL in Sparse-LunarLander:

Figure from paper:
The paper averages over 20 random seeds (5 in my case). My results appear to replicate it.
RCP-R with exponential weighting:

Figure from paper:

Average of 5 seeds. The x-axes are directly comparable here. Performance seems to match that of RCP-R and, if anything, looks somewhat more stable.
RCP-A with exponential weighting:

Figure from paper:

Average of 5 seeds. The x-axes are directly comparable here. Performance either matches the paper's RCP-A figure (seeds 27 and 28) or does much worse.
I hope that by open-sourcing this codebase, the RL community will be able to improve upon it and collectively succeed in replicating RCP-A. If I had to guess, the problem is either a fundamental bug I have completely missed or, more likely, some small but crucial implementation detail such as a form of normalization or a particular hyperparameter setting, e.g. the beta value used in RCP.
Other Implementations:
There are already a few other implementations of Upside Down Reinforcement Learning (UDRL) online, but they either do not work or are very seed-sensitive (see issues I have raised, such as here and here). This codebase is not only more robust and capable of running multiple seeds in parallel for UDRL, but is also the first implementation of Reward Conditioned Policies, with both implementations unified in a single codebase (you can also easily mix and match components of each).
Relevant Scripts:
All experiments can be run on your local computer using CPUs, with runtimes of between five and fifteen hours depending on settings. Parallel processing is implemented so that multiple seeds can run at the same time, with each seed using one CPU. Because of the PyTorch Lightning and Ray Tune implementations, scaling up to more CPUs and GPUs is easy.
- train.py - has almost all of the relevant configuration settings for the code. Also starts either Ray Tune (for hyperparameter optimization) or a single model (for debugging). Able to switch between different model and learning types in a modular fashion.
- bash_train.sh - uses GNU Parallel to run multiple seeds of a model. NB: currently you need to make the final directory of experiment results before running (else each worker will try to make the same directory).
- lighting-trainer.py - meat of the code. Uses PyTorch Lightning for training.
- control/agent.py - runs rollouts of the environment and processes their rewards.
- envs/gym_params.py - provides environment-specific parameters. NB! Take a careful look at these for what you are trying to do. For example, the average episode length is used for the UDRL buffer size so that it can store approximately the right number of rollouts.
- exp_dir/ - contains all experiments separated by: environment_name/experiment_name/algorithm-implementation/seed/logged_versions.
- models/upsd_model.py - contains the upside-down models for Reward Conditioned Policies and Training Agents using Upside-Down Reinforcement Learning.
- models/advantage_model.py - model to learn the advantage of actions, used for RCP-A.
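As background for the buffer-size note above: UDRL maintains a replay buffer of the highest-return episodes seen so far and samples exploratory commands from the best of them. A simplified pure-Python sketch of that mechanism (names are hypothetical, and the command-sampling rule is an approximation of the paper's, not this repo's exact code):

```python
import random

class TopKEpisodeBuffer:
    """Keep only the max_size highest-return episodes (UDRL-style)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.episodes = []  # list of (total_return, episode_length)

    def add(self, total_return, episode_length):
        self.episodes.append((total_return, episode_length))
        # Sort descending by return and truncate to the buffer size.
        self.episodes.sort(key=lambda e: e[0], reverse=True)
        del self.episodes[self.max_size:]

    def sample_command(self, n_best=5):
        """Sample an exploratory (desired_return, desired_horizon) command
        from the best stored episodes: aim somewhere between the mean and
        the max of the top returns, with the mean of their lengths."""
        best = self.episodes[:n_best]
        returns = [e[0] for e in best]
        mean_ret = sum(returns) / len(returns)
        mean_len = sum(e[1] for e in best) / len(best)
        desired_return = mean_ret + random.random() * (max(returns) - mean_ret)
        return desired_return, mean_len
```

Sizing max_size from the environment's average episode length (as gym_params.py does) keeps roughly a fixed number of rollouts in memory.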
Dependencies:
Tested with Python 3.7.5 (should work with Python 3.5 and higher).
Install PyTorch 1.7.0 (with or without CUDA depending on whether you have a GPU): https://pytorch.org/get-started/locally/
If using Pip out of the box use:
pip3 install -r RewardConditionedUDRL/requirements.txt
If using Conda, ensure pip is installed within your conda environment and then run the same command above.
If you want to test running with multiple seeds then install GNU Parallel:
sudo apt install parallel
Running the code:
To run a single model on LunarLander with the UDRL implementation, call:
python trainer.py --implementation UDRL --gamename lunarlander \
--exp_name debug \
--num_workers 1 --seed 25
The implementations are UDRL (Upside-Down RL), RCP-R, and RCP-A (Reward Conditioned Policies with rewards and advantages, respectively). For RCP, the default uses exponential weighting of rewards rather than advantages.
For RCP, exponential weighting is turned on by default; use the flag --no_expo_weighting to turn it off. The beta weighting value is set inside trainer.py.
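For context, RCP's exponential weighting turns imitation into return-weighted regression: each sample's contribution to the policy loss is scaled by exp(value / beta). A hedged pure-Python sketch with illustrative numbers (the actual beta and loss code live in trainer.py and may differ in detail):

```python
import math

def exponential_weights(values, beta):
    """Weight each sample by exp(value / beta), normalized to sum to 1.

    'values' would be episode returns (RCP-R) or per-step advantage
    estimates (RCP-A). Subtracting the max before exponentiating avoids
    overflow; smaller beta makes high-value samples dominate more sharply.
    """
    m = max(values)
    w = [math.exp((v - m) / beta) for v in values]
    total = sum(w)
    return [x / total for x in w]

# Higher-return samples get exponentially larger weight in the policy loss.
weights = exponential_weights([10.0, 100.0, 200.0], beta=50.0)
```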
The environments currently supported are lunarlander and lunarlander-sparse, where the sparse version gives all of the reward at the very end of the episode.
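Conceptually, the sparse variant behaves like a wrapper that withholds reward until the episode ends. A minimal sketch assuming the classic Gym step signature of (obs, reward, done, info); the repo's actual implementation may differ:

```python
class SparseRewardWrapper:
    """Accumulate reward and emit it only on the final step of an episode."""

    def __init__(self, env):
        self.env = env
        self._total = 0.0

    def reset(self):
        self._total = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._total += reward
        # Zero reward mid-episode; the full accumulated reward at the end.
        sparse = self._total if done else 0.0
        return obs, sparse, done, info
```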
To run multiple seeds call bash bash_train.sh changing the trainer.py settings and experiment name as is desired.
To run Ray hyperparameter tuning, uncomment the ray.tune() calls for the hyperparameters you want to search over and use the flag --use_tune.
Model checkpointing is turned on by default and will save a new version of the model whenever the policy loss declines (defined at approximately line 194 of trainer.py). To turn off model checkpointing, add the flag --no_checkpoint.
Multi-seed training: to run multiple seeds in parallel, modify the code in bash_train.sh, which uses GNU Parallel to run multiple seeds with each seed getting its own core. The default runs seeds 25 to 29 inclusive. NB: currently you need to make the final directory of experiment results before running (else each worker will try to make the same directory).
To run more seeds, it is advised to use Ray Tune; approximately line 181 of trainer.py can be used to define the seeds to try. Ray Tune reports the performance of each agent but currently lacks as granular information during training.
To record episodes during training, use the flag --recording_epoch_interval. Every such interval of epochs, record_n_rollouts_per_epoch rollouts (set in the config dict, default = 1) will be saved out. However, to do this you either need to run a single seed on your local computer or have xvfb installed on your server (see below for in-depth instructions on how to re-install GPU drivers that incorporate xvfb). The alternative is to ensure model checkpointing is turned on and render your saved models after training using the --eval_agent flag, providing it with the path to the trained model.
Model checkpointing is on by default and saves the model that achieves the best mean rollout reward.
Evaluating Training:
All training results along with important metrics are saved out to Tensorboard. To view them call:
tensorboard --logdir RewardConditionedUDRL/exp_dir/*ENVIRONMENT_NAME*/*EXPERIMENT_NAME*/*IMPLEMENTATION-NAME*/
If you ran python trainer.py as in the example above (using seed 25 in the lunarlander environment with the debug experiment name), the output can be seen by calling:
tensorboard --logdir RewardConditionedUDRL/exp_dir/lunarlander/debug/UDRL/seed_25/logger/ and going to the generated URL. (Don't add the seed to the path if you want to view results across seeds.)
To visualize the performance of a trained model, locate the model's checkpoint which will be under: exp_dir/*ENVIRONMENT_NAME*/*EXPERIMENT_NAME*/*SEED*/epoch=*VALUE*.ckpt and use the flag --eval_agent exp_dir/*ENVIRONMENT_NAME*/*EXPERIMENT_NAME*/*IMPLEMENTATION-NAME*/*SEED*/epoch=*VALUE*.ckpt.
Running Different Environments:
TODOs:
Nice to haves that either I (or you, reader!) will implement.
- Update the PyTorch Lightning loggers to track eval_mean instead of the policy loss. https://github.com/PyTorchLightning/pytorch-lightning/issues/4584#issuecomment-724185789
- Make bash_train.sh create the experiment results directory so each worker doesn't try to make the same one.
