Dora The Explorer, a friendly experiment manager


<p align="center"> <img src="./dora.png" alt="Dora logo, picturing a schematic Dora in front of a computer." width="300px"></p> <p align="center"> <img src="./demo.png" alt="A demo of a Dora grid search" width="1000px"></p>

Table of Contents

Installation

# For bleeding edge
pip install -U git+https://github.com/facebookincubator/submitit@main#egg=submitit
pip install -U git+https://git@github.com/facebookresearch/dora#egg=dora-search

# For stable release
pip install -U dora-search

What's up?

See the changelog for details on releases.

  • 2022-06-09: version 0.1.10: added HiPlot support! Updated PL support, many small fixes.
  • 2022-02-28: version 0.1.9
  • 2021-12-10: version 0.1.8: see the changelog, many small changes.
  • 2021-11-08: version 0.1.7: support for job arrays added.
  • 2021-10-20: version 0.1.6 released, bug fixes.
  • 2021-09-29: version 0.1.5 released.
  • 2021-09-07: added support for a git_save option. This will ensure that the project git is clean and make a clone from which the experiment will run. This does not apply to dora run for easier debugging (but you can force it with --git_save).
  • 2021-06-21: added support for Hydra 1.1. Be very careful if you update to Hydra 1.1, there are some non backward compatible changes in the way group config are parsed, see the Hydra release notes for more information.

(FB Only) If you are using Dora and want to receive updates on bug fixes and new versions, ping me (@defossez) on Workchat.

Introduction

Dora is an experiment launching tool which provides the following features:

  • Grid search management: automatic scheduling and canceling of the jobs to match what is specified in the grid search files. Grid search files are pure Python, and can contain arbitrary loops, conditions etc.
  • Deduplication: experiments are assigned a signature based on their arguments. If you ask twice for the same experiment to be run, it won't be scheduled twice, but merged into the same run. If your code handles checkpointing properly, any previous run will be automatically resumed.
  • Monitoring: Dora supports basic monitoring from inside the terminal. You can customize the metrics to display in the monitoring table, and easily track progress, and compare runs in a grid search.
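The deduplication above rests on hashing a run's arguments into a signature (Dora's real scheme hashes the delta against the defaults). A simplified, self-contained illustration of argument-based signatures, not Dora's actual implementation:

```python
import hashlib
import json

def signature(args: dict, length: int = 8) -> str:
    """Derive a short, deterministic signature from a run's arguments.

    Sorting the keys makes the hash independent of argument order, so
    the same hyper-parameters always map to the same experiment.
    """
    payload = json.dumps(args, sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()[:length]

# Identical arguments (in any order) collapse to one signature...
assert signature({"lr": 0.1, "bs": 64}) == signature({"bs": 64, "lr": 0.1})
# ...while any change in a value yields a different experiment.
assert signature({"lr": 0.1, "bs": 64}) != signature({"lr": 0.2, "bs": 64})
```

Because the signature is a pure function of the arguments, scheduling the "same" experiment twice resolves to the same folder, logs and checkpoints.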

Some Dora concepts:

  • A Grid is a python file with an explorer function, wrapped in a dora.Explorer. The explorer function takes a dora.Launcher as argument. Call the dora.Launcher repeatedly with different sets of hyper-parameters to schedule different experiments.
  • An XP is a specific experiment. Each experiment is defined by the arguments passed to the underlying experimental code, and is assigned a signature based on those arguments, for easy deduplication.
  • A signature is the unique XP identifier, derived from its arguments. You can use the signature to uniquely identify the XP across runs, and easily access logs, checkpoints etc.
  • A Sheep is the association of a Slurm/Submitit job, and an XP. Given an XP, it is always possible to retrieve the last Slurm job that was associated with it.
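To make the Grid concept concrete, here is a sketch of an explorer function's body. In a real grid file the decorator and Launcher come from dora; to keep this snippet self-contained, a hypothetical FakeLauncher stand-in records the scheduled hyper-parameter sets instead of submitting jobs:

```python
# Stand-in for dora.Launcher: it only records calls, where the real one
# would schedule (or deduplicate) a Slurm job per hyper-parameter set.
class FakeLauncher:
    def __init__(self):
        self.scheduled = []

    def __call__(self, **kwargs):
        self.scheduled.append(kwargs)

def explorer(launcher):
    # Grid files are pure Python: arbitrary loops and conditions are fine.
    for lr in (1e-4, 3e-4):
        for batch_size in (32, 64):
            launcher(lr=lr, batch_size=batch_size)

launcher = FakeLauncher()
explorer(launcher)
assert len(launcher.scheduled) == 4  # 2 learning rates x 2 batch sizes
```

In an actual grid file, each recorded call would correspond to one XP, with deduplication ensuring that re-running the grid does not schedule already-known experiments twice.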

Making your code compatible with Dora

In order to derive the XP signature, Dora must know about the configuration schema your project follows, as well as the parsed arguments for a run. Dora supports two backends for that: argparse and Hydra. On top of that, Dora provides a smooth integration with PyTorch Lightning for projects that use it.

In all cases, you must have a dedicated Python package (which we will call myproj here) with a train module in it (i.e. a myproj.train module, stored in the myproj/train.py file).

The train.py file must contain a main function that is properly decorated, as explained hereafter.

Argparse support

Here is a template for the train.py file:

import argparse
from dora import argparse_main, get_xp

parser = argparse.ArgumentParser("mycode.train")
...


@argparse_main(
    dir="./where_to_store_logs_and_checkpoints",
    parser=parser,
    # Arguments to exclude from the signature; glob patterns such as
    # "can_be_pattern_*" are supported.
    exclude=["num_workers", "can_be_pattern_*", "log_*"],
    use_underscore=True,  # flags are --batch_size vs. --batch-size
    git_save=False,  # if True, scheduled experiments will run from a separate clone of the repo.
)
def main():
    # No need to reparse args, you can directly access them from the current XP
    # object.
    xp = get_xp()
    xp.cfg # parsed arguments
    xp.sig  # signature for the current run
    xp.folder  # folder for the current run, please put your checkpoint relative
               # to this folder, so that it is automatically resumed!
    xp.link  # link object, can send back metrics to Dora

    # If you load a previous checkpoint, always make sure that the Dora
    # link is consistent with the history stored in the checkpoint:
    # history = checkpoint['history']
    # xp.link.update_history(history)

    for t in range(10):
        xp.link.push_metrics({"loss": 1/(t + 1)})
    ...
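The exclude list in the decorator above accepts glob-style patterns such as log_*. A sketch of how such patterns can filter arguments out of a signature, using the standard fnmatch module (an illustration, not Dora's internal code):

```python
from fnmatch import fnmatch

def filter_for_signature(args: dict, exclude: list) -> dict:
    """Drop arguments matching any exclude pattern before hashing,
    so knobs like num_workers don't create "new" experiments."""
    return {
        name: value
        for name, value in args.items()
        if not any(fnmatch(name, pattern) for pattern in exclude)
    }

args = {"lr": 0.1, "num_workers": 8, "log_every": 50}
kept = filter_for_signature(args, exclude=["num_workers", "log_*"])
assert kept == {"lr": 0.1}  # only signature-relevant arguments remain
```

Excluding purely operational flags this way keeps the signature stable when you tweak, say, dataloader workers or logging frequency.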

Hydra support

The template for train.py:

from dora import hydra_main, get_xp


@hydra_main(
    config_path="./conf",  # path where the config is stored, relative to the parent of `mycode`.
    config_name="config"  # a file `config.yaml` should exist there.
)
def main(cfg):
    xp = get_xp()
    xp.cfg # parsed configuration
    xp.sig  # signature for the current run
    # Hydra run folder will automatically be set to xp.folder!

    xp.link  # link object, can send back metrics to Dora
    # If you load a previous checkpoint, you should always make sure
    # That the Dora Link is consistent with what is in the checkpoint with
    # history = checkpoint['history']
    # xp.link.update_history(history)

    for t in range(10):
        xp.link.push_metrics({"loss": 1/(t + 1)})
    ...
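The update_history comment in both templates matters when resuming: metrics pushed after a restart should extend the restored history, not start over. A minimal, hypothetical stand-in for the xp.link object illustrating that contract (not Dora's real Link class):

```python
# FakeLink mimics just enough of xp.link to show why restoring history
# from a checkpoint must happen before pushing new metrics.
class FakeLink:
    def __init__(self):
        self.history = []

    def update_history(self, history):
        # Replace the in-memory history with the checkpoint's version.
        self.history = list(history)

    def push_metrics(self, metrics):
        # Append one entry per training epoch/step.
        self.history.append(dict(metrics))

link = FakeLink()
link.update_history([{"loss": 1.0}, {"loss": 0.5}])  # restored from checkpoint
link.push_metrics({"loss": 0.33})                    # first step after resume
assert [m["loss"] for m in link.history] == [1.0, 0.5, 0.33]
```

Skipping the update_history step would make the monitoring table show a run that appears to restart from scratch even though training resumed correctly.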

You can customize dora behavior from the config.yaml file, e.g.

my_config: plop
num_workers: 40
logs:
    interval: 10
    level: info

dora:
    exclude: ["num_workers", "logs.*"]
    dir: "./outputs"
    git_save: true  # set git_save option for the project.
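The logs.* pattern in the dora.exclude list above matches dotted keys of the nested config. A self-contained sketch of flattening a nested config into dotted keys so such patterns can apply (illustrative only, not Dora's implementation):

```python
from fnmatch import fnmatch

def flatten(cfg: dict, prefix: str = "") -> dict:
    """Flatten a nested config into dotted keys, e.g. {"logs.level": ...}."""
    flat = {}
    for key, value in cfg.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=dotted + "."))
        else:
            flat[dotted] = value
    return flat

cfg = {"num_workers": 40, "logs": {"interval": 10, "level": "info"}}
flat = flatten(cfg)
excluded = [k for k in flat
            if any(fnmatch(k, p) for p in ["num_workers", "logs.*"])]
assert sorted(excluded) == ["logs.interval", "logs.level", "num_workers"]
```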

PyTorch Lightning support

Deprecated: Due to a lack of internal use for PL, this only works with fairly old versions of PL. We are not planning on updating the support for PL.

Dora supports PyTorch Lightning (PL) out of the box. Dora will automatically capture logged metrics (make sure to use per_epoch=True), and handles distribution (you should not pass gpus=... or num_nodes=... to PL).

import dora.lightning


@dora.argparse_main(...)
def main():
    xp = dora.get_xp()
    args = xp.cfg
    # Replace Pytorch lightning `Trainer(...)` with the following:
    trainer = dora.lightning.get_trainer(...)
    # Or when using argparse parsing:
    trainer = dora.lightning.trainer_from_argparse_args(args)

See examples/pl/train.py for a full example including automatic reloading of the last checkpoint, logging etc.

Important: Dora deactivates the default PL behavior of dumping a mid-epoch checkpoint upon preemption, as this leads to non-deterministic behavior (PL would skip that epoch upon restart). Dora assumes you save checkpoints from time to time (e.g. every epoch). To get back the old behavior, pass no_unfinished_epochs=False to get_trainer. See examples/pl/train.py for an example of how to implement checkpointing in a reliable manner.

Distributed training support (non PyTorch Lightning)

Dora supports distributed training, and makes a few assumptions for you. You should initialize distributed training through Dora, by calling in your main function:

import dora.distrib
dora.distrib.init()

Note: This is not required for PyTorch Lightning users, see the PL section above, everything will be set up automatically for you :)
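Distributed init typically reads rank and world-size information from environment variables that the launcher (Slurm/Submitit) exports. A simplified, framework-free sketch of that handshake, using the conventional RANK/WORLD_SIZE variable names as an assumption (dora.distrib.init performs the real process-group setup):

```python
import os

def read_distrib_env() -> tuple:
    """Read the rank and world size a launcher would export, defaulting
    to a single-process setup when the variables are absent."""
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return rank, world_size

# Simulate what a 8-process launcher would export for process #2.
os.environ["RANK"] = "2"
os.environ["WORLD_SIZE"] = "8"
assert read_distrib_env() == (2, 8)
```

Centralizing this in one init call is what lets the same training script run unchanged as a single local process or as one rank of a multi-node job.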

Git Save

You can set the git_save option on your project, see above for how to do it for either argparse or Hydra based projects. When this option is set, Dora makes an individual clone of your project repository for each experiment that is scheduled. The job will then run from that clean clone. This both keeps track of the exact code that was used for an experiment and prevents code changes from impacting pending or requeued jobs. If you reschedule a failed or cancelled job, the clone will however be updated with the current code.

In order to use this option, your code should be able to run from a fresh clone of the repository. If you need to access resources specified with a path relative to the original repo, use dora.to_absolute_path(). Note that this is similar to hydra.utils.to_absolute_path(). In fact, you can safely replace the Hydra version with this one: even when git_save is not set, the Dora one automatically falls back to the Hydra one (if Hydra is used).

The repository must be completely clean before scheduling experiments.
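A sketch of what a to_absolute_path helper does: resolve a path against the original project root rather than the per-experiment clone the job runs from. This is an illustration of the idea, not Dora's implementation:

```python
from pathlib import Path

def to_absolute_path(path: str, original_cwd: str) -> str:
    """Resolve `path` against the original working directory, so code
    running from a fresh clone still finds resources in the source repo.
    Absolute paths are returned unchanged."""
    p = Path(path)
    if p.is_absolute():
        return str(p)
    return str(Path(original_cwd) / p)

assert to_absolute_path("data/train.csv", "/home/me/proj") == "/home/me/proj/data/train.csv"
assert to_absolute_path("/tmp/x", "/home/me/proj") == "/tmp/x"
```

This is why relative resource paths keep working under git_save: they are anchored to where you scheduled the experiment from, not to the clone's working directory.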
