Dora The Explorer, a friendly experiment manager

This is a fork of facebookresearch/dora, as we diverged a bit from upstream.

Table of Contents

Installation
Introduction
Making your code compatible with Dora
The dora command
dora run: Running XP locally
dora launch: Launching XP remotely
dora info: Inspecting an XP
dora grid: Managing a grid search
The Dora API
Sharing XPs
Advanced configuration
FAQ
Contributing

Installation

# For bleeding edge
pip install -U git+https://github.com/facebookincubator/submitit@main#egg=submitit
pip install -U git+https://git@github.com/0x53504852/dora#egg=dora-search

# For stable release
pip install -U dora-search

What's up?

See the changelog for details on releases.

2022-06-09: version 0.1.10: added HiPlot support! Updated PL support, many small fixes.
2022-02-28: version 0.1.9
2021-12-10: version 0.1.8: see changelog, many small changes.
2021-11-08: version 0.1.7: support for job arrays added.
2021-10-20: version 0.1.6 released, bug fixes.
2021-09-29: version 0.1.5 released.
2021-09-07: added support for a git_save option. This will ensure that the project git is clean and make a clone from which the experiment will run. This does not apply to dora run for easier debugging (but you can force it with --git_save).
2021-06-21: added support for Hydra 1.1. Be very careful if you update to Hydra 1.1: there are some non-backward-compatible changes in the way group configs are parsed. See the Hydra release notes for more information.

(FB Only) If you are using Dora and want to receive updates on bug fixes and new versions, ping me (@defossez) on Workchat.

Introduction

Dora is an experiment launching tool which provides the following features:

Grid search management: automatic scheduling and canceling of the jobs to match what is specified in the grid search files. Grid search files are pure Python, and can contain arbitrary loops, conditions etc.
Deduplication: experiments are assigned a signature based on their arguments. If you ask twice for the same experiment to be run, it won't be scheduled twice, but merged into the same run. If your code handles checkpointing properly, any previous run will be automatically resumed.
Monitoring: Dora supports basic monitoring from inside the terminal. You can customize the metrics to display in the monitoring table, and easily track progress, and compare runs in a grid search.
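
To make the deduplication idea concrete, here is a minimal sketch of argument-based signatures. It is not Dora's actual hashing scheme: the `toy_signature` and `schedule` helpers, the SHA-1 hash, and the 8-character truncation are all assumptions chosen for illustration.

```python
import hashlib
import json


def toy_signature(args: dict) -> str:
    """Hash the arguments into a short, stable identifier (illustrative only)."""
    # Sort keys so that argument order does not change the signature.
    payload = json.dumps(args, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:8]


scheduled = {}  # signature -> arguments of the run that owns it


def schedule(args: dict) -> str:
    """Register a run unless one with the same signature already exists."""
    sig = toy_signature(args)
    scheduled.setdefault(sig, args)
    return sig


first = schedule({"lr": 0.001, "batch_size": 32})
second = schedule({"batch_size": 32, "lr": 0.001})  # same arguments, other order
third = schedule({"lr": 0.01, "batch_size": 32})    # different learning rate
print(first == second, first == third, len(scheduled))
```

Requesting the same arguments twice yields the same signature, so only two runs end up registered for the three requests.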

Some Dora concepts:

A Grid is a Python file with an explorer function, wrapped in a dora.Explorer. The explorer function takes a dora.Launcher as argument. Call the dora.Launcher repeatedly with a set of hyper-parameters to schedule different experiments.
An XP is a specific experiment. Each experiment is defined by the arguments passed to the underlying experimental code, and is assigned a signature based on those arguments, for easy deduplication.
A signature is the unique XP identifier, derived from its arguments. You can use the signature to uniquely identify the XP across runs, and easily access logs, checkpoints etc.
A Sheep is the association of a Slurm/Submitit job, and an XP. Given an XP, it is always possible to retrieve the last Slurm job that was associated with it.
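
The shape of an explorer function can be sketched without Dora installed. In the example below, `ToyLauncher` is a hypothetical stand-in that only records what was requested; it is not dora.Launcher, and a real grid file would wrap the function in dora.Explorer as described above.

```python
import itertools


class ToyLauncher:
    """Hypothetical stand-in for a launcher: records each requested configuration."""

    def __init__(self):
        self.requested = []

    def __call__(self, **hparams):
        # A real launcher would schedule a job; here we only record the request.
        self.requested.append(dict(hparams))


def explorer(launcher):
    # Plain Python loops and conditions decide which experiments are requested.
    for lr, batch_size in itertools.product([1e-3, 1e-4], [32, 64]):
        if lr == 1e-4 and batch_size == 64:
            continue  # skip a combination we are not interested in
        launcher(lr=lr, batch_size=batch_size)


launcher = ToyLauncher()
explorer(launcher)
print(len(launcher.requested))
```

Because the grid is ordinary Python, adding or removing experiments is a matter of editing loops and conditions.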

Making your code compatible with Dora

In order to derive the XP signature, Dora must know about the configuration schema your project is following, as well as the parsed arguments for a run. Dora supports two backends for that: argparse and Hydra.

In all cases, you must have a specific Python package (which we will call mycode here) with a train module in it, i.e. a mycode.train module stored in the mycode/train.py file.

The train.py file must contain a main function that is properly decorated, as explained hereafter.

Argparse support

Here is a template for the train.py file:

import argparse
from dora import argparse_main, get_xp

parser = argparse.ArgumentParser("mycode.train")
...


@argparse_main(
    dir="./where_to_store_logs_and_checkpoints",
    parser=parser,
    exclude=["list_of_args_to_ignore_in_signature, e.g.", "num_workers",
             "can_be_pattern_*", "log_*"],
    use_underscore=True,  # flags are --batch_size vs. --batch-size
    git_save=False,  # if True, scheduled experiments will run from a separate clone of the repo.
)
def main():
    # No need to reparse args, you can directly access them from the current XP
    # object.
    xp = get_xp()
    xp.cfg  # parsed arguments
    xp.sig  # signature for the current run
    xp.folder  # folder for the current run, please put your checkpoint relative
               # to this folder, so that it is automatically resumed!
    xp.link  # link object, can send back metrics to Dora

    # If you load a previous checkpoint, always make sure that the Dora Link
    # is consistent with what is in the checkpoint:
    # history = checkpoint['history']
    # xp.link.update_history(history)

    for t in range(10):
        xp.link.push_metrics({"loss": 1/(t + 1)})
    ...
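
The template asks you to store checkpoints relative to xp.folder so that a requeued job resumes where it stopped. Here is a minimal sketch of that pattern using only the standard library. The `train` helper, the `checkpoint.json` file name, and the JSON format are assumptions for illustration, and a temporary directory stands in for xp.folder.

```python
import json
import tempfile
from pathlib import Path


def train(folder: Path, epochs: int) -> list:
    """Run up to `epochs` epochs, resuming from a checkpoint in `folder` if present."""
    checkpoint = folder / "checkpoint.json"  # hypothetical file name
    history = []
    if checkpoint.exists():
        # Restore the metrics already recorded by a previous run. With Dora,
        # this is where xp.link.update_history(history) would be called.
        history = json.loads(checkpoint.read_text())["history"]
    for t in range(len(history), epochs):
        history.append({"loss": 1 / (t + 1)})
        # Save after every epoch so an interrupted job loses at most one epoch.
        checkpoint.write_text(json.dumps({"history": history}))
    return history


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)  # stands in for xp.folder
    partial = train(folder, epochs=3)   # first job, stopped after 3 epochs
    resumed = train(folder, epochs=10)  # requeued job picks up at epoch 3
print(len(partial), len(resumed))
```

The second call does not redo the first three epochs: it reloads the history and continues from there.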

Hydra support

The template for train.py:

from dora import hydra_main, get_xp


@hydra_main(
    config_path="./conf",  # path where the config is stored, relative to the parent of `mycode`.
    config_name="config"  # a file `config.yaml` should exist there.
)
def main(cfg):
    xp = get_xp()
    xp.cfg  # parsed configuration
    xp.sig  # signature for the current run
    # Hydra run folder will automatically be set to xp.folder!

    xp.link  # link object, can send back metrics to Dora
    # If you load a previous checkpoint, always make sure that the Dora Link
    # is consistent with what is in the checkpoint:
    # history = checkpoint['history']
    # xp.link.update_history(history)

    for t in range(10):
        xp.link.push_metrics({"loss": 1/(t + 1)})
    ...

You can customize Dora's behavior from the config.yaml file, e.g.:

my_config: plop
num_workers: 40
logs:
    interval: 10
    level: info

dora:
    exclude: ["num_workers", "logs.*"]
    dir: "./outputs"
    git_save: true  # set git_save option for the project.
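
The effect of the exclude patterns can be sketched as follows. This assumes glob-style matching (here via the standard fnmatch module) against dotted key names; the `signature_args` helper is hypothetical and Dora's own matching rules may differ in detail.

```python
from fnmatch import fnmatch

exclude = ["num_workers", "logs.*"]  # patterns taken from the config above

# Flattened view of the example configuration, with dotted keys for nested entries.
config = {
    "my_config": "plop",
    "num_workers": 40,
    "logs.interval": 10,
    "logs.level": "info",
}


def signature_args(config: dict, exclude: list) -> dict:
    """Keep only the entries whose name matches none of the exclude patterns."""
    return {
        name: value
        for name, value in config.items()
        if not any(fnmatch(name, pattern) for pattern in exclude)
    }


kept = signature_args(config, exclude)
print(kept)  # {'my_config': 'plop'}
```

Under these assumptions, changing num_workers or any logs.* entry would leave the signature unchanged, so the run would still be matched to the same XP.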

Distributed training support

Dora supports distributed training, and makes a few assumptions for you. You should initialize distributed training through Dora, by calling in your main function:

import dora.distrib
dora.distrib.init()

Git Save

You can set the git_save option on your project; the argparse and Hydra templates above show how. When this option is set, Dora makes an individual clone of your project repository for each scheduled experiment. The job then runs from that clean clone. This both keeps track of the exact code used for an experiment and prevents code changes from impacting pending or requeued jobs. However, if you reschedule a failed or cancelled job, the clone will be updated with the current code.

In order to use this option, your code must be able to run from a fresh clone of the repository. If you need to access resources specified with a path relative to the original repo, use dora.to_absolute_path(). This is similar to hydra.utils.to_absolute_path(). In fact, you can safely replace the Hydra version with this one: even when git_save is not set, the Dora one automatically falls back to the Hydra one (if Hydra is used).
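
The underlying idea is simply to resolve relative paths against the original repository rather than against the clone the job runs from. A rough sketch of that idea, with a hypothetical `resolve_from_repo` helper and a made-up repository location (this is not how dora.to_absolute_path() is implemented):

```python
from pathlib import Path


def resolve_from_repo(path: str, original_repo: Path) -> Path:
    """Resolve `path` against the original repository root (hypothetical helper)."""
    candidate = Path(path)
    if candidate.is_absolute():
        return candidate  # absolute paths are left untouched
    return original_repo / candidate


repo = Path("/home/user/myrepo")  # hypothetical location of the original checkout
relative = resolve_from_repo("data/train.csv", repo)
absolute = resolve_from_repo(str(repo / "data" / "train.csv"), repo)
print(relative)
print(absolute)
```

Both calls point at the same file in the original checkout, whichever directory the job is actually running from.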

The repository must be completely clean before scheduling remote jobs, and all files should be either tracked or git-ignored. This is very restrictive, but it makes implementing this feature much simpler and safer. It also enforces good practice :) Only the dora run command can be used on a dirty repository, to allow for easy debugging. For the dora launch and dora grid commands, you can also use the --no_git_save option to temporarily deactivate this feature.

The clone for each experiment is located inside the code/ subfolder inside the XP folder (which you can get with the dora info command for instance).

The `dora` command

Dora installs a dora command, which is the main way to interact with it. The dora command defines four sub-commands, detailed in the following sections:

dora run: run training code locally (e.g. for debugging).
dora launch: launch remote jobs, useful for one-off experiments.
dora info: get information on a specific job/XP, logs etc.
dora grid: launch an entire grid search defined in a grid file. Only missing XPs are scheduled. It also reports status and latest metrics.

In order for Dora to find your code, you must pass your training package (i.e. mycode) as dora -P mycode [run|launch|grid|info]. This flag can be skipped if mycode is in the current working directory and is the only folder with a train.py file in it, in which case Dora will find it automatically. You can also export DORA_PACKAGE=mycode to avoid having to give the -P flag explicitly.
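
The lookup described above can be sketched as follows. The `find_package` helper is hypothetical, and the precedence it uses (explicit flag first, then DORA_PACKAGE, then a scan of the working directory) is an assumption, not a description of Dora's actual implementation.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_package(cwd: Path, explicit: Optional[str] = None) -> str:
    """Pick the training package: explicit value, then DORA_PACKAGE, then a scan."""
    if explicit:
        return explicit
    from_env = os.environ.get("DORA_PACKAGE")
    if from_env:
        return from_env
    # Fall back to the only folder in `cwd` that contains a train.py file.
    candidates = sorted(
        entry.name for entry in cwd.iterdir()
        if entry.is_dir() and (entry / "train.py").is_file()
    )
    if len(candidates) != 1:
        raise RuntimeError(f"Expected exactly one package with train.py, found {candidates}")
    return candidates[0]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "mycode").mkdir()
    (root / "mycode" / "train.py").write_text("")
    (root / "docs").mkdir()  # no train.py here, so it is ignored

    os.environ.pop("DORA_PACKAGE", None)  # make the demo independent of the shell
    detected = find_package(root)
    forced = find_package(root, explicit="otherpkg")
print(detected, forced)
```

If a second folder with a train.py file were present, the scan would be ambiguous and the sketch raises an error, which corresponds to the case where the -P flag or DORA_PACKAGE is needed.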

You can
