# dsub: simple batch jobs with Docker

## Overview
dsub is a command-line tool that makes it easy to submit and run batch scripts
in the cloud.
The dsub user experience is modeled after traditional high-performance
computing job schedulers like Grid Engine and Slurm. You write a script and
then submit it to a job scheduler from a shell prompt on your local machine.
Today dsub supports Google Cloud as the backend batch job runner, along with a
local provider for development and testing. With help from the community, we'd
like to add other backends, such as Grid Engine, Slurm, Amazon Batch,
and Azure Batch.
## Getting started

dsub is written in Python and requires Python 3.7 or higher.

- The last version to support Python 3.6 was dsub 0.4.7.
- For earlier versions of Python 3, use dsub 0.4.1.
- For Python 2, use dsub 0.3.10.
### Pre-installation steps

#### Create a Python virtual environment

This is optional, but whether installing from PyPI or from github, you are
strongly encouraged to use a Python virtual environment.

You can do this in a directory of your choosing:

```
python3 -m venv dsub_libs
source dsub_libs/bin/activate
```
Using a Python virtual environment isolates dsub library dependencies from
other Python applications on your system.
Activate this virtual environment in any shell session before running dsub.
To deactivate the virtual environment in your shell, run the command:
```
deactivate
```
Alternatively, a set of convenience scripts is provided that activates the
virtualenv before calling dsub, dstat, and ddel. They are in the
bin directory. You can use these scripts if you don't want to activate the
virtualenv explicitly in your shell.
#### Install the Google Cloud SDK
While not used directly by dsub for the google-batch provider, you are likely to want to install the command line tools found in the Google
Cloud SDK.
If you will be using the local provider for faster job development,
you will need to install the Google Cloud SDK: the local provider uses
gsutil to ensure file operation semantics consistent with the Google dsub
providers.
Run:

```
gcloud init
```

gcloud will prompt you to set your default project and to grant credentials
to the Google Cloud SDK.
### Install dsub

Choose one of the following:

#### Install from PyPI

1. If necessary, install pip.

2. Install dsub:

   ```
   pip install dsub
   ```
#### Install from github

1. Be sure you have git installed.

   Instructions for your environment can be found on the git website.

2. Clone this repository:

   ```
   git clone https://github.com/DataBiosphere/dsub
   cd dsub
   ```

3. Install dsub (this will also install the dependencies):

   ```
   python -m pip install .
   ```

4. Set up Bash tab completion (optional):

   ```
   source bash_tab_complete
   ```
### Post-installation steps

1. Minimally verify the installation by running:

   ```
   dsub --help
   ```

2. (Optional) Install Docker.

   This is necessary only if you're going to create your own Docker images
   or use the local provider.
### Makefile

After cloning the dsub repo, you can also use the Makefile by running:

```
make
```

This will create a Python virtual environment and install dsub into a
directory named dsub_libs.
## Getting started with the local provider
We think you'll find the local provider to be very helpful when building
your dsub tasks. Instead of submitting a request to run your command on a
cloud VM, the local provider runs your dsub tasks on your local machine.
The local provider is not designed for running at scale. It is designed
to emulate running on a cloud VM such that you can rapidly iterate.
You'll get quicker turnaround times and won't incur cloud charges using it.
1. Run a dsub job and wait for completion.

   Here is a very simple "Hello World" test:

   ```
   dsub \
     --provider local \
     --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
     --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
     --command 'echo "Hello World" > "${OUT}"' \
     --wait
   ```

   Note: TMPDIR is commonly set to /tmp by default on most Unix systems,
   although it is also often left unset. On some versions of MacOS, TMPDIR
   is set to a location under /var/folders.

   Note: The above syntax ${TMPDIR:-/tmp} is known to be supported by Bash,
   zsh, and ksh. The shell will expand TMPDIR, but if it is unset, /tmp will
   be used.

2. View the output file:

   ```
   cat "${TMPDIR:-/tmp}/dsub-test/output/out.txt"
   ```
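The `${TMPDIR:-/tmp}` fallback expansion used above can be checked directly in any POSIX shell. In this sketch, `DEMO_DIR` is just an illustrative variable name:

```shell
# ${VAR:-default} expands to the value of VAR if it is set and non-empty,
# and to "default" otherwise.
unset DEMO_DIR
echo "when unset: ${DEMO_DIR:-/tmp}"    # prints: when unset: /tmp

DEMO_DIR="/var/folders/example"
echo "when set:   ${DEMO_DIR:-/tmp}"    # prints: when set:   /var/folders/example
```

Because the expansion happens in your local shell before dsub ever runs, the same command line works whether or not your system sets TMPDIR.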
## Getting started on Google Cloud

dsub currently supports the Batch API from Google Cloud.
google-batch is the current default provider.

To get started:
1. Sign up for a Google account and create a project.

2. Enable the Batch API for your project (provider: google-batch).
3. Provide credentials so dsub can call Google APIs:

   ```
   gcloud auth application-default login
   ```

4. Create a Google Cloud Storage bucket.

   The dsub logs and output files will be written to a bucket. Create a
   bucket using the storage browser or run the command-line utility gsutil,
   included in the Cloud SDK:

   ```
   gsutil mb gs://my-bucket
   ```

   Change my-bucket to a unique name that follows the bucket-naming
   conventions.

   (By default, the bucket will be in the US, but you can change or refine
   the location setting with the -l option.)
5. Run a very simple "Hello World" dsub job and wait for completion.

   For the batch API (provider: google-batch):

   ```
   dsub \
     --provider google-batch \
     --project my-cloud-project \
     --regions us-central1 \
     --logging gs://my-bucket/logging/ \
     --output OUT=gs://my-bucket/output/out.txt \
     --command 'echo "Hello World" > "${OUT}"' \
     --wait
   ```

   Change my-cloud-project to your Google Cloud project, and my-bucket to
   the bucket you created above.

   The output of the script command will be written to the OUT file in
   Cloud Storage that you specify.

6. View the output file:

   ```
   gsutil cat gs://my-bucket/output/out.txt
   ```
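Once a job has been submitted, you can monitor or cancel it with the companion tools dstat and ddel, which are installed alongside dsub. This is a hedged sketch; my-cloud-project and my-job-id are placeholders for your own project and the job-id that dsub prints at submission:

```shell
# Show status for your recent jobs in the project
dstat \
  --provider google-batch \
  --project my-cloud-project \
  --status '*'

# Cancel a job by its job-id
ddel \
  --provider google-batch \
  --project my-cloud-project \
  --jobs 'my-job-id'
```

Both commands require the same Google Cloud credentials as dsub itself.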
## Backend providers
Where possible, dsub tries to support users being able to develop and test
locally (for faster iteration) and then progressing to running at scale.
To this end, dsub provides multiple "backend providers", each of which
implements a consistent runtime environment. The current providers are:
- local
- google-batch (the default)
More details on the runtime environment implemented by the backend providers can be found in dsub backend providers.
### Differences between google-cls-v2 and google-batch
The google-cls-v2 provider is built on the Cloud Life Sciences v2beta API.
This API is very similar to its predecessor, the Genomics v2alpha1 API.
Details of the differences can be found in the
Migration Guide.
The google-batch provider is built on the Cloud Batch API.
Details of Cloud Life Sciences versus Batch can be found in this
Migration Guide.
dsub largely hides the differences between the APIs, but there are a
few differences to note:
#### google-batch requires jobs to run in one region
The --regions and --zones flags for dsub specify where the tasks should
run. The google-cls-v2 provider allows you to specify a multi-region like US,
multiple regions, or multiple zones across regions. With the google-batch
provider, you must specify either one region or multiple zones within a single
region.
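As a hedged sketch of the location flags under the google-batch provider (other required flags are elided with `...`):

```shell
# google-batch: a single region...
dsub --provider google-batch --regions us-central1 ...

# ...or zones within a single region (wildcards are supported):
dsub --provider google-batch --zones "us-central1-*" ...
```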
## dsub features

The following sections show how to run more complex jobs.

### Defining what code to run
You can provide a shell command directly in the dsub command-line, as in the hello example above.
You can also save your script to a file, like hello.sh. Then you can run:
```
dsub \
  ... \
  --script hello.sh
```
If your script has dependencies that are not stored in your Docker image, you can transfer them to the local disk. See the instructions below for working with input and output files and folders.
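For example, a script whose input data lives in Cloud Storage can pull it in with `--input`; this is a hedged sketch, and the project, bucket, and variable names are placeholders:

```shell
# The object named by --input is copied to the task's local disk before
# hello.sh runs; its local path is exposed as ${INPUT_FILE} to the script.
dsub \
  --provider google-batch \
  --project my-cloud-project \
  --regions us-central1 \
  --logging gs://my-bucket/logging/ \
  --input INPUT_FILE=gs://my-bucket/data/input.txt \
  --output OUT=gs://my-bucket/output/out.txt \
  --script hello.sh \
  --wait
```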
### Selecting a Docker image
To get started more easily, dsub uses a stock Ubuntu Docker image.
This default image may change at any time in future releases, so for
reproducible production work
