# dsub: simple batch jobs with Docker

## Overview
dsub is a command-line tool that makes it easy to submit and run batch scripts
in the cloud.
The dsub user experience is modeled after traditional high-performance
computing job schedulers like Grid Engine and Slurm. You write a script and
then submit it to a job scheduler from a shell prompt on your local machine.
Today dsub supports Google Cloud as the backend batch job runner, along with a
local provider for development and testing. With help from the community, we'd
like to add other backends, such as Grid Engine, Slurm, Amazon Batch,
and Azure Batch.
## Getting started

dsub is written in Python and requires Python 3.7 or higher.

- The last version to support Python 3.6 was dsub 0.4.7.
- For earlier versions of Python 3, use dsub 0.4.1.
- For Python 2, use dsub 0.3.10.
### Pre-installation steps

#### Create a Python virtual environment

This is optional, but whether installing from PyPI or from github, you are
strongly encouraged to use a Python virtual environment.

You can do this in a directory of your choosing:

```
python3 -m venv dsub_libs
source dsub_libs/bin/activate
```
Using a Python virtual environment isolates dsub library dependencies from
other Python applications on your system.
Activate this virtual environment in any shell session before running dsub.
To deactivate the virtual environment in your shell, run the command:
```
deactivate
```
Alternatively, a set of convenience scripts is provided that activates the
virtualenv before calling dsub, dstat, and ddel. They are in the
bin directory. You can use these scripts if you don't want to activate the
virtualenv explicitly in your shell.
#### Install the Google Cloud SDK
While not used directly by dsub for the google-batch provider, you are likely to want to install the command line tools found in the Google
Cloud SDK.
If you will be using the local provider for faster job development,
you will need to install the Google Cloud SDK: the local provider uses
gsutil to ensure file operation semantics consistent with the Google dsub
providers.
Run:

```
gcloud init
```

gcloud will prompt you to set your default project and to grant credentials
to the Google Cloud SDK.
### Install dsub

Choose one of the following:

#### Install from PyPI

1. If necessary, install pip.

2. Install dsub:

   ```
   pip install dsub
   ```
#### Install from github

1. Be sure you have git installed.

   Instructions for your environment can be found on the git website.

2. Clone this repository:

   ```
   git clone https://github.com/DataBiosphere/dsub
   cd dsub
   ```

3. Install dsub (this will also install the dependencies):

   ```
   python -m pip install .
   ```

4. Set up Bash tab completion (optional):

   ```
   source bash_tab_complete
   ```
### Post-installation steps

1. Minimally verify the installation by running:

   ```
   dsub --help
   ```

2. (Optional) Install Docker.

   This is necessary only if you're going to create your own Docker images
   or use the local provider.
### Makefile

After cloning the dsub repo, you can also use the Makefile by running:

```
make
```

This will create a Python virtual environment and install dsub into a
directory named dsub_libs.
## Getting started with the local provider
We think you'll find the local provider to be very helpful when building
your dsub tasks. Instead of submitting a request to run your command on a
cloud VM, the local provider runs your dsub tasks on your local machine.
The local provider is not designed for running at scale. It is designed
to emulate running on a cloud VM such that you can rapidly iterate.
You'll get quicker turnaround times and won't incur cloud charges using it.
1. Run a dsub job and wait for completion.

   Here is a very simple "Hello World" test:

   ```
   dsub \
     --provider local \
     --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
     --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
     --command 'echo "Hello World" > "${OUT}"' \
     --wait
   ```

   Note: TMPDIR is commonly set to /tmp by default on most Unix systems,
   although it is also often left unset. On some versions of MacOS, TMPDIR
   is set to a location under /var/folders.

   Note: The above syntax ${TMPDIR:-/tmp} is known to be supported by Bash,
   zsh, and ksh. The shell will expand TMPDIR, but if it is unset, /tmp will
   be used.

2. View the output file:

   ```
   cat "${TMPDIR:-/tmp}/dsub-test/output/out.txt"
   ```
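The `${TMPDIR:-/tmp}` fallback expansion used above can be checked directly in any POSIX shell. In this sketch, `DEMO_DIR` is just an illustrative variable name:

```shell
# ${VAR:-default} expands to the value of VAR if it is set and non-empty,
# and to "default" otherwise.
unset DEMO_DIR
echo "when unset: ${DEMO_DIR:-/tmp}"    # prints: when unset: /tmp

DEMO_DIR="/var/folders/example"
echo "when set:   ${DEMO_DIR:-/tmp}"    # prints: when set:   /var/folders/example
```

Because the expansion happens in your local shell before dsub ever runs, the same command line works whether or not your system sets TMPDIR.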
## Getting started on Google Cloud

dsub currently supports the Batch API from Google Cloud.
google-batch is the current default provider.

To get started:
1. Sign up for a Google account and create a project.

2. Enable the Batch API for your project (provider: google-batch).
3. Provide credentials so dsub can call Google APIs:

   ```
   gcloud auth application-default login
   ```

4. Create a Google Cloud Storage bucket.

   The dsub logs and output files will be written to a bucket. Create a
   bucket using the storage browser or run the command-line utility gsutil,
   included in the Cloud SDK:

   ```
   gsutil mb gs://my-bucket
   ```

   Change my-bucket to a unique name that follows the bucket-naming
   conventions.

   (By default, the bucket will be in the US, but you can change or refine
   the location setting with the -l option.)
5. Run a very simple "Hello World" dsub job and wait for completion.

   For the batch API (provider: google-batch):

   ```
   dsub \
     --provider google-batch \
     --project my-cloud-project \
     --regions us-central1 \
     --logging gs://my-bucket/logging/ \
     --output OUT=gs://my-bucket/output/out.txt \
     --command 'echo "Hello World" > "${OUT}"' \
     --wait
   ```

   Change my-cloud-project to your Google Cloud project, and my-bucket to
   the bucket you created above.

   The output of the script command will be written to the OUT file in
   Cloud Storage that you specify.

6. View the output file:

   ```
   gsutil cat gs://my-bucket/output/out.txt
   ```
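Once a job has been submitted, you can monitor or cancel it with the companion tools dstat and ddel, which are installed alongside dsub. This is a hedged sketch; my-cloud-project and my-job-id are placeholders for your own project and the job-id that dsub prints at submission:

```shell
# Show status for your recent jobs in the project
dstat \
  --provider google-batch \
  --project my-cloud-project \
  --status '*'

# Cancel a job by its job-id
ddel \
  --provider google-batch \
  --project my-cloud-project \
  --jobs 'my-job-id'
```

Both commands require the same Google Cloud credentials as dsub itself.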
## Backend providers
Where possible, dsub tries to support users being able to develop and test
locally (for faster iteration) and then progressing to running at scale.
To this end, dsub provides multiple "backend providers", each of which
implements a consistent runtime environment. The current providers are:
- local
- google-batch (the default)
More details on the runtime environment implemented by the backend providers can be found in dsub backend providers.
### Differences between google-cls-v2 and google-batch
The google-cls-v2 provider is built on the Cloud Life Sciences v2beta API.
This API is very similar to its predecessor, the Genomics v2alpha1 API.
Details of the differences can be found in the
Migration Guide.
The google-batch provider is built on the Cloud Batch API.
Details of Cloud Life Sciences versus Batch can be found in this
Migration Guide.
dsub largely hides the differences between the APIs, but there are a
few differences to note:
#### google-batch requires jobs to run in one region
The --regions and --zones flags for dsub specify where the tasks should
run. The google-cls-v2 provider allows you to specify a multi-region like US,
multiple regions, or multiple zones across regions. With the google-batch
provider, you must specify either one region or multiple zones within a single
region.
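As a hedged sketch of the location flags under the google-batch provider (other required flags are elided with `...`):

```shell
# google-batch: a single region...
dsub --provider google-batch --regions us-central1 ...

# ...or zones within a single region (wildcards are supported):
dsub --provider google-batch --zones "us-central1-*" ...
```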
## dsub features

The following sections show how to run more complex jobs.

### Defining what code to run
You can provide a shell command directly in the dsub command-line, as in the hello example above.
You can also save your script to a file, like hello.sh. Then you can run:
```
dsub \
  ... \
  --script hello.sh
```
If your script has dependencies that are not stored in your Docker image, you can transfer them to the local disk. See the instructions below for working with input and output files and folders.
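For example, a script whose input data lives in Cloud Storage can pull it in with `--input`; this is a hedged sketch, and the project, bucket, and variable names are placeholders:

```shell
# The object named by --input is copied to the task's local disk before
# hello.sh runs; its local path is exposed as ${INPUT_FILE} to the script.
dsub \
  --provider google-batch \
  --project my-cloud-project \
  --regions us-central1 \
  --logging gs://my-bucket/logging/ \
  --input INPUT_FILE=gs://my-bucket/data/input.txt \
  --output OUT=gs://my-bucket/output/out.txt \
  --script hello.sh \
  --wait
```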
### Selecting a Docker image
To get started more easily, dsub uses a stock Ubuntu Docker image.
This default image may change at any time in future releases, so for
reproducible production work
