dsub: simple batch jobs with Docker

Overview

dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud.

The dsub user experience is modeled after traditional high-performance computing job schedulers like Grid Engine and Slurm. You write a script and then submit it to a job scheduler from a shell prompt on your local machine.

Today dsub supports Google Cloud as the backend batch job runner, along with a local provider for development and testing. With help from the community, we'd like to add other backends, such as Grid Engine, Slurm, Amazon Batch, and Azure Batch.

Getting started

dsub is written in Python and requires Python 3.7 or higher.

  • The last version to support Python 3.6 was dsub 0.4.7.
  • For earlier versions of Python 3, use dsub 0.4.1.
  • For Python 2, use dsub 0.3.10.

Pre-installation steps

Create a Python virtual environment

This is optional, but whether installing from PyPI or from GitHub, you are strongly encouraged to use a Python virtual environment.

You can do this in a directory of your choosing.

    python3 -m venv dsub_libs
    source dsub_libs/bin/activate

Using a Python virtual environment isolates dsub library dependencies from other Python applications on your system.

Activate this virtual environment in any shell session before running dsub. To deactivate the virtual environment in your shell, run the command:

    deactivate

Alternatively, a set of convenience scripts is provided that activates the virtualenv before calling dsub, dstat, and ddel. They are in the bin directory. You can use these scripts if you don't want to activate the virtualenv explicitly in your shell.

Install the Google Cloud SDK

While not used directly by dsub for the google-batch provider, you are likely to want the command-line tools included in the Google Cloud SDK.

If you will be using the local provider for faster job development, you will need to install the Google Cloud SDK: dsub uses its gsutil tool to ensure file operation semantics consistent with the Google dsub providers.

  1. Install the Google Cloud SDK

  2. Run

     gcloud init
    

    gcloud will prompt you to set your default project and to grant credentials to the Google Cloud SDK.

Install dsub

Choose one of the following:

Install from PyPI

  1. If necessary, install pip.

  2. Install dsub

     pip install dsub
    

Install from GitHub

  1. Be sure you have git installed

    Instructions for your environment can be found on the git website.

  2. Clone this repository.

    git clone https://github.com/DataBiosphere/dsub
    cd dsub
    
  3. Install dsub (this will also install the dependencies)

    python -m pip install .
    
  4. Set up Bash tab completion (optional).

    source bash_tab_complete
    

Post-installation steps

  1. Minimally verify the installation by running:

    dsub --help
    
  2. (Optional) Install Docker.

    This is necessary only if you're going to create your own Docker images or use the local provider.

Makefile

After cloning the dsub repo, you can also use the Makefile by running:

    make

This will create a Python virtual environment and install dsub into a directory named dsub_libs.

Getting started with the local provider

We think you'll find the local provider to be very helpful when building your dsub tasks. Instead of submitting a request to run your command on a cloud VM, the local provider runs your dsub tasks on your local machine.

The local provider is not designed for running at scale. It is designed to emulate running on a cloud VM such that you can rapidly iterate. You'll get quicker turnaround times and won't incur cloud charges using it.

  1. Run a dsub job and wait for completion.

    Here is a very simple "Hello World" test:

     dsub \
       --provider local \
       --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
       --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
       --command 'echo "Hello World" > "${OUT}"' \
       --wait
    

    Note: TMPDIR is commonly set to /tmp by default on most Unix systems, although it is also often left unset. On some versions of macOS, TMPDIR is set to a location under /var/folders.

    Note: The above syntax ${TMPDIR:-/tmp} is standard POSIX parameter expansion, supported by Bash, zsh, and ksh. The shell expands TMPDIR, but if it is unset or empty, /tmp will be used.

  2. View the output file.

     cat "${TMPDIR:-/tmp}/dsub-test/output/out.txt"
    

Getting started on Google Cloud

dsub currently supports the Batch API from Google Cloud.

google-batch is the current default provider.

Follow the steps below to get started:

  1. Sign up for a Google account and create a project.

  2. Enable the APIs:

    • For the batch API (provider: google-batch):

    Enable the Batch, Storage, and Compute APIs.

  3. Provide credentials so dsub can call Google APIs:

     gcloud auth application-default login
    
  4. Create a Google Cloud Storage bucket.

    The dsub logs and output files will be written to a bucket. Create a bucket using the storage browser or run the command-line utility gsutil, included in the Cloud SDK.

    gsutil mb gs://my-bucket
    

    Change my-bucket to a unique name that follows the bucket-naming conventions.

    (By default, the bucket will be in the US, but you can change or refine the location setting with the -l option.)

  5. Run a very simple "Hello World" dsub job and wait for completion.

    • For the batch API (provider: google-batch):

        dsub \
          --provider google-batch \
          --project my-cloud-project \
          --regions us-central1 \
          --logging gs://my-bucket/logging/ \
          --output OUT=gs://my-bucket/output/out.txt \
          --command 'echo "Hello World" > "${OUT}"' \
          --wait
      

    Change my-cloud-project to your Google Cloud project, and my-bucket to the bucket you created above.

    The output of the script command will be written to the OUT file in Cloud Storage that you specify.

  6. View the output file.

     gsutil cat gs://my-bucket/output/out.txt
    

Backend providers

Where possible, dsub aims to let users develop and test locally (for faster iteration) before running at scale.

To this end, dsub provides multiple "backend providers", each of which implements a consistent runtime environment. The current providers are:

  • local
  • google-batch (the default)

More details on the runtime environment implemented by the backend providers can be found in dsub backend providers.

Differences between google-cls-v2 and google-batch

The google-cls-v2 provider is built on the Cloud Life Sciences v2beta API. This API is very similar to its predecessor, the Genomics v2alpha1 API. Details of the differences can be found in the Migration Guide.

The google-batch provider is built on the Cloud Batch API. Details of Cloud Life Sciences versus Batch can be found in this Migration Guide.

dsub largely hides the differences between the APIs, but there are a few differences to note:

  • google-batch requires jobs to run in one region

The --regions and --zones flags for dsub specify where the tasks should run. The google-cls-v2 provider allows you to specify a multi-region like US, multiple regions, or multiple zones across regions. With the google-batch provider, you must specify either one region or multiple zones within a single region.
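As an illustration, the location flag fragments below contrast the two cases. These are flag fragments only; the region and zone values are placeholders, and `dsub --help` documents the exact syntax.

```shell
# Location flag fragments for a dsub command line (illustrative).

# Allowed with google-batch: a single region
--regions us-central1

# Allowed with google-batch: multiple zones within one region
--zones "us-central1-*"

# Not allowed with google-batch (google-cls-v2 only): a multi-region
# --regions US
```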

dsub features

The following sections show how to run more complex jobs.

Defining what code to run

You can provide a shell command directly in the dsub command-line, as in the hello example above.

You can also save your script to a file, like hello.sh. Then you can run:

dsub \
    ... \
    --script hello.sh

If your script has dependencies that are not stored in your Docker image, you can transfer them to the local disk. See the instructions below for working with input and output files and folders.
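As a minimal sketch of the script-file workflow, the snippet below writes a hypothetical hello.sh and shows (commented out) how it could be submitted with the local provider, reusing the logging and output paths from the earlier example. The OUT variable is supplied by dsub at runtime from your --output flag; everything else here is illustrative.

```shell
#!/bin/bash
# Create a hypothetical task script, hello.sh. Inside the task, dsub
# exports each --output variable (here OUT) as an environment variable
# holding a local path; the provider copies that file to its final
# destination when the task completes.
cat > hello.sh <<'EOF'
#!/bin/bash
set -o errexit -o nounset
echo "Hello from a script" > "${OUT}"
EOF
chmod +x hello.sh

# Submit it (illustrative; substitute your own logging/output paths):
# dsub \
#   --provider local \
#   --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
#   --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
#   --script hello.sh \
#   --wait
```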

Selecting a Docker image

To get started more easily, dsub uses a stock Ubuntu Docker image. This default image may change at any time in future releases, so for reproducible production workflows you should specify the image explicitly with the --image flag.
