Reflow
A language and runtime for distributed, incremental data processing in the cloud
Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs. Reflow then evaluates these programs in a cloud environment, transparently parallelizing work and memoizing results. Reflow was created at GRAIL to manage our NGS (next generation sequencing) bioinformatics workloads on AWS, but has also been used for many other applications, including model training and ad-hoc data analyses.
Reflow comprises:
- a functional, lazy, type-safe domain specific language for writing workflow programs;
- a runtime for evaluating Reflow programs incrementally, coordinating cluster execution, and transparently memoizing results;
- a cluster scheduler to dynamically provision and tear down resources from a cloud provider (AWS currently supported).
Reflow thus allows scientists and engineers to write straightforward programs and then have them transparently executed in a cloud environment. Programs are automatically parallelized and distributed across multiple machines, and redundant computations (even across runs and users) are eliminated by its memoization cache. Reflow evaluates its programs incrementally: whenever the input data or program changes, only those outputs that depend on the changed data or code are recomputed.
In addition to the default cluster computing mode, Reflow programs can also be run locally, making use of the local machine's Docker daemon (including Docker for Mac).
Reflow was designed to support sophisticated, large-scale bioinformatics workflows, but should be widely applicable to scientific and engineering computing workloads. It was built using Go.
Reflow joins a long list of systems designed to tackle bioinformatics workloads, but differs from them in important ways:
- it is a vertically integrated system with a minimal set of external dependencies; this allows Reflow to be "plug-and-play": bring your cloud credentials, and you're off to the races;
- it defines a strict data model which is used for transparent memoization and other optimizations;
- it takes workflow software seriously: the Reflow DSL provides type checking, modularity, and other constructs that are commonplace in general purpose programming languages;
- because of its high-level data model and use of caching, Reflow computes incrementally: it always computes the smallest set of operations needed given what has been computed previously.
Table of Contents
- Quickstart - AWS
- Simple bioinformatics workflow
- 1000align
- Documentation
- Developing and building Reflow
- Debugging Reflow runs
- A note on Reflow's EC2 cluster manager
- Setting up a TaskDB
- Support and community
Getting Reflow
You can get binaries (macOS/amd64, Linux/amd64) for the latest release at the GitHub release page.
If you are developing Reflow, or would like to build it yourself, please follow the instructions in the section "Developing and building Reflow."
Quickstart - AWS
Reflow is distributed with an EC2 cluster manager, and a memoization
cache implementation based on S3. These must be configured before
use. Reflow maintains a configuration file in $HOME/.reflow/config.yaml
by default (this can be overridden with the -config option). Reflow's
setup commands modify this file directly. After each step, the current
configuration can be examined by running reflow config.
Note Reflow must have access to AWS credentials and configuration in the
environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION) while
running these commands.
% reflow setup-ec2
% reflow config
cluster: ec2cluster
ec2cluster:
  securitygroup: <a newly created security group here>
  maxpendinginstances: 5
  maxhourlycostusd: 10
  disktype: gp3
  instancesizetodiskspace:
    2xlarge: 300
    3xlarge: 300
    ...
  diskslices: 0
  ami: ""
  sshkey: []
  keyname: ""
  cloudconfig: {}
  instancetypes:
  - c3.2xlarge
  - c3.4xlarge
  ...
reflow: reflowversion,version=<hash>
tls: tls,file=$HOME/.reflow/reflow.pem
After running reflow setup-ec2, we see that Reflow created a new
security group (associated with the account's default VPC), and
configured the cluster to use some default settings. Feel free to
edit the configuration file ($HOME/.reflow/config.yaml) to your
taste. If you want to use spot instances, add a new key under ec2cluster:
spot: true.
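For example, enabling spot instances amounts to one extra key in $HOME/.reflow/config.yaml (only the relevant section is shown here):

```yaml
ec2cluster:
  spot: true
```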
Reflow only configures one security group per account: Reflow will reuse
a previously created security group if reflow setup-ec2 is run anew.
See reflow setup-ec2 -help for more details.
Next, we'll set up a cache. This isn't strictly necessary, but we'll
need it in order to use many of Reflow's sophisticated caching and
incremental computation features. On AWS, Reflow implements a cache
based on S3 and DynamoDB. A new S3-based cache is provisioned by
reflow setup-s3-repository and reflow setup-dynamodb-assoc, each
of which takes one argument naming the S3 bucket and DynamoDB table
name to be used, respectively. The S3 bucket is used to store file
objects while the DynamoDB table is used to store associations
between logically named computations and their concrete output. Note
that S3 bucket names are global, so pick a name that's likely to be
unique.
% reflow setup-s3-repository reflow-quickstart-cache
reflow: creating s3 bucket reflow-quickstart-cache
reflow: created s3 bucket reflow-quickstart-cache
% reflow setup-dynamodb-assoc reflow-quickstart
reflow: creating DynamoDB table reflow-quickstart
reflow: created DynamoDB table reflow-quickstart
% reflow config
assoc: dynamodb,table=reflow-quickstart
repository: s3,bucket=reflow-quickstart-cache
<rest is same as before>
The setup commands created the S3 bucket and DynamoDB table as needed, and modified the configuration accordingly.
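The association stored in DynamoDB can be thought of as a digest-keyed map from computations to their results. Below is a minimal sketch of that idea, assuming a hypothetical operation name and S3 paths; it is not Reflow's actual data model or table schema:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// assoc maps a digest of an operation and its input digests to the
// location of a memoized result, so identical computations resolve
// to the same entry across runs and users.
type assoc map[[sha256.Size]byte]string

// key digests an operation name together with its input digests.
func key(op string, inputs ...string) [sha256.Size]byte {
	h := sha256.New()
	h.Write([]byte(op))
	for _, in := range inputs {
		h.Write([]byte{0}) // separate fields unambiguously
		h.Write([]byte(in))
	}
	var k [sha256.Size]byte
	copy(k[:], h.Sum(nil))
	return k
}

func main() {
	cache := assoc{}
	// Record the result of a (hypothetical) indexing step.
	cache[key("bwa index", "sha256:9a2e")] = "s3://reflow-quickstart-cache/objects/ab12"
	// A repeated run computes the same key and hits the cache.
	if loc, ok := cache[key("bwa index", "sha256:9a2e")]; ok {
		fmt.Println("cache hit:", loc)
	}
}
```

Because the key is derived only from the computation and its inputs, the same work is never repeated as long as the association and the underlying objects survive.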
Advanced users can also optionally set up a TaskDB.
We're now ready to run our first "hello world" program!
Create a file called "hello.rf" with the following contents:
val Main = exec(image := "ubuntu", mem := GiB) (out file) {"
	echo hello world >>{{out}}
"}
and run it:
% reflow run hello.rf
reflow: run ID: 6da656d1
ec2cluster: 0 instances: (<=$0.0/hr), total{}, waiting{mem:1.0GiB cpu:1 disk:1.0GiB}
reflow: total n=1 time=0s
ident      n   ncache transfer runtime(m) cpu mem(GiB) disk(GiB) tmp(GiB)
hello.Main 1   1      0B
a948904f
Here, Reflow started a new t2.small instance (Reflow matches the workload with
available instance types), ran echo hello world inside of an Ubuntu container,
placed the output in a file, and returned its SHA256 digest. (Reflow represents
file contents using their SHA256 digest.)
We're now ready to explore Reflow more fully.
Simple bioinformatics workflow
Let's explore some of Reflow's features through a simple task: aligning NGS read data from the 1000genomes project. Create a file called "align.rf" with the following. The code is commented inline for clarity.
// In order to align raw NGS data, we first need to construct an index
// against which to perform the alignment. We're going to be using
// the BWA aligner, and so we'll need to retrieve a reference sequence
// and create an index that's usable from BWA.
// g1kv37 is a human reference FASTA sequence. (All
// chromosomes.) Reflow has a static type system, but most type
// annotations can be omitted: they are inferred by Reflow. In this
// case, we're creating a file: a reference to the contents of the
// named URL. We're retrieving data from the public 1000genomes S3
// bucket.
val g1kv37 = file("s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz")
// Here we create an indexed version of the g1kv37 reference. It is
// created using the "bwa index" command with the raw FASTA data as
// input. Here we encounter another way to produce data in reflow:
// the exec. An exec runs a (Bash) script inside of a Docker image,
// placing the output in files or directories (or both: execs can
// return multiple values). In this case, we're returning a
// directory since BWA stores multiple index files alongside the raw
// reference. We also declare that the image to be used is
// "biocontainers/bwa" (the BWA image maintained by the
// biocontainers project).
//
// Inside of an exec template (delimited by {" and "}) we refer to
// (interpolate) values in our environment by placing expressions
// inside of the {{ and }} delimiters. In this case we're referring
// to the file g1kv37 declared above, and our output, named out.
//
// Many types of expressions can be interpolated inside of an exec,
// for example strings, integers, files, and directories. Strings
// and integers are rendered using their normal representation,
// files and directories are materialized to a local path before
// starting execution. Thus, in this case, {{g1kv37}} is replaced at
// runtime by a path on disk to a file with the contents of the
// file g1kv37.
