Reflow
A language and runtime for distributed, incremental data processing in the cloud
Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs. Reflow then evaluates these programs in a cloud environment, transparently parallelizing work and memoizing results. Reflow was created at GRAIL to manage our NGS (next generation sequencing) bioinformatics workloads on AWS, but has also been used for many other applications, including model training and ad-hoc data analyses.
Reflow comprises:
- a functional, lazy, type-safe domain specific language for writing workflow programs;
- a runtime for evaluating Reflow programs incrementally, coordinating cluster execution, and transparently memoizing results;
- a cluster scheduler to dynamically provision and tear down resources from a cloud provider (AWS currently supported).
Reflow thus allows scientists and engineers to write straightforward programs and then have them transparently executed in a cloud environment. Programs are automatically parallelized and distributed across multiple machines, and redundant computations (even across runs and users) are eliminated by its memoization cache. Reflow evaluates its programs incrementally: whenever the input data or program changes, only those outputs that depend on the changed data or code are recomputed.
In addition to the default cluster computing mode, Reflow programs can also be run locally, making use of the local machine's Docker daemon (including Docker for Mac).
Reflow was designed to support sophisticated, large-scale bioinformatics workflows, but should be widely applicable to scientific and engineering computing workloads. It was built using Go.
Reflow joins a long list of systems designed to tackle bioinformatics workloads, but differs from them in important ways:
- it is a vertically integrated system with a minimal set of external dependencies; this allows Reflow to be "plug-and-play": bring your cloud credentials, and you're off to the races;
- it defines a strict data model which is used for transparent memoization and other optimizations;
- it takes workflow software seriously: the Reflow DSL provides type checking, modularity, and other constructs that are commonplace in general purpose programming languages;
- because of its high-level data model and use of caching, Reflow computes incrementally: it always computes the smallest set of operations needed given what has been computed previously.
Table of Contents
- Quickstart - AWS
- Simple bioinformatics workflow
- 1000align
- Documentation
- Developing and building Reflow
- Debugging Reflow runs
- A note on Reflow's EC2 cluster manager
- Setting up a TaskDB
- Support and community
Getting Reflow
You can get binaries (macOS/amd64, Linux/amd64) for the latest release at the GitHub release page.
If you are developing Reflow, or would like to build it yourself, please follow the instructions in the section "Developing and building Reflow."
Quickstart - AWS
Reflow is distributed with an EC2 cluster manager, and a memoization
cache implementation based on S3. These must be configured before
use. Reflow maintains a configuration file in $HOME/.reflow/config.yaml
by default (this can be overridden with the -config option). Reflow's
setup commands modify this file directly. After each step, the current
configuration can be examined by running reflow config.
Note Reflow must have access to AWS credentials and configuration in the
environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION) while
running these commands.
% reflow setup-ec2
% reflow config
cluster: ec2cluster
ec2cluster:
  securitygroup: <a newly created security group here>
  maxpendinginstances: 5
  maxhourlycostusd: 10
  disktype: gp3
  instancesizetodiskspace:
    2xlarge: 300
    3xlarge: 300
    ...
  diskslices: 0
  ami: ""
  sshkey: []
  keyname: ""
  cloudconfig: {}
  instancetypes:
  - c3.2xlarge
  - c3.4xlarge
  ...
reflow: reflowversion,version=<hash>
tls: tls,file=$HOME/.reflow/reflow.pem
After running reflow setup-ec2, we see that Reflow created a new
security group (associated with the account's default VPC), and
configured the cluster to use some default settings. Feel free to
edit the configuration file ($HOME/.reflow/config.yaml) to your
taste. If you want to use spot instances, add a new key under ec2cluster:
spot: true.
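For example, enabling spot instances amounts to one extra key in $HOME/.reflow/config.yaml (only the relevant section is shown here):

```yaml
ec2cluster:
  spot: true
```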
Reflow only configures one security group per account: Reflow will reuse
a previously created security group if reflow setup-ec2 is run anew.
See reflow setup-ec2 -help for more details.
Next, we'll set up a cache. This isn't strictly necessary, but we'll
need it in order to use many of Reflow's sophisticated caching and
incremental computation features. On AWS, Reflow implements a cache
based on S3 and DynamoDB. A new S3-based cache is provisioned by
reflow setup-s3-repository and reflow setup-dynamodb-assoc, each
of which takes one argument naming the S3 bucket and DynamoDB table
name to be used, respectively. The S3 bucket is used to store file
objects while the DynamoDB table is used to store associations
between logically named computations and their concrete output. Note
that S3 bucket names are global, so pick a name that's likely to be
unique.
% reflow setup-s3-repository reflow-quickstart-cache
reflow: creating s3 bucket reflow-quickstart-cache
reflow: created s3 bucket reflow-quickstart-cache
% reflow setup-dynamodb-assoc reflow-quickstart
reflow: creating DynamoDB table reflow-quickstart
reflow: created DynamoDB table reflow-quickstart
% reflow config
assoc: dynamodb,table=reflow-quickstart
repository: s3,bucket=reflow-quickstart-cache
<rest is same as before>
The setup commands created the S3 bucket and DynamoDB table as needed, and modified the configuration accordingly.
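The association stored in DynamoDB can be thought of as a digest-keyed map from computations to their results. Below is a minimal sketch of that idea, assuming a hypothetical operation name and S3 paths; it is not Reflow's actual data model or table schema:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// assoc maps a digest of an operation and its input digests to the
// location of a memoized result, so identical computations resolve
// to the same entry across runs and users.
type assoc map[[sha256.Size]byte]string

// key digests an operation name together with its input digests.
func key(op string, inputs ...string) [sha256.Size]byte {
	h := sha256.New()
	h.Write([]byte(op))
	for _, in := range inputs {
		h.Write([]byte{0}) // separate fields unambiguously
		h.Write([]byte(in))
	}
	var k [sha256.Size]byte
	copy(k[:], h.Sum(nil))
	return k
}

func main() {
	cache := assoc{}
	// Record the result of a (hypothetical) indexing step.
	cache[key("bwa index", "sha256:9a2e")] = "s3://reflow-quickstart-cache/objects/ab12"
	// A repeated run computes the same key and hits the cache.
	if loc, ok := cache[key("bwa index", "sha256:9a2e")]; ok {
		fmt.Println("cache hit:", loc)
	}
}
```

Because the key is derived only from the computation and its inputs, the same work is never repeated as long as the association and the underlying objects survive.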
Advanced users can also optionally set up a TaskDB.
We're now ready to run our first "hello world" program!
Create a file called "hello.rf" with the following contents:
val Main = exec(image := "ubuntu", mem := GiB) (out file) {"
	echo hello world >>{{out}}
"}
and run it:
% reflow run hello.rf
reflow: run ID: 6da656d1
ec2cluster: 0 instances: (<=$0.0/hr), total{}, waiting{mem:1.0GiB cpu:1 disk:1.0GiB}
reflow: total n=1 time=0s
ident      n   ncache transfer runtime(m) cpu mem(GiB) disk(GiB) tmp(GiB)
hello.Main 1   1      0B
a948904f
Here, Reflow started a new t2.small instance (Reflow matches the workload with
available instance types), ran echo hello world inside of an Ubuntu container,
placed the output in a file, and returned its SHA256 digest. (Reflow represents
file contents using their SHA256 digest.)
We're now ready to explore Reflow more fully.
Simple bioinformatics workflow
Let's explore some of Reflow's features through a simple task: aligning NGS read data from the 1000genomes project. Create a file called "align.rf" with the following. The code is commented inline for clarity.
// In order to align raw NGS data, we first need to construct an index
// against which to perform the alignment. We're going to be using
// the BWA aligner, and so we'll need to retrieve a reference sequence
// and create an index that's usable from BWA.
// g1kv37 is a human reference FASTA sequence. (All
// chromosomes.) Reflow has a static type system, but most type
// annotations can be omitted: they are inferred by Reflow. In this
// case, we're creating a file: a reference to the contents of the
// named URL. We're retrieving data from the public 1000genomes S3
// bucket.
val g1kv37 = file("s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz")
// Here we create an indexed version of the g1kv37 reference. It is
// created using the "bwa index" command with the raw FASTA data as
// input. Here we encounter another way to produce data in reflow:
// the exec. An exec runs a (Bash) script inside of a Docker image,
// placing the output in files or directories (or both: execs can
// return multiple values). In this case, we're returning a
// directory since BWA stores multiple index files alongside the raw
// reference. We also declare that the image to be used is
// "biocontainers/bwa" (the BWA image maintained by the
// biocontainers project).
//
// Inside of an exec template (delimited by {" and "}) we refer to
// (interpolate) values in our environment by placing expressions
// inside of the {{ and }} delimiters. In this case we're referring
// to the file g1kv37 declared above, and our output, named out.
//
// Many types of expressions can be interpolated inside of an exec,
// for example strings, integers, files, and directories. Strings
// and integers are rendered using their normal representation,
// files and directories are materialized to a local path before
// starting execution. Thus, in this case, {{g1kv37}} is replaced at
// runtime by a path on disk to a file with the contents of the
// file g1kv37.
