Drake
An R-focused pipeline toolkit for reproducibility and high-performance computing
drake is superseded. Consider targets instead.
As of 2021-01-21, drake is superseded. The targets R package is the long-term successor of drake, and it is more robust and easier to use. Please visit https://books.ropensci.org/targets/drake.html for full context and advice on transitioning.
The drake R package <img src="https://docs.ropensci.org/drake/reference/figures/logo.svg" align="right" alt="logo" width="120" height = "139" style = "border: none; float: right;">
Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. How much of that valuable output can you keep, and how much do you need to update? How much runtime must you endure all over again?
For projects in R, the drake package can help. It analyzes your workflow,
skips steps with up-to-date results, and orchestrates the rest with optional
distributed computing. At the end, drake provides evidence that your results
match the underlying code and data, which increases your ability to trust
your research.
Video
That Feeling of Workflowing (Miles McBain)
<center> <a href="https://www.youtube.com/embed/jU1Zv21GvT4"> <img src="https://docs.ropensci.org/drake/reference/figures/workflowing.png" alt="workflowing" align="center" style = "border: none; float: center;"> </a> </center>
(By Miles McBain; venue, resources)
rOpenSci Community Call
<center> <a href="https://ropensci.org/commcalls/2019-09-24/"> <img src="https://docs.ropensci.org/drake/reference/figures/commcall.png" alt="commcall" align="center" style = "border: none; float: center;"> </a> </center>
What gets done stays done.
Too many data science projects follow a Sisyphean loop:
- Launch the code.
- Wait while it runs.
- Discover an issue.
- Rerun from scratch.
For projects with long runtimes, this process gets tedious. But with
drake, you can automatically
- Launch the parts that changed since last time.
- Skip the rest.
How it works
To set up a project, load your packages,
library(drake)
library(dplyr)
library(ggplot2)
library(tidyr)
#>
#> Attaching package: 'tidyr'
#> The following objects are masked from 'package:drake':
#>
#> expand, gather
load your custom functions,
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone)) +
    theme_gray(24)
}
check any supporting files (optional),
# Get the files with drake_example("main").
file.exists("raw_data.xlsx")
#> [1] TRUE
file.exists("report.Rmd")
#> [1] TRUE
and plan what you are going to do.
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))),
  hist = create_plot(data),
  fit = lm(Ozone ~ Wind + Temp, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)
plan
#> # A tibble: 5 x 2
#>   target   command
#>   <chr>    <expr_lst>
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx")) …
#> 2 data     raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TR…
#> 3 hist     create_plot(data) …
#> 4 fit      lm(Ozone ~ Wind + Temp, data) …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_out("re…
So far, we have just been setting the stage. Use make() or r_make() to do
the real work. Targets are built in the correct order regardless of the row
order of plan.
make(plan) # See also r_make().
#> ▶ target raw_data
#> ▶ target data
#> ▶ target fit
#> ▶ target hist
#> ▶ target report
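The ordering claim can be checked with a minimal sketch. The toy plan below is not part of the example project: its target names (total, squares, numbers) are hypothetical, and its rows are deliberately listed in reverse dependency order. The sketch assumes drake is installed and uses new_cache(tempfile()) so that nothing is written to a real .drake/ folder.

```r
library(drake)

# Hypothetical toy plan: rows are listed in reverse dependency order on purpose.
toy_plan <- drake_plan(
  total = sum(squares),    # depends on squares
  squares = numbers^2,     # depends on numbers
  numbers = c(1, 2, 3)     # no upstream targets
)

# Throwaway cache in a temporary directory (assumption: we do not want to
# touch the project's own .drake/ folder).
toy_cache <- new_cache(tempfile())

# drake works out the dependency order from the commands, not the row order.
make(toy_plan, cache = toy_cache)

# Read the final target back from the temporary cache.
readd(total, cache = toy_cache)
```

If the targets were built in row order, total would fail because squares would not exist yet. Because make() succeeds, the build order must have come from the dependency graph.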
Except for files like report.html, your output is stored in a hidden
.drake/ folder. Reading it back is easy.
readd(data) # See also loadd().
#> # A tibble: 153 x 6
#>    Ozone Solar.R  Wind  Temp Month   Day
#>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  41       190   7.4    67     5     1
#>  2  36       118   8      72     5     2
#>  3  12       149  12.6    74     5     3
#>  4  18       313  11.5    62     5     4
#>  5  42.1      NA  14.3    56     5     5
#>  6  28        NA  14.9    66     5     6
#>  7  23       299   8.6    65     5     7
#>  8  19        99  13.8    59     5     8
#>  9   8        19  20.1    61     5     9
#> 10  42.1     194   8.6    69     5    10
#> # … with 143 more rows
You may look back on your work and see room for improvement, but it’s
all good! The whole point of drake is to help you go back and change
things quickly and painlessly. For example, we forgot to give our
histogram a bin width.
readd(hist)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
So let’s fix the plotting function.
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), binwidth = 10) +
    theme_gray(24)
}
drake knows which results are affected.
vis_drake_graph(plan) # See also r_vis_drake_graph().
<img src="https://docs.ropensci.org/drake/reference/figures/graph.png" alt="hist1" align="center" style = "border: none; float: center;" width = "600px">
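The graph has a text counterpart in outdated(), which lists the targets that the next make() would rebuild. The sketch below is a self-contained toy rather than the example project: the helper scale_up() and the target names are hypothetical. It assumes a recent drake version in which outdated() accepts the plan directly, and a throwaway cache created with new_cache(tempfile()).

```r
library(drake)

# Hypothetical helper function used by one target only.
scale_up <- function(x) x * 2

# Hypothetical toy plan: scaled depends on scale_up(), count does not.
toy_plan <- drake_plan(
  numbers = c(1, 2, 3),
  scaled = scale_up(numbers),
  count = length(numbers)
)

# Throwaway cache so the sketch leaves no .drake/ folder behind.
toy_cache <- new_cache(tempfile())
make(toy_plan, cache = toy_cache)

# Right after make(), no target should be reported as outdated.
before_edit <- outdated(toy_plan, cache = toy_cache)

# Change the helper, as the plotting function was changed above.
scale_up <- function(x) x * 10

# Only the target whose command calls scale_up() should now be listed.
after_edit <- outdated(toy_plan, cache = toy_cache)
after_edit
```

Here count stays up to date because its command never calls scale_up(), which mirrors how the data and model are left alone when only the plotting function changes.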
The next make() just builds hist and report.html. No point in
wasting time on the data or model.
make(plan) # See also r_make().
#> ▶ target hist
#> ▶ target report
loadd(hist)
hist
Reproducibility with confidence
The R community emphasizes reproducibility. Traditional themes include
scientific replicability, literate programming with knitr, and version
control with git. But internal consistency is important too. Reproducibility
carries the promise that your output matches the code and data you say you
used. With the exception of non-default triggers and hasty mode, drake
strives to keep this promise.
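To illustrate why non-default triggers are an exception, here is a minimal sketch. The target names stamp and fixed are hypothetical, and the cache is a temporary one. It sets trigger(condition = TRUE) on one target, which asks drake to rebuild that target on every make(), even when nothing it depends on has changed.

```r
library(drake)

# Hypothetical toy plan with one custom trigger and one default target.
toy_plan <- drake_plan(
  stamp = target(
    Sys.time(),
    trigger = trigger(condition = TRUE)  # rebuild on every make()
  ),
  fixed = 1 + 1                          # default triggers
)

# Throwaway cache so the sketch leaves no .drake/ folder behind.
toy_cache <- new_cache(tempfile())

make(toy_plan, cache = toy_cache)
first <- readd(stamp, cache = toy_cache)

Sys.sleep(1)  # make sure the clock advances between runs

# Neither the code nor the data changed, yet stamp is built again.
make(toy_plan, cache = toy_cache)
second <- readd(stamp, cache = toy_cache)

second > first
```

With a custom trigger like this, the decision to rebuild no longer comes from code and data alone, so the stored value is weaker evidence that the output matches them.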
Evidence
Suppose you are reviewing someone else’s data analysis project for
reproducibility. You scrutinize it carefully, checking that the datasets
are available and the documentation is thorough. But could you re-create
the results without the help of the original author? With drake, it is
quick and easy to find out.
make(plan) # See also r_m
