# replyr

Patches for using `dplyr` with Databases and Big Data

## Install / Use
replyr is post-maintenance: we are no longer bug-fixing or updating this package. It has not been practical to track the shifting dplyr/dbplyr/rlang APIs and data structures after dplyr 0.5. Most of what replyr does is now done better in one of our newer, non-monolithic packages:

- Programming and meta-programming tools: wrapr.
- Big data manipulation: rquery and cdata.
- Adapting to standard evaluation interfaces: seplyr.
This document describes replyr, an R
package available from Github and
CRAN.
## Introduction
It comes as a bit of a shock for R
dplyr users when they
switch from using a tbl implementation based on R in-memory
data.frames to one based on a remote database or service. A lot of the
power and convenience of the dplyr notation is hard to maintain with
these more restricted data service providers. Things that work locally
can’t always be used remotely at scale. It is emphatically not yet the
case that one can practice with dplyr in one modality and hope to move
to another back-end without significant debugging and work-arounds. The
replyr package attempts to
provide practical data manipulation affordances to make code perform
similarly on local or remote (big) data.
Note: replyr is meant only for “tame data frames”, that is, data frames
with non-duplicate column names that are also valid simple (quote-free)
R variable names, and with columns of simple R vector types
(numbers, strings, and such).

replyr supplies methods to get a grip on working with remote tbl
sources (SQL databases, Spark) through dplyr. The idea is to add
convenience functions to make such tasks more like working with an
in-memory data.frame. Results still do depend on which dplyr service
you use, but with replyr you have fairly uniform access to some useful
functions. The rule of thumb is: try dplyr first, and if that does not
work check if replyr has researched a work-around.
replyr uniformly prefers standard or parametric interfaces (names of
variables as strings) over name capture, so that you can easily
program over replyr.
Primary replyr services include:

- Join Controller
- Join Planner
- replyr::replyr_split
- replyr::replyr_bind_rows
- replyr::gapply
- replyr::replyr_summary
- replyr::replyr_apply_f_mapped
- wrapr::let
## wrapr::let
wrapr::let allows execution of arbitrary code with substituted
variable names (note this is subtly different from binding values for
names, as with base::substitute or base::with). This allows the user
to write arbitrary dplyr code in the case of “parametric variable
names”
(that is when variable names are not known at coding time, but will
become available later at run time as values in other variables) without
directly using the dplyr “underbar forms” (and the direct use of
lazyeval::interp, .dots=stats::setNames, or rlang/tidyeval).
Example:
```r
library('dplyr')

# nice parametric function we write
ComputeRatioOfColumns <- function(d,
                                  NumeratorColumnName,
                                  DenominatorColumnName,
                                  ResultColumnName) {
  wrapr::let(
    alias = list(NumeratorColumn = NumeratorColumnName,
                 DenominatorColumn = DenominatorColumnName,
                 ResultColumn = ResultColumnName),
    expr = {
      # (pretend) large block of code written with concrete column names.
      # due to the let wrapper in this function it will behave as if it was
      # using the specified parametric column names.
      d %>% mutate(ResultColumn = NumeratorColumn/DenominatorColumn)
    })
}

# example data
d <- data.frame(a = 1:5, b = 3:7)

# example application
d %>% ComputeRatioOfColumns('a', 'b', 'c')
#   a b         c
# 1 1 3 0.3333333
# 2 2 4 0.5000000
# 3 3 5 0.6000000
# 4 4 6 0.6666667
# 5 5 7 0.7142857
```
wrapr::let makes the construction of abstract functions over dplyr-controlled
data much easier. It is designed for the case where the
“expr” block is a large sequence of statements and pipelines.
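For comparison, here is a sketch of what the substituted `expr` block effectively evaluates to for the call `ComputeRatioOfColumns('a', 'b', 'c')` above: plain dplyr code with the concrete names `a`, `b`, and `c` swapped in.

```r
library("dplyr")

# the example data from above
d <- data.frame(a = 1:5, b = 3:7)

# after wrapr::let maps NumeratorColumn -> a, DenominatorColumn -> b,
# and ResultColumn -> c, the expr block behaves like this plain mutate:
res <- d %>% mutate(c = a / b)
```

This is the point of the technique: the body is written once against readable concrete names, yet at run time it operates on whichever columns the caller specifies.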
## replyr::replyr_apply_f_mapped
wrapr::let was only the secondary proposal in the original 2016
“Parametric variable names”
article.
What we really wanted was a stack of views, so that the data pretends to have
names that match the code (i.e., re-mapping the data, not the code).
With a bit of thought we can achieve this if we associate the data
re-mapping with a function environment instead of with the data. A
re-mapping is then active as long as a given controlling function is in
control. In our case that function is replyr::replyr_apply_f_mapped(),
which works as follows:
Suppose the operation we wish to use is a rank-reducing function that
has been supplied from somewhere we do not control (such as a package).
The function could be as simple as the following, but we are going to
assume we want to use it without alteration (including without even the
small alteration of introducing wrapr::let()).
```r
# an external function with hard-coded column names
DecreaseRankColumnByOne <- function(d) {
  d$RankColumn <- d$RankColumn - 1
  d
}
```
To apply this function to d (which doesn’t have the expected column
names!) we use replyr::replyr_apply_f_mapped() to create a new
parametrized adapter as follows:
```r
# our data
d <- data.frame(Sepal_Length = c(5.8, 5.7),
                Sepal_Width = c(4.0, 4.4),
                Species = 'setosa',
                rank = c(1, 2))

# a wrapper to introduce parameters
DecreaseRankColumnByOneNamed <- function(d, ColName) {
  replyr::replyr_apply_f_mapped(d,
                                f = DecreaseRankColumnByOne,
                                nmap = c(RankColumn = ColName),
                                restrictMapIn = FALSE,
                                restrictMapOut = FALSE)
}

# use
dF <- DecreaseRankColumnByOneNamed(d, 'rank')
print(dF)
#   Sepal_Length Sepal_Width Species rank
# 1          5.8         4.0  setosa    0
# 2          5.7         4.4  setosa    1
```
replyr::replyr_apply_f_mapped() renames the columns to the names
expected by DecreaseRankColumnByOne (the mapping specified in nmap),
applies DecreaseRankColumnByOne, and then inverts the mapping before
returning the value.
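The rename/apply/invert pattern can be sketched in a few lines of base R for local data.frames. This is only an illustration: `apply_f_mapped_local` below is a hypothetical helper, not part of replyr, and `replyr::replyr_apply_f_mapped()` additionally handles remote tbls and the `restrictMapIn`/`restrictMapOut` options.

```r
# sketch of the rename / apply / invert-rename pattern for local
# data.frames; apply_f_mapped_local is a hypothetical illustration,
# not part of replyr
apply_f_mapped_local <- function(d, f, nmap) {
  # nmap: c(NameExpectedByF = NameInData)
  # forward: rename data columns to the names f expects
  idx <- match(names(d), unname(nmap))
  names(d)[!is.na(idx)] <- names(nmap)[idx[!is.na(idx)]]
  d <- f(d)
  # invert: restore the original column names
  idx <- match(names(d), names(nmap))
  names(d)[!is.na(idx)] <- unname(nmap)[idx[!is.na(idx)]]
  d
}

# the external function with hard-coded column names, as above
DecreaseRankColumnByOne <- function(d) {
  d$RankColumn <- d$RankColumn - 1
  d
}

d <- data.frame(rank = c(1, 2))
res <- apply_f_mapped_local(d, DecreaseRankColumnByOne, c(RankColumn = "rank"))
print(res)
#   rank
# 1    0
# 2    1
```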
## replyr::replyr_split
replyr::replyr_split and replyr::replyr_bind_rows work over many
remote data types including Spark. This allows code like the
following:
```r
suppressPackageStartupMessages(library("dplyr"))
library("replyr")

sc <- sparklyr::spark_connect(version = '2.0.2',
                              master = "local")
diris <- copy_to(sc, iris, 'diris')

f2 <- . %>%
  arrange(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width) %>%
  head(2)

diris %>%
  replyr_split('Species') %>%
  lapply(f2) %>%
  replyr_bind_rows()
## Source:   query [6 x 5]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
## 
## # A tibble: 6 x 5
##      Species Sepal_Length Sepal_Width Petal_Length Petal_Width
##        <chr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1 versicolor          5.0         2.0          3.5         1.0
## 2 versicolor          4.9         2.4          3.3         1.0
## 3     setosa          4.3         3.0          1.1         0.1
## 4     setosa          4.4         2.9          1.4         0.2
## 5  virginica          4.9         2.5          4.5         1.7
## 6  virginica          5.6         2.8          4.9         2.0

sparklyr::spark_disconnect(sc)
```
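The same split/apply/bind pattern runs on a local data.frame if we substitute `base::split` and `dplyr::bind_rows` for the replyr verbs (a local sketch; the point of `replyr_split`/`replyr_bind_rows` is that they also work on remote sources such as Spark). Note that the local `iris` uses dotted column names (`Sepal.Length`) where the Spark copy used underscores.

```r
suppressPackageStartupMessages(library("dplyr"))

# local analogue of the Spark example above
f2 <- . %>%
  arrange(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  head(2)

res <- iris %>%
  split(iris$Species) %>%   # base::split stands in for replyr_split
  lapply(f2) %>%
  bind_rows()               # dplyr::bind_rows stands in for replyr_bind_rows

nrow(res)  # two rows per species, 6 in all
```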
## replyr::gapply
replyr::gapply is a “grouped ordered apply” data operation. Many
calculations can be written in terms of this primitive, including
per-group rank calculation (assuming your data service supports window
functions), per-group summaries, and per-group selections. It is meant
to be a specialization of the “Split-Apply-Combine”
strategy, with all three steps wrapped into a single operator.
Example:
```r
library('dplyr')

d <- data.frame(group = c(1, 1, 2, 2, 2),
                order = c(.1, .2, .3, .4, .5))

rank_in_group <- . %>% mutate(constcol = 1) %>%
  mutate(rank = cumsum(constcol)) %>% select(-constcol)

d %>% replyr::gapply('group', rank_in_group, ocolumn = 'order', decreasing = TRUE)
#   group order rank
# 1     1   0.2    1
# 2     1   0.1    2
# 3     2   0.5    1
# 4     2   0.4    2
# 5     2   0.3    3
```
The user supplies a function or pipeline that is meant to be applied
per-group, and the replyr::gapply wrapper orchestrates the calculation.
In this example rank_in_group was assumed to know the column layout of
d, while the grouping and ordering columns were supplied to
replyr::gapply as string arguments ('group' and ocolumn = 'order').