Safejoin
Wrappers around dplyr functions to join safely using various checks
Install / Use
/learn @moodymudskipper/SafejoinREADME
safejoin
The package safejoin features wrappers around packages dplyr and fuzzyjoin's functions to join safely using various checks. It also comes packed with features to select columns, rename them, operate on conflicting ones (coalesce for example), or aggregate the rhs on the joining columns before joining.
Install package with:
# install.packages(devtools)
devtools::install_github("moodymudskipper/safejoin")
Joining operations often come with tests, one might want to check that:
bycolumns are given explicitly (dplyr displays a message if they're not)- Factor columns used for the join have the same levels (dplyr displays a warning if they don't)
- No columns are repeated in both data.frames apart from
bycolumns (dplyr keeps them both and suffixes them silently) - Join columns form a unique key on both or either tables
- All rows of both or either tables will be matched
- All combinations of values of join columns are present on both or either sides
- columns used for joins have same class and type
This package provides the possibility to ignore, inform, warn or abort for any of combination of these cases.
These checks are handled by a single string parameter, i.e. a sequence of characters where uppercase letters trigger failures, lower case letters trigger warnings, and letters prefixed with ~ trigger messages, the codes are as follow:
"c"to check conflicts of columns"b"like "by" checks ifbyparameter was given explicitly"u"like unique to check that the join columns form an unique key onx"v"to check that the join columns form an unique key ony"m"like match to check that all rows ofxhave a match"n"to check that all rows ofyhave a match"e"like expand to check that all combinations of joining columns are present inx"f"to check that all combinations of joining columns are present iny"l"like levels to check that join columns are consistent in term of factor levels"t"like type to check that joining columns have same class and type
For example, check = "MN" will ensure that all rows of both tables are matched.
Additionally when identically named columns are present on both sides, we can aggregate them into one in flexible ways (including coalesce or just keeping one of them). This is done through the conflict parameter.
The package features functions safe_left_join, safe_right_join, safe_inner_join, safe_full_join, safe_nest_join, safe_semi_join, safe_anti_join, and eat.
The additional function, eat is designed to be an improved join in the cases where one is growing a data frame. In addition to the features above :
- It uses the
...argument to select columns from.yand leverages the select helpers from dplyr, allowing also things like renaming, negative selection, quasi-quotation... - It can prefix new columns or rename them in a flexible way
- It can summarize
.yon the fly along joining columns for more concise and readable code - It can join recursively to a list of tables
The support of fuzzyjoin functions is done in two ways, fuzzyjoin functions will be used instead of dplyr's functions if :
- The argument
match_funis filled. Then the standardfuzzyjoininterface is leveraged, except thatsafejoinsupports formula notation for this argument. - A formula argument is provided to the
byargument. It should use a notation like~ X("var1") > Y("var2") & X("var3") < Y("var4"). This was introduced to avoid using the argumentsmulti_byandmulti_match_funfromfuzzyjoin::fuzzy_joinwhich I felt were confusing, and have a single readable argument instead.
safe_left_join
safejoin offers the same features for all safe_*_join functions so we'll only review safe_left_join here, we also limit ourselves to checks of the form ~*
We'll use dplyr's data sets band_members and band_instruments along with extended versions.
library(safejoin)
library(dplyr,quietly = TRUE,warn.conflicts = FALSE)
band_members_extended <- band_members %>%
mutate(cooks = factor(c("pasta","pizza","spaghetti"),
levels = c("pasta","pizza","spaghetti"))) %>%
add_row(name = "John",band = "The Who", cooks = "pizza")
band_instruments_extended <- band_instruments %>%
mutate(cooks = factor(c("pizza","pasta","pizza")))
band_members
#> # A tibble: 3 x 2
#> name band
#> <chr> <chr>
#> 1 Mick Stones
#> 2 John Beatles
#> 3 Paul Beatles
band_instruments
#> # A tibble: 3 x 2
#> name plays
#> <chr> <chr>
#> 1 John guitar
#> 2 Paul bass
#> 3 Keith guitar
band_members_extended
#> # A tibble: 4 x 3
#> name band cooks
#> <chr> <chr> <fct>
#> 1 Mick Stones pasta
#> 2 John Beatles pizza
#> 3 Paul Beatles spaghetti
#> 4 John The Who pizza
band_instruments_extended
#> # A tibble: 3 x 3
#> name plays cooks
#> <chr> <chr> <fct>
#> 1 John guitar pizza
#> 2 Paul bass pasta
#> 3 Keith guitar pizza
Not applying any check :
safe_left_join(band_members,
band_instruments,
check = "")
#> # A tibble: 3 x 3
#> name band plays
#> <chr> <chr> <chr>
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
Displaying "Joining, by..." like in default dplyr behavior:
safe_left_join(band_members,
band_instruments,
check = "~b")
#> Joining, by = "name"
#> # A tibble: 3 x 3
#> name band plays
#> <chr> <chr> <chr>
#> 1 Mick Stones <NA>
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
Check column conflict when joining extended datasets by name:
try(safe_left_join(band_members_extended,
band_instruments_extended,
by = "name",
check = "~c"))
#> Conflict of columns: cooks
#> # A tibble: 4 x 5
#> name band cooks.x plays cooks.y
#> <chr> <chr> <fct> <chr> <fct>
#> 1 Mick Stones pasta <NA> <NA>
#> 2 John Beatles pizza guitar pizza
#> 3 Paul Beatles spaghetti bass pasta
#> 4 John The Who pizza guitar pizza
Check if x has unmatched combinations:
safe_left_join(band_members_extended,
band_instruments_extended,
by = c("name","cooks"),
check = "~m")
#> x has unmatched sets of joining values:
#> # A tibble: 2 x 2
#> name cooks
#> <chr> <chr>
#> 1 Mick pasta
#> 2 Paul spaghetti
#> # A tibble: 4 x 4
#> name band cooks plays
#> <chr> <chr> <chr> <chr>
#> 1 Mick Stones pasta <NA>
#> 2 John Beatles pizza guitar
#> 3 Paul Beatles spaghetti <NA>
#> 4 John The Who pizza guitar
Check if y has unmatched combinations:
safe_left_join(band_members_extended,
band_instruments_extended,
by = c("name","cooks"),
check = "~n")
#> y has unmatched sets of joining values:
#> # A tibble: 2 x 2
#> name cooks
#> <chr> <chr>
#> 1 Paul pasta
#> 2 Keith pizza
#> # A tibble: 4 x 4
#> name band cooks plays
#> <chr> <chr> <chr> <chr>
#> 1 Mick Stones pasta <NA>
#> 2 John Beatles pizza guitar
#> 3 Paul Beatles spaghetti <NA>
#> 4 John The Who pizza guitar
Check if x has absent combinations:
safe_left_join(band_members_extended,
band_instruments_extended,
by = c("name","cooks"),
check = "~e")
#> Some combinations of joining values are absent from x:
#> %s # A tibble: 6 x 2
#> name cooks
#> <chr> <chr>
#> 1 John pasta
#> 2 Paul pasta
#> 3 Mick pizza
#> 4 Paul pizza
#> 5 Mick spaghetti
#> 6 John spaghetti
#> # A tibble: 4 x 4
#> name band cooks plays
#> <chr> <chr> <chr> <chr>
#> 1 Mick Stones pasta <NA>
#> 2 John Beatles pizza guitar
#> 3 Paul Beatles spaghetti <NA>
#> 4 John The Who pizza guitar
Check if y has absent combinations:
safe_left_join(band_members_extended,
band_instruments_extended,
by = c("name","cooks"),
check = "~f")
#> Some combinations of joining values are absent from y:
#> %s # A tibble: 3 x 2
#> name cooks
#> <chr> <chr>
#> 1 Paul pizza
#> 2 John pasta
#> 3 Keith pasta
#> # A tibble: 4 x 4
#> name band cooks plays
#> <chr> <chr> <chr> <chr>
#> 1 Mick Stones pasta <NA>
#> 2 John Beatles pizza guitar
#> 3 Paul Beatles spaghetti <NA>
#> 4 John The Who pizza guitar
Check if x is unique on joining columns:
safe_left_join(band_members_extended,
band_instruments_extended,
by = c("name","cooks"),
check = "~u")
#> x is not unique on name and cooks
#> # A tibble: 4 x 4
#> name band cooks plays
#> <chr> <chr> <chr> <chr>
#> 1 Mick Stones pasta <NA>
#> 2 John Beatles pizza guitar
#> 3 Paul Beatles spaghetti <NA>
#> 4 John The Who pizza guitar
Check if y is unique on joining columns (it is):
safe_left_join(band_members_extended,
band_instruments_extended,
by = c("name","cooks"),
check = "~v")
#> # A tibble: 4 x 4
#> name band cooks plays
#> <chr> <chr> <chr> <chr>
#> 1 Mick Stones pasta <NA>
#> 2 John Beatles pizza guitar
#> 3 Paul Beatles spaghetti <NA>
#> 4 John The Who pizza guitar
``
Related Skills
node-connect
339.3kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
83.9kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
339.3kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
83.9kCommit, push, and open a PR
