
<!-- README.md is generated from README.Rmd. Please edit that file -->

tidyfast v0.4.0 <img src="man/figures/tidyfast_hex.png" align="right" width="30%" height="30%" />

<!-- badges: start -->

CRAN status | Lifecycle: maturing | Codecov test coverage | Downloads | R-CMD-check

<!-- badges: end -->

Note: The expansion of dtplyr has made some of the functionality in tidyfast redundant. See dtplyr for a list of functions that are handled within that framework.

The goal of tidyfast is to provide fast and efficient alternatives to some tidyr (and a few dplyr) functions using data.table under the hood. Each has the prefix dt_ to allow for autocomplete in IDEs such as RStudio. These functions should complement some of the current functionality in dtplyr (but notably do not use the lazy_dt() framework of dtplyr). This package imports data.table and cpp11 (no other dependencies).

These are, in essence, translations from a more tidyverse grammar to data.table. Most functions herein are in places where, in my opinion, the data.table syntax is not obvious or clear. As such, these functions can translate a simple function call into the fast, efficient, and concise syntax of data.table.
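As a rough illustration of what "translation" means here, the sketch below nests a small made-up table with dt_nest() and then with a hand-written data.table expression. The hand-written form is an assumption about a comparable idiom on a recent data.table version, not a claim about the exact code tidyfast runs internally.

```r
library(data.table)
library(tidyfast)

# Small illustrative table (made-up values)
dt_small <- data.table(
  grp = c("a", "a", "b"),
  x   = c(1, 2, 3)
)

# tidyverse-style call: one row per group, remaining columns in a list-column
nested_tf <- dt_nest(dt_small, grp)

# A comparable plain data.table expression (assumed equivalent idiom)
nested_dt <- dt_small[, list(data = list(.SD)), keyby = grp]
```

Both objects should hold one row per value of grp, with the remaining columns stored as data tables in a list-column.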

The current functions include:

Nesting and unnesting (similar to dplyr::group_nest() and tidyr::unnest()):

  • dt_nest() for nesting data tables
  • dt_unnest() for unnesting data tables
  • dt_hoist() for unnesting vectors in a list-column in a data table

Pivoting (similar to tidyr::pivot_longer() and tidyr::pivot_wider())

  • dt_pivot_longer() for fast pivoting using data.table::melt()
  • dt_pivot_wider() for fast pivoting using data.table::dcast()
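A minimal round-trip sketch on a made-up table may help show how the two pivoting functions fit together. The table and column names are illustrative only; the argument names follow the tidyr-style interface described in this README.

```r
library(data.table)
library(tidyfast)

# Made-up wide table: one row per id, one column per measurement
wide <- data.table(id = 1:2, a = c(10, 20), b = c(30, 40))

# Wide to long: every column except `id` is stacked into name/value pairs
long <- dt_pivot_longer(wide, cols = c(-id), names_to = "name", values_to = "value")

# Long back to wide: one column per distinct value of `name`
wide_again <- dt_pivot_wider(long, names_from = name, values_from = value)
```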

If Else (similar to dplyr::case_when()):

  • dt_case_when() for dplyr::case_when() syntax with the speed of data.table::fifelse()
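The sketch below, on a made-up vector, shows the formula syntax next to a hand-written nested data.table::fifelse() call that expresses the same logic. The nested form is written for comparison and is not presented as tidyfast's generated code.

```r
library(data.table)
library(tidyfast)

x <- c(-2, 0, 3)  # made-up values

# Formula syntax: condition ~ value, checked in order
sign_tf <- dt_case_when(
  x < 0  ~ "negative",
  x == 0 ~ "zero",
  x > 0  ~ "positive"
)

# The same logic written by hand as nested fifelse() calls
sign_dt <- fifelse(x < 0, "negative",
           fifelse(x == 0, "zero",
           fifelse(x > 0, "positive", NA_character_)))
```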

Fill (similar to tidyr::fill())

  • dt_fill() for filling NA values with the values before them, after them, or both. This can be done by a grouping variable (e.g., fill in NA values using only values from the same individual).
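A small sketch of a grouped fill, assuming two made-up individuals with gaps in a numeric column. The id = list(id) form and the .direction argument are taken from the package's documented usage as best understood here; treat the exact call as an assumption and check ?dt_fill.

```r
library(data.table)
library(tidyfast)

# Made-up measurements with gaps for two individuals
obs <- data.table(
  id = c(1, 1, 1, 2, 2, 2),
  y  = c(1, NA, NA, NA, 5, NA)
)

# Fill downward within each individual; the leading NA of id 2 has
# no earlier value in its own group, so it stays NA
filled <- dt_fill(obs, y, id = list(id), .direction = "down")
```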

Count and Uncount (similar to dplyr::count() and tidyr::uncount())

  • dt_count() for fast counting by group(s)
  • dt_uncount() for creating full data from a count table
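The two functions are roughly inverses of each other, which the following sketch on a made-up table tries to show. The count column is assumed to be named N, following data.table's .N convention.

```r
library(data.table)
library(tidyfast)

pets <- data.table(kind = c("cat", "dog", "cat", "cat"))  # made-up data

# One row per group, with the group size in column N
counts <- dt_count(pets, kind)

# Expand the count table back to one row per original observation
expanded <- dt_uncount(counts, N)
```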

Separate (similar to tidyr::separate())

  • dt_separate() for splitting a single column into multiple based on a match within the column (e.g., column with values like “A.B” could be split into two columns by using the period as the separator where column 1 would have “A” and 2 would have “B”). It is built on data.table::tstrsplit(). This is not well tested yet and lacks some functionality of tidyr::separate().
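The "A.B" case described above can be sketched as follows, next to a hand-written data.table::tstrsplit() version for comparison. The new column names are made up, and the default separator is assumed to be a literal period.

```r
library(data.table)
library(tidyfast)

codes <- data.table(x = c("A.B", "C.D"))  # made-up values

# Split on the (assumed default) literal period into two new columns
parts <- dt_separate(codes, x, into = c("first", "second"))

# Hand-written data.table idiom for the same split, on a copy of the input
parts_dt <- copy(codes)[, c("first", "second") := tstrsplit(x, ".", fixed = TRUE)][, x := NULL]
```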

Adjust data.table print options

  • dt_print_options() for adjusting the options for print.data.table()
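For context, print.data.table() reads several global options. The sketch below sets a few of data.table's own print options directly with options(); it is assumed, not confirmed here, that dt_print_options() manages settings of this kind, so see ?dt_print_options for its actual arguments.

```r
library(data.table)

# Standard data.table print options, set directly for illustration
old <- options(
  datatable.print.class = TRUE,  # show column classes under the header
  datatable.print.topn  = 3,     # rows shown at top and bottom when truncated
  datatable.print.nrows = 20     # tables longer than this are truncated
)

print(data.table(x = 1:50))

options(old)  # restore the previous settings
```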

General API

tidyfast attempts to convert syntax from tidyr with its accompanying grammar to data.table function calls. As such, we have tried to maintain the tidyr syntax as closely as possible without hurting speed and efficiency. Some more advanced use cases in tidyr may not translate yet. We try to be transparent about the shortcomings in syntax and behavior where known.

Each function that takes data (labeled as dt_ in the package docs) as its first argument automatically coerces it to a data table with as.data.table() if it isn’t already a data table. Each of these functions will return a data table.
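A short sketch of the coercion behavior, using a made-up data.frame and dt_count() as the example function:

```r
library(data.table)
library(tidyfast)

df <- data.frame(kind = c("cat", "dog", "cat"))  # plain data.frame, made-up values

res <- dt_count(df, kind)

is.data.table(df)   # the input itself remains a plain data.frame
is.data.table(res)  # the returned object is a data table
```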

Installation

You can install the stable version from CRAN with:

install.packages("tidyfast")

or you can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("TysonStanley/tidyfast")
#> ℹ Loading tidyfast

Examples

The initial versions of the nesting and unnesting functions were shown in a preprint. Shown here are some simple applications and the functions’ speed and efficiency.

library(tidyfast)

Nesting and Unnesting

The following data table will be used for the nesting/unnesting examples.

set.seed(84322)

library(data.table)
library(dplyr)       # to compare with case_when()
library(tidyr)       # to compare with fill() and separate()
library(ggplot2)     # figures
library(ggbeeswarm)  # figures

dt <- data.table(
   x = rnorm(1e5),
   y = runif(1e5),
   grp = sample(1L:5L, 1e5, replace = TRUE),
   nested1 = lapply(1:10, sample, 10, replace = TRUE),
   nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
   id = 1:1e5)

To make all the comparisons herein more equal, we will set the number of threads that data.table will use to 1.

setDTthreads(1)

We can nest this data using dt_nest():

nested <- dt_nest(dt, grp)
nested
#> Key: <grp>
#>      grp                  data
#>    <int>                <list>
#> 1:     1 <data.table[19638x5]>
#> 2:     2 <data.table[19987x5]>
#> 3:     3 <data.table[20033x5]>
#> 4:     4 <data.table[20269x5]>
#> 5:     5 <data.table[20073x5]>

We can also unnest this with dt_unnest():

dt_unnest(nested, col = data)
#> Key: <grp>
#>           grp          x           y               nested1
#>         <int>      <num>       <num>                <list>
#>      1:     1 -1.1813164 0.004599736       2,2,1,2,1,1,...
#>      2:     1 -1.0384420 0.853208540       2,8,4,6,7,7,...
#>      3:     1 -0.6247028 0.072652533       4,2,2,1,1,1,...
#>      4:     1 -1.3651514 0.569079215       1,1,1,3,6,2,...
#>      5:     1  0.1403744 0.864617284 10, 1, 1, 1, 8, 1,...
#>     ---                                                   
#>  99996:     5 -0.3437795 0.995197776       2,1,2,2,2,1,...
#>  99997:     5  1.6157744 0.241735719 10, 1, 1, 1, 8, 1,...
#>  99998:     5 -0.1321246 0.885283934       2,3,3,2,2,4,...
#>  99999:     5 -1.7019715 0.524621296       5,4,3,3,3,2,...
#> 100000:     5  0.3821493 0.032851280       2,8,4,6,7,7,...
#>                                               nested2    id
#>                                                <list> <int>
#>      1: thing2,thing2,thing2,thing2,thing2,thing2,...     2
#>      2: thing2,thing2,thing2,thing2,thing2,thing2,...     8
#>      3: thing1,thing1,thing1,thing1,thing1,thing1,...    15
#>      4: thing1,thing1,thing1,thing1,thing1,thing1,...    17
#>      5: thing2,thing2,thing2,thing2,thing2,thing2,...    20
#>     ---                                                    
#>  99996: thing1,thing1,thing1,thing1,thing1,thing1,... 99983
#>  99997: thing2,thing2,thing2,thing2,thing2,thing2,... 99990
#>  99998: thing2,thing2,thing2,thing2,thing2,thing2,... 99994
#>  99999: thing2,thing2,thing2,thing2,thing2,thing2,... 99996
#> 100000: thing2,thing2,thing2,thing2,thing2,thing2,... 99998
#>                          data
#>                        <list>
#>      1: <data.table[19638x5]>
#>      2: <data.table[19638x5]>
#>      3: <data.table[19638x5]>
#>      4: <data.table[19638x5]>
#>      5: <data.table[19638x5]>
#>     ---                      
#>  99996: <data.table[20073x5]>
#>  99997: <data.table[20073x5]>
#>  99998: <data.table[20073x5]>
#>  99999: <data.table[20073x5]>
#> 100000: <data.table[20073x5]>

When our list columns don’t contain data tables (as in the output from dt_nest()), we can use the dt_hoist() function, which unnests vectors. It also keeps all the other variables that are not list-columns.

dt_hoist(dt, nested1, nested2)
#>                  x         y   grp     id nested1 nested2
#>              <num>     <num> <int>  <int>   <int>  <char>
#>       1: 0.1720703 0.3376675     2      1       1  thing1
#>       2: 0.1720703 0.3376675     2      1       1  thing1
#>       3: 0.1720703 0.3376675     2      1       1  thing1
#>       4: 0.1720703 0.3376675     2      1       1  thing1
#>       5: 0.1720703 0.3376675     2      1       1  thing1
#>      ---                                                 
#>  999996: 0.6268181 0.7851774     1 100000       1  thing2
#>  999997: 0.6268181 0.7851774     1 100000       5  thing2
#>  999998: 0.6268181 0.7851774     1 100000       7  thing2
#>  999999: 0.6268181 0.7851774     1 100000       6  thing2
#> 1000000: 0.6268181 0.7851774     1 100000       7  thing2

Speed comparisons (similar to those shown in the preprint) are highlighted below. Note that the timings exclude the nested1 and nested2 columns of the original dt object from above. Also, all dplyr and tidyr functions use a tbl version of the dt table.

<img src="man/figures/README-unnamed-chunk-9-1.png" width="70%" />
#> # A tibble: 2 × 3
#>   expression   median mem_alloc
#>   <chr>      <bch:tm> <bch:byt>
#> 1 dt_nest      1.14ms    2.88MB
#> 2 group_nest   1.91ms    5.12MB
#> # A tibble: 2 × 3
#>   expression   median mem_alloc
#>   <chr>      <bch:tm> <bch:byt>
#> 1 dt_unnest    2.08ms   11.84MB
#> 2 unnest       2.33ms    5.96MB

Pivoting

Thanks to @markfairbanks, we now have pivoting translations to data.table::melt() and data.table::dcast(). Consider the following example (similar to the example in tidyr::pivot_longer() and tidyr::pivot_wider()):

billboard <- tidyr::billboard

# note the warning - melt is telling us what
#   it did with the various data types: logical (where there were just NAs)
#   and numeric
longer