disk.frame <img src="inst/figures/disk.frame.png" align="right">

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
NOTICE
{disk.frame} has been soft-deprecated in favor of {arrow}. With the {arrow} 6.0.0 release, {arrow} is now quite capable of larger-than-RAM data analysis; see the {arrow} release notes. Hence, there is no strong reason to prefer {disk.frame} unless you have very specific feature needs.
For the above reason, I’ve decided to soft-deprecate {disk.frame}, which means I will no longer actively develop new features for it, but it will remain on CRAN in maintenance mode.
To help with the transition, I’ve created a function,
disk.frame::disk.frame_to_parquet(df, outdir), to help you convert
existing {disk.frame}s to the parquet format so you can use {arrow} with
them.
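A conversion might look like the sketch below; the folder paths are illustrative, and you should point them at your own disk.frame:

```r
library(disk.frame)
library(arrow)

# attach an existing disk.frame stored in the folder "flights.df"
# (hypothetical path)
df <- disk.frame("flights.df")

# write the data out as parquet files under "flights_parquet"
disk.frame_to_parquet(df, outdir = "flights_parquet")

# from here on, {arrow} can do the larger-than-RAM analysis
ds <- arrow::open_dataset("flights_parquet")
```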
I am working on a reincarnation of {disk.frame} in Julia, so {disk.frame} will live on!
Thank you for supporting {disk.frame}. I’ve learned a lot along the way, but the time has come to move on!
Introduction
How do I manipulate tabular data that doesn’t fit into Random Access Memory (RAM)?
Use {disk.frame}!
In a nutshell, {disk.frame} makes use of two simple ideas:
- split up a larger-than-RAM dataset into chunks and store each chunk in a separate file inside a folder and
- provide a convenient API to manipulate these chunks
{disk.frame} performs a similar role to distributed systems such as
Apache Spark, Python’s Dask, and Julia’s JuliaDB.jl for medium data:
datasets that are too large for RAM but not quite large enough
to qualify as big data.
Installation
You can install the released version of {disk.frame} from
CRAN with:
install.packages("disk.frame")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("DiskFrame/disk.frame")
On some platforms, such as SageMaker, you may need to explicitly specify a repository, like this:
install.packages("disk.frame", repos = "https://cran.rstudio.com")
Vignettes and articles
Please see these vignettes and articles about {disk.frame}
- Quick start: {disk.frame}, which replicates the sparklyr vignette for manipulating the nycflights13 flights data
- Ingesting data into {disk.frame}, which lists some common ways of creating disk.frames
- {disk.frame} can be more epic!, which shows some ways of loading large CSVs and the importance of srckeep
- Group-by: the various types of group-bys
- Custom one-stage group-by functions: how to define custom one-stage group-by functions
- Fitting GLMs (including logistic regression), which introduces the dfglm function for fitting generalized linear models
- Using data.table syntax with disk.frame
- disk.frame concepts
- Benchmark 1: disk.frame vs Dask vs JuliaDB
Common questions
a) What is {disk.frame} and why create it?
{disk.frame} is an R package that provides a framework for
manipulating larger-than-RAM structured tabular data on disk
efficiently. The reason one would want to manipulate data on disk is
that it allows arbitrarily large datasets to be processed by R. In other
words, we go from “R can only deal with data that fits in RAM” to “R can
deal with any data that fits on disk”. See the next section.
b) How is it different to data.frame and data.table?
A data.frame in R is an in-memory data structure, which means that R
must load the data in its entirety into RAM. A corollary of this is that
only data that can fit into RAM can be processed using data.frames.
This places significant restrictions on what R can process with minimal
hassle.
In contrast, {disk.frame} provides a framework to store and manipulate
data on the hard drive. It does this by loading only a small part of the
data, called a chunk, into RAM, processing that chunk, writing out the
results, and repeating with the next chunk. This chunking strategy is
widely applied in other packages to enable processing large amounts of
data in R; for example, see chunked,
arkdb, and
iotools.
Furthermore, there is a row-limit of 2^31 for data.frames in R; hence
an alternate approach is needed to apply R to these large datasets. The
chunking mechanism in {disk.frame} provides such an avenue to enable
data manipulation beyond the 2^31 row limit.
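The chunking strategy described above can be sketched in plain R. This hypothetical loop (the folder name and the `value` column are made up for illustration) is essentially what {disk.frame} automates and parallelizes for you:

```r
library(fst)

# hypothetical: a folder of fst chunk files making up one large dataset
chunk_files <- list.files("my_dataset", pattern = "\\.fst$", full.names = TRUE)

# process one chunk at a time, so only one chunk is ever held in RAM
results <- lapply(chunk_files, function(f) {
  chunk <- fst::read_fst(f)          # load one chunk into RAM
  subset(chunk, value > 0)           # process it; keep only the small result
})

# row-bind the per-chunk results into one in-memory data.frame
final <- do.call(rbind, results)
```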
c) How is {disk.frame} different to previous “big” data solutions for R?
R has many packages that can deal with larger-than-RAM datasets,
including ff and bigmemory. However, ff and bigmemory restrict
the user to primitive data types such as double, which means they do not
support character (string) and factor types. In contrast, {disk.frame}
makes use of data.table::data.table and data.frame directly, so all
data types are supported. Also, {disk.frame} strives to provide an API
that is as similar to data.frame’s as possible. {disk.frame}
supports many dplyr verbs for manipulating disk.frames.
Additionally, {disk.frame} supports parallel data operations using
infrastructures provided by the excellent future
package to take advantage of
multi-core CPUs. Further, {disk.frame} uses state-of-the-art data
storage techniques such as fast data compression, and random access to
rows and columns provided by the fst
package to provide superior data
manipulation speeds.
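The fst features mentioned above can also be used directly; here is a small sketch (the file name is illustrative) of fst’s compression and random access to rows and columns:

```r
library(fst)

# write a data.frame with fst's fast compression (compress ranges 0-100)
write_fst(as.data.frame(mtcars), "cars.fst", compress = 50)

# random access: read only selected columns and a row range,
# without loading the whole file into RAM
partial <- read_fst("cars.fst", columns = c("mpg", "cyl"), from = 5, to = 10)
```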
d) How does {disk.frame} work?
{disk.frame} works by breaking large datasets into smaller individual
chunks and storing the chunks in fst files inside a folder. Each chunk
is an fst file containing a data.frame/data.table. One can reconstruct
the original large dataset by loading all the chunks into RAM and
row-binding them into one large data.frame. Of course, in
practice this isn’t always possible; hence why we store them as smaller
individual chunks.
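As a concrete illustration (the output folder name is our own choice), one can see the underlying fst chunk files after creating a disk.frame:

```r
library(disk.frame)
library(nycflights13)
setup_disk.frame()

# write the flights data as a disk.frame into a folder we choose
flights.df <- as.disk.frame(nycflights13::flights,
                            outdir = "tmp_flights.df",
                            overwrite = TRUE)

# each chunk is simply an fst file inside that folder
list.files("tmp_flights.df", pattern = "\\.fst$")

nchunks(flights.df)  # how many chunks the data was split into
```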
{disk.frame} makes it easy to manipulate the underlying chunks by
implementing dplyr functions/verbs and other convenient functions
(e.g. the cmap(a.disk.frame, fn, lazy = F) function, which applies the
function fn to each chunk of a.disk.frame in parallel), so that a
disk.frame can be manipulated in a similar fashion to in-memory
data.frames.
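For instance, cmap can apply an arbitrary user-defined function to every chunk; a small sketch:

```r
library(disk.frame)
library(nycflights13)
setup_disk.frame()

flights.df <- as.disk.frame(nycflights13::flights)

# apply a custom function to every chunk in parallel;
# lazy = FALSE forces evaluation now rather than deferring it
row_counts <- cmap(flights.df,
                   function(chunk) data.frame(n = nrow(chunk)),
                   lazy = FALSE)
```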
e) How is {disk.frame} different from Spark, Dask, and JuliaDB.jl?
Spark is primarily a distributed system that also works on a single
machine. Dask is a Python package that is most similar to
{disk.frame}, and JuliaDB.jl is a Julia package. All three can
distribute work over a cluster of computers. However, {disk.frame}
currently cannot distribute data processes over many computers, and is,
therefore, single machine focused.
In R, one can access Spark via sparklyr, but that requires a Spark
cluster to be set up. On the other hand, {disk.frame} requires
zero set-up apart from running install.packages("disk.frame") or
devtools::install_github("DiskFrame/disk.frame").
Finally, Spark can only apply functions that are implemented for Spark,
whereas {disk.frame} can use any function in R including user-defined
functions.
Example usage
Set-up {disk.frame}
{disk.frame} works best if it can process multiple data chunks in
parallel. The best way to set-up {disk.frame} so that each CPU core
runs a background worker is by using
setup_disk.frame()
# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)
Calling setup_disk.frame() sets up background workers equal to the number
of CPU cores; please note that, by default, hyper-threaded cores are
counted as one, not two.
Alternatively, one may specify the number of workers using
setup_disk.frame(workers = n).
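For example, a sketch of setting an explicit worker count and checking it against the {future} backend:

```r
library(disk.frame)

# request a fixed number of background workers instead of one per core
setup_disk.frame(workers = 2)

# the {future} backend should now report that many workers
future::nbrOfWorkers()
```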
Quick-start
suppressPackageStartupMessages(library(disk.frame))
library(nycflights13)
# this will setup disk.frame's parallel backend with number of workers equal to the number of CPU cores (hyper-threaded cores are counted as one not two)
setup_disk.frame()
#> The number of workers available for disk.frame is 6
# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)
# convert the flights data.frame to a disk.frame
# optionally, you may specify an outdir; otherwise, a temporary directory is used
flights.df <- as.disk.frame(nycflights13::flights)
Example: dplyr verbs
dplyr verbs
{disk.frame} aims to support as many dplyr verbs as possible. For example
flights.df %>%
filter(year == 2013) %>%
mutate(origin_dest = paste0(origin, dest)) %>%
head(2)
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#> 1: 2013 1 1 517 515 2 830 819 11
#> 2: 2013 1 1 533 529 4 850 830 20
#> carrier flight tailnum origin dest air_time distance hour minute time_hour
#> 1: UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
#> 2: UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
#> origin_dest
#> 1: EWRIAH
#> 2: LGAIAH
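Note that many dplyr verbs on a disk.frame are evaluated lazily, chunk by chunk; a sketch of bringing the full result back into RAM with collect():

```r
library(disk.frame)
library(dplyr)
library(nycflights13)
setup_disk.frame()

flights.df <- as.disk.frame(nycflights13::flights)

# verbs are recorded lazily per chunk; collect() evaluates them and
# row-binds the per-chunk results into one in-memory data.frame
result <- flights.df %>%
  filter(dep_delay > 60) %>%
  select(year, month, day, dep_delay, origin, dest) %>%
  collect()
```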
Group-by
Starting from {disk.frame} v0.3.0, there is group_by support for a limited set of functions.
