Collapse

Advanced and Fast Data Transformation in R

collapse <img src='man/figures/logo.png' width="150px" align="right" />

collapse is a large C/C++-based package for data transformation and statistical computing in R. It aims to:

Facilitate complex data transformation, exploration and computing tasks in R.
Help make R code fast, flexible, parsimonious and programmer friendly.

Its novel class-agnostic architecture supports all basic R objects and their popular extensions, including units, integer64, xts/zoo, tibble, grouped_df, data.table, sf, pseries and pdata.frame.

Key Features:

Advanced statistical programming: A full set of fast statistical functions supporting grouped and weighted computations on vectors, matrices and data frames. Fast and programmable grouping, ordering, matching, deduplication, factor generation and interactions.
Fast data manipulation: Fast and flexible functions for data manipulation, data object conversions and memory efficient R programming.
Advanced aggregation: Fast and easy multi-type, weighted and parallelized data aggregation.
Advanced transformations: Fast row/column arithmetic, (grouped) sweeping out of statistics (by reference), (grouped, weighted) scaling and (higher-dimensional) centering and averaging.
Advanced time-computations: Fast and flexible indexed time series and panel data classes, lags/leads, differences and (compound) growth rates on (irregular) time series and panels, panel-autocorrelation functions and panel data to array conversions.
List processing: Recursive list search, filtering, splitting, apply and unlisting to data frame.
Advanced data exploration: Fast (grouped, weighted, multi-level) descriptive statistical tools.
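
As a rough illustration of the transformation and time-computation features listed above, the sketch below applies grouped centering, scaling, lags, differences and growth rates to a small made-up panel. The data frame `d` and its columns `id`, `t` and `x` are hypothetical and exist only for this example; the functions `fwithin`, `fscale`, `flag`, `fdiff` and `fgrowth` are collapse functions, called here with a plain grouping vector `g` and time vector `t`.

```r
library(collapse)

# Hypothetical toy panel: two units ("a", "b") observed over three periods
d <- data.frame(id = rep(c("a", "b"), each = 3),
                t  = rep(1:3, times = 2),
                x  = c(1, 2, 4, 10, 20, 40))

# Advanced transformations: grouped centering and scaling
x_centered <- fwithin(d$x, g = d$id)   # subtract the group mean from each value
x_scaled   <- fscale(d$x, g = d$id)    # standardize within each group

# Advanced time-computations: panel lag, first difference and growth rate (in percent)
x_lag    <- flag(d$x, n = 1, g = d$id, t = d$t)
x_diff   <- fdiff(d$x, n = 1, g = d$id, t = d$t)
x_growth <- fgrowth(d$x, n = 1, g = d$id, t = d$t)

# Inspect the results side by side
cbind(d, x_centered, x_scaled, x_lag, x_diff, x_growth)
```

Because the lag is computed within each `id`, the first observation of every unit is `NA` rather than borrowing a value from the previous unit.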

collapse is written in C and C++, with algorithms much faster than base R's. It has extremely low evaluation overheads, scales well (benchmarks: linux | windows), and excels on complex statistical tasks.

Installation

# Install the current version on CRAN
install.packages("collapse")

# Install a stable development version (Windows/Mac binaries) from R-universe
install.packages("collapse", repos = "https://fastverse.r-universe.dev")

# Install a stable development version from GitHub (requires compilation)
remotes::install_github("fastverse/collapse")

# Install previous versions from the CRAN Archive (requires compilation)
install.packages("https://cran.r-project.org/src/contrib/Archive/collapse/collapse_2.0.19.tar.gz", 
                 repos = NULL, type = "source") 
# Older stable versions: 1.9.6, 1.8.9, 1.7.6, 1.6.5, 1.5.3, 1.4.2, 1.3.2, 1.2.1

Documentation

collapse installs with built-in structured documentation, implemented via a set of separate help pages. Calling help('collapse-documentation') brings up the top-level documentation page, providing an overview of the entire package and links to all other documentation pages.

In addition, there are several vignettes, including one on Documentation and Resources.

Cheatsheet

Article on arXiv

An article on collapse is forthcoming in the Journal of Statistical Software.

Presentation at useR 2022

Video Recording | Slides

Example Usage

This section provides a simple set of examples introducing some important features of collapse. It should be easy to follow for readers familiar with R.

<details> <summary><b><a style="cursor: pointer;">Click here to expand </a></b> </summary>

library(collapse)
data("iris")            # iris dataset in base R
v <- iris$Sepal.Length  # Vector
d <- num_vars(iris)     # Saving numeric variables (could also be a matrix, statistical functions are S3 generic)
g <- iris$Species       # Grouping variable (could also be a list of variables)

## Advanced Statistical Programming -----------------------------------------------------------------------------

# Simple (column-wise) statistics...
fmedian(v)                       # Vector
fsd(qM(d))                       # Matrix (qM is a faster as.matrix)
fmode(d)                         # data.frame
fmean(qM(d), drop = FALSE)       # Still a matrix
fmax(d, drop = FALSE)            # Still a data.frame

# Fast grouped and/or weighted statistics
w <- abs(rnorm(fnrow(iris)))
fmedian(d, w = w)                 # Simple weighted statistics
fnth(d, 0.75, g)                  # Grouped statistics (grouped third quartile)
fmedian(d, g, w)                  # Groupwise-weighted statistics
fsd(v, g, w)                      # Similarly for vectors
fmode(qM(d), g, w, ties = "max")  # Or matrices (grouped and weighted maximum mode) ...

# A fast set of data manipulation functions allows complex piped programming at high speeds
library(magrittr)                            # Pipe operators
iris %>% fgroup_by(Species) %>% fndistinct   # Grouped distinct value counts
iris %>% fgroup_by(Species) %>% fmedian(w)   # Weighted group medians 
iris %>% add_vars(w) %>%                     # Adding weight vector to dataset
  fsubset(Sepal.Length < fmean(Sepal.Length), Species, Sepal.Width:w) %>% # Fast selecting and subsetting
  fgroup_by(Species) %>%                     # Grouping (efficiently creates a grouped tibble)
  fvar(w) %>%                                # Frequency-weighted group-variance, default (keep.w = TRUE)  
  roworder(sum.w)                            # also saves group weights in a column called 'sum.w'

# Can also use dplyr (but dplyr manipulation verbs are a lot slower)
library(dplyr)
iris %>% add_vars(w) %>% 
  filter(Sepal.Length < fmean(Sepal.Length)) %>% 
  select(Species, Sepal.Width:w) %>% 
  group_by(Species) %>% 
  fvar(w) %>% arrange(sum.w)
  
## Fast Data Manipulation ---------------------------------------------------------------------------------------

head(GGDC10S)

# Pivot Wider: Only SUM (total)
SUM <- GGDC10S |> pivot(c("Country", "Year"), "SUM", "Variable", how = "wider")
head(SUM)

# Joining with data from wlddev
wlddev |>
    join(SUM, on = c("iso3c" = "Country", "year" = "Year"), how = "inner")

# Recast pivoting + supplying new labels for generated columns
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
      labels = list(to = "Sector",
                    new = c(Sectorcode = "GGDC10S Sector Code",
                            Sector = "Long Sector Description",
                            VA = "Value Added",
                            EMP = "Employment")), 
      how = "recast", na.rm = TRUE)

## Advanced Aggregation -----------------------------------------------------------------------------------------

collap(iris, Sepal.Length + Sepal.Width ~ Species, fmean)  # Simple aggregation using the mean..
collap(iris, ~ Species, list(fmean, fmedian, fmode))       # Multiple functions applied to eac