Advanced and Fast Data Transformation in R
collapse <img src='man/figures/logo.png' width="150px" align="right" />
collapse is a large C/C++-based package for data transformation and statistical computing in R. It aims to:
- Facilitate complex data transformation, exploration and computing tasks in R.
- Help make R code fast, flexible, parsimonious and programmer friendly.
Its novel class-agnostic architecture supports all basic R objects and their popular extensions, including units, integer64, xts/zoo, tibble, grouped_df, data.table, sf, pseries and pdata.frame.
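As a minimal sketch of what class-agnostic means in practice, the snippet below applies the same functions to a vector, a matrix and a data frame built from the base R `iris` data. The object names `d` and `m` are illustrative only, and the example covers just these three basic types, not the extension classes listed above.

```r
library(collapse)

# Numeric columns of iris, once as a data frame and once as a matrix
d <- num_vars(iris)
m <- qM(d)

# The same statistical function works on vectors, matrices and data frames
fmean(iris$Sepal.Length)  # a single number
fmean(m)                  # one mean per matrix column
fmean(d)                  # one mean per data frame column

# Transformations return an object of the same basic type as the input
is.matrix(fscale(m))
is.data.frame(fscale(d))
```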
Key Features:
- Advanced statistical programming: A full set of fast statistical functions supporting grouped and weighted computations on vectors, matrices and data frames. Fast and programmable grouping, ordering, matching, deduplication, factor generation and interactions.
- Fast data manipulation: Fast and flexible functions for data manipulation, data object conversions and memory efficient R programming.
- Advanced aggregation: Fast and easy multi-type, weighted and parallelized data aggregation.
- Advanced transformations: Fast row/column arithmetic, (grouped) sweeping out of statistics (by reference), (grouped, weighted) scaling and (higher-dimensional) centering and averaging.
- Advanced time-computations: Fast and flexible indexed time series and panel data classes, lags/leads, differences and (compound) growth rates on (irregular) time series and panels, panel-autocorrelation functions and panel data to array conversions.
- List processing: Recursive list search, filtering, splitting, apply and unlisting to data frame.
- Advanced data exploration: Fast (grouped, weighted, multi-level) descriptive statistical tools.
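The time-computation features in the list above could be sketched as follows. The toy panel (`id`, `time`, `x`) is hypothetical data made up for illustration. The example uses only plain vectors and does not touch the indexed series classes.

```r
library(collapse)

# Hypothetical toy panel: two units ("a", "b") observed over three periods
id   <- c("a", "a", "a", "b", "b", "b")
time <- rep(1:3, 2)
x    <- c(10, 12, 15, 100, 110, 121)

# Lag, first difference and growth rate (in percent), computed within each unit
x_lag    <- flag(x, 1, id, time)
x_diff   <- fdiff(x, 1, 1, id, time)
x_growth <- fgrowth(x, 1, 1, id, time)

data.frame(id, time, x, x_lag, x_diff, x_growth)
```

Passing the panel identifier and the time variable explicitly keeps the first observation of each unit from borrowing a value from the preceding unit; those positions are filled with `NA`.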
collapse is written in C and C++, with algorithms much faster than base R's, has extremely low evaluation overheads, scales well (benchmarks: linux | windows), and excels on complex statistical tasks. <!--, such as weighted statistics, mode/counting/deduplication, joins, pivots, panel data. Optimized R code ensures minimal evaluation overheads. , but imports C/C++ functions from *fixest*, *weights*, *RcppArmadillo*, and *RcppEigen* for certain statistical tasks. -->
Installation
# Install the current version on CRAN
install.packages("collapse")
# Install a stable development version (Windows/Mac binaries) from R-universe
install.packages("collapse", repos = "https://fastverse.r-universe.dev")
# Install a stable development version from GitHub (requires compilation)
remotes::install_github("fastverse/collapse")
# Install previous versions from the CRAN Archive (requires compilation)
install.packages("https://cran.r-project.org/src/contrib/Archive/collapse/collapse_2.0.19.tar.gz",
                 repos = NULL, type = "source")
# Older stable versions: 1.9.6, 1.8.9, 1.7.6, 1.6.5, 1.5.3, 1.4.2, 1.3.2, 1.2.1
Documentation
collapse installs with built-in structured documentation, implemented via a set of separate help pages. Calling help('collapse-documentation') brings up the top-level documentation page, which provides an overview of the entire package and links to all other documentation pages.
In addition, there are several vignettes, among them one on Documentation and Resources.
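A short sketch of how these entry points might be reached from an R session, assuming collapse is installed. `vignette()` is the standard utility from base R's utils package, not a collapse function.

```r
library(collapse)

# Open the top-level documentation page mentioned above
help("collapse-documentation")

# List the vignettes shipped with the installed version of the package
vigs <- vignette(package = "collapse")
vigs
```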
Cheatsheet
<a href="https://raw.githubusercontent.com/fastverse/collapse/master/misc/collapse%20cheat%20sheet/collapse_cheat_sheet.pdf"><img src="https://raw.githubusercontent.com/fastverse/collapse/master/misc/collapse%20cheat%20sheet/preview/page1.png" width="330"/></a> <!-- height="227" 294 --> <a href="https://raw.githubusercontent.com/fastverse/collapse/master/misc/collapse%20cheat%20sheet/collapse_cheat_sheet.pdf"><img src="https://raw.githubusercontent.com/fastverse/collapse/master/misc/collapse%20cheat%20sheet/preview/page2.png" width="330"/></a>
Article on arXiv
An article on collapse is forthcoming in the Journal of Statistical Software.
Presentation at useR 2022
Example Usage
This section provides a simple set of examples introducing some important features of collapse. It should be easy to follow for readers familiar with R.
<details> <summary><b><a style="cursor: pointer;">Click here to expand </a></b> </summary>

library(collapse)
data("iris") # iris dataset in base R
v <- iris$Sepal.Length # Vector
d <- num_vars(iris) # Saving numeric variables (could also be a matrix, statistical functions are S3 generic)
g <- iris$Species # Grouping variable (could also be a list of variables)
## Advanced Statistical Programming -----------------------------------------------------------------------------
# Simple (column-wise) statistics...
fmedian(v) # Vector
fsd(qM(d)) # Matrix (qM is a faster as.matrix)
fmode(d) # data.frame
fmean(qM(d), drop = FALSE) # Still a matrix
fmax(d, drop = FALSE) # Still a data.frame
# Fast grouped and/or weighted statistics
w <- abs(rnorm(fnrow(iris)))
fmedian(d, w = w) # Simple weighted statistics
fnth(d, 0.75, g) # Grouped statistics (grouped third quartile)
fmedian(d, g, w) # Groupwise-weighted statistics
fsd(v, g, w) # Similarly for vectors
fmode(qM(d), g, w, ties = "max") # Or matrices (grouped and weighted maximum mode) ...
# A fast set of data manipulation functions allows complex piped programming at high speeds
library(magrittr) # Pipe operators
iris %>% fgroup_by(Species) %>% fndistinct # Grouped distinct value counts
iris %>% fgroup_by(Species) %>% fmedian(w) # Weighted group medians
iris %>% add_vars(w) %>% # Adding weight vector to dataset
  fsubset(Sepal.Length < fmean(Sepal.Length), Species, Sepal.Width:w) %>% # Fast selecting and subsetting
  fgroup_by(Species) %>% # Grouping (efficiently creates a grouped tibble)
  fvar(w) %>% # Frequency-weighted group-variance, default (keep.w = TRUE)
  roworder(sum.w) # also saves group weights in a column called 'sum.w'
# Can also use dplyr (but dplyr manipulation verbs are a lot slower)
library(dplyr)
iris %>% add_vars(w) %>%
  filter(Sepal.Length < fmean(Sepal.Length)) %>%
  select(Species, Sepal.Width:w) %>%
  group_by(Species) %>%
  fvar(w) %>% arrange(sum.w)
## Fast Data Manipulation ---------------------------------------------------------------------------------------
head(GGDC10S)
# Pivot Wider: Only SUM (total)
SUM <- GGDC10S |> pivot(c("Country", "Year"), "SUM", "Variable", how = "wider")
head(SUM)
# Joining with data from wlddev
wlddev |>
  join(SUM, on = c("iso3c" = "Country", "year" = "Year"), how = "inner")
# Recast pivoting + supplying new labels for generated columns
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
      labels = list(to = "Sector",
                    new = c(Sectorcode = "GGDC10S Sector Code",
                            Sector = "Long Sector Description",
                            VA = "Value Added",
                            EMP = "Employment")),
      how = "recast", na.rm = TRUE)
## Advanced Aggregation -----------------------------------------------------------------------------------------
collap(iris, Sepal.Length + Sepal.Width ~ Species, fmean) # Simple aggregation using the mean..
collap(iris, ~ Species, list(fmean, fmedian, fmode)) # Multiple functions applied to eac
