groupdata2 <a href='https://github.com/LudvigOlsen/groupdata2'><img src='man/figures/groupdata2_logo_242x280_250dpi.png' align="right" height="140" /></a>
Author: Ludvig R. Olsen ( r-pkgs@ludvigolsen.dk ) <br/> License: MIT <br/> Started: October 2016
Overview
R package for dividing data into groups.
- Create balanced partitions and cross-validation folds.
- Perform time series windowing and general grouping and splitting of data.
- Balance existing groups with up- and downsampling.
- Collapse existing groups to fewer, balanced groups.
- Find values, or indices of values, that differ from the previous value by some threshold(s).
- Check if two grouping factors have the same groups, memberwise.
Main functions
| Function | Description |
|:--------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| group_factor() | Divides data into groups by a wide range of methods. |
| group() | Creates grouping factor and adds to the given data frame. |
| splt() | Creates grouping factor and splits the data by these groups. |
| partition() | Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps all data points with a shared ID in the same partition. |
| fold() | Creates folds for (repeated) cross-validation. Balances a given categorical variable and/or numerical variable between folds and keeps all data points with a shared ID in the same fold. |
| collapse_groups() | Collapses existing groups into a smaller set of groups with categorical, numerical, ID, and size balancing. |
| balance() | Uses up- and/or downsampling to equalize group sizes. Can balance on ID level. See wrappers: downsample(), upsample(). |
Other tools
| Function | Description |
|:--------------------------|:--------------------------------------------------------------------------------------------------------------------|
| all_groups_identical() | Checks whether two grouping factors contain the same groups, memberwise. |
| differs_from_previous() | Finds values, or indices of values, that differ from the previous value by some threshold(s). |
| find_starts() | Finds values or indices of values that are not the same as the previous value. |
| find_missing_starts() | Finds missing starts for the l_starts method. |
| summarize_group_cols() | Calculates summary statistics about group columns (i.e. factors). |
| summarize_balances() | Summarizes the balances of numeric, categorical, and ID columns in and between groups in one or more group columns. |
| ranked_balances() | Extracts the standard deviations from the Summary data frame in the output of summarize_balances(). |
| %primes% | Finds remainder for the primes method. |
| %staircase% | Finds remainder for the staircase method. |
Installation
CRAN version:
install.packages("groupdata2")
Development version:
install.packages("devtools")
devtools::install_github("LudvigOlsen/groupdata2")
Vignettes
groupdata2 contains a number of vignettes with relevant use cases and
descriptions:
vignette(package = "groupdata2")  # for an overview
vignette("introduction_to_groupdata2")  # begin here
Data for examples
# Attach packages
library(groupdata2)
library(dplyr) # %>% filter() arrange() summarize()
library(knitr) # kable()
# Create small data frame
df_small <- data.frame(
  "x" = c(1:12),
  "species" = rep(c('cat', 'pig', 'human'), 4),
  "age" = sample(c(1:100), 12),
  stringsAsFactors = FALSE
)
# Create medium data frame
df_medium <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(c(20, 33, 27, 21, 32, 25), 3),
  "diagnosis" = factor(rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3)),
  "diagnosis2" = factor(sample(c('x','z','y'), 18, replace = TRUE)),
  "score" = c(10, 24, 15, 35, 24, 14, 24, 40, 30,
              50, 54, 25, 45, 67, 40, 78, 62, 30))
df_medium <- df_medium %>% arrange(participant)
df_medium$session <- rep(c('1','2', '3'), 6)
Functions
group_factor()
Returns a factor with group numbers,
e.g. factor(c(1,1,1,2,2,2,3,3,3)).
This can be used to subset, aggregate, group_by, etc.
Create equally sized groups by setting force_equal = TRUE.
Randomize the grouping factor by setting randomize = TRUE.
# Create grouping factor
group_factor(
  data = df_small,
  n = 5,
  method = "n_dist"
)
#> [1] 1 1 2 2 3 3 3 4 4 5 5 5
#> Levels: 1 2 3 4 5
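
The group sizes in the output above (2, 2, 3, 2, 3) can be reproduced with a short base R sketch. It assumes that each row index is scaled to the range of group numbers and rounded up. This is only an illustration that matches the example shown, not necessarily how groupdata2 computes the `n_dist` method internally, and the helper name `n_dist_sketch` is hypothetical.

```r
# Hypothetical helper, not part of groupdata2: assign n_rows rows to n_groups
# groups by scaling each row index to (0, n_groups] and rounding up.
n_dist_sketch <- function(n_rows, n_groups) {
  factor(ceiling(seq_len(n_rows) * n_groups / n_rows))
}

groups <- n_dist_sketch(n_rows = 12, n_groups = 5)
groups         # same group numbers as the group_factor() output above
table(groups)  # number of rows per group
```

Under this assumption the extra rows are spread across the groups rather than all landing in the first or last group.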
group()
Creates a grouping factor and adds it to the given data frame. The data
frame is grouped by the grouping factor for easy use in magrittr
(%>%) pipelines.
# Use group()
group(data = df_small, n = 5, method = 'n_dist') %>%
  kable()
| x | species | age | .groups |
|----:|:--------|----:|:--------|
| 1 | cat | 68 | 1 |
| 2 | pig | 39 | 1 |
| 3 | human | 1 | 2 |
| 4 | cat | 34 | 2 |
| 5 | pig | 87 | 3 |
| 6 | human | 43 | 3 |
| 7 | cat | 14 | 3 |
| 8 | pig | 82 | 4 |
| 9 | human | 59 | 4 |
| 10 | cat | 51 | 5 |
| 11 | pig | 85 | 5 |
| 12 | human | 21 | 5 |
# Use group() in a pipeline
# Get average age per group
df_small %>%
  group(n = 5, method = 'n_dist') %>%
  dplyr::summarise(mean_age = mean(age)) %>%
  kable()
| .groups | mean_age |
|:--------|---------:|
| 1 | 53.5 |
| 2 | 17.5 |
| 3 | 48.0 |
| 4 | 70.5 |
| 5 | 52.3 |
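
The per-group means above can be cross-checked without groupdata2 or dplyr. The sketch below copies the `age` values and group numbers from the tables in this section (the ages came from `sample()`, so a fresh run of the data setup would give different numbers) and aggregates them with base R's `tapply()`. It is meant only to show that the grouping factor behaves like any other factor when aggregating.

```r
# Values copied from the example tables in this section
age <- c(68, 39, 1, 34, 87, 43, 14, 82, 59, 51, 85, 21)
groups <- factor(c(1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5))

# Mean age per level of the grouping factor
mean_age <- tapply(age, groups, mean)
round(mean_age, 1)
```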
# Using group() with the 'l_starts' method
# Starts a group at the first "cat",
# then skips to the second appearance of "pig" after "cat",
# then starts at the following "cat".
df_small %>%
  group(n = list("cat", c("pig", 2), "cat"),
        method = 'l_starts',
        starts_col = "species") %>%
  kable()
| x | species | age | .groups |
|----:|:--------|----:|:--------|
| 1 | cat | 68 | 1 |
| 2 | pig | 39 | 1 |
| 3 | human | 1 | 1 |
| 4 | cat | 34 | 1 |
| 5 | pig | 87 | 2 |
| 6 | human | 43 | 2 |
| 7 | cat | 14 | 3 |
| 8 | pig | 82 | 3 |
| 9 | human | 59 | 3 |
| 10 | cat | 51 | 3 |
| 11 | pig | 85 | 3 |
| 12 | human | 21 | 3 |
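
To make the start-matching in the `l_starts` example concrete, here is a rough base R sketch of the logic described in the code comments: each start is searched for strictly after the previous one, and `c(value, k)` is read as "the k-th occurrence from there". The helper `resolve_starts()` is hypothetical and written only to reproduce this one example; groupdata2's own implementation may handle edge cases differently.

```r
species <- rep(c("cat", "pig", "human"), 4)

# Hypothetical helper, not part of groupdata2: walk through the requested
# starts in order and return the row index at which each new group begins.
resolve_starts <- function(values, starts) {
  position <- 0  # index of the previous start; search strictly after it
  indices <- integer(0)
  for (start in starts) {
    value <- start[[1]]
    # A bare value means "next occurrence"; c(value, k) means "k-th occurrence"
    k <- if (length(start) > 1) as.integer(start[[2]]) else 1L
    candidates <- which(values == value & seq_along(values) > position)
    position <- candidates[k]
    indices <- c(indices, position)
  }
  indices
}

start_rows <- resolve_starts(species, list("cat", c("pig", 2), "cat"))
start_rows  # rows 1, 5 and 7

# A new group begins at every start row, so a running count of the
# start flags gives the group number for each row.
groups <- cumsum(seq_along(species) %in% start_rows)
groups
```

The resulting group numbers match the `.groups` column in the table above: rows 1 to 4 form group 1, rows 5 and 6 form group 2, and rows 7 to 12 form group 3.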
splt()
Creates the specified groups
