<!-- README.md is generated from README.Rmd. Please edit that file -->

groupdata2 <a href='https://github.com/LudvigOlsen/groupdata2'><img src='man/figures/groupdata2_logo_242x280_250dpi.png' align="right" height="140" /></a>

Author: Ludvig R. Olsen ( r-pkgs@ludvigolsen.dk ) <br/> License: MIT <br/> Started: October 2016

Overview

R package for dividing data into groups.

  • Create balanced partitions and cross-validation folds.
  • Perform time series windowing and general grouping and splitting of data.
  • Balance existing groups with up- and downsampling.
  • Collapse existing groups to fewer, balanced groups.
  • Find values, or indices of values, that differ from the previous value by some threshold(s).
  • Check if two grouping factors have the same groups, memberwise.
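As a rough illustration of the balancing item in the list above, the following sketch equalizes group sizes with `downsample()`, `upsample()`, and `balance()`. The data frame `df_unbalanced` and its columns are made up for this example and are not part of the package.

```r
library(groupdata2)

set.seed(1)  # sampling is random, so fix the seed for reproducibility

# Hypothetical data with unequal group sizes: 6 cats, 3 pigs, 2 humans
df_unbalanced <- data.frame(
  species = factor(c(rep("cat", 6), rep("pig", 3), rep("human", 2))),
  age = 1:11
)

# Remove rows until every species has as many rows as the smallest group
down <- downsample(df_unbalanced, cat_col = "species")

# Add rows (sampled with replacement) until every species matches the largest group
up <- upsample(df_unbalanced, cat_col = "species")

# Bring every species to a fixed size, up- or downsampling as needed
fixed <- balance(df_unbalanced, size = 4, cat_col = "species")

table(down$species)
table(up$species)
table(fixed$species)
```

Downsampling discards data, while upsampling repeats rows, so which one is appropriate depends on how the balanced data will be used.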

Main functions

| Function | Description |
|:------------------|:------------|
| group_factor() | Divides data into groups by a wide range of methods. |
| group() | Creates grouping factor and adds to the given data frame. |
| splt() | Creates grouping factor and splits the data by these groups. |
| partition() | Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps all data points with a shared ID in the same partition. |
| fold() | Creates folds for (repeated) cross-validation. Balances a given categorical variable and/or numerical variable between folds and keeps all data points with a shared ID in the same fold. |
| collapse_groups() | Collapses existing groups into a smaller set of groups with categorical, numerical, ID, and size balancing. |
| balance() | Uses up- and/or downsampling to equalize group sizes. Can balance on ID level. See wrappers: downsample(), upsample(). |
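To make the `fold()` and `partition()` rows above more concrete, here is a minimal sketch. The data frame `df_demo` and its column names are invented for this example; the argument names (`k`, `p`, `cat_col`, `id_col`) follow the package documentation as far as I know, so check `?fold` and `?partition` before relying on them.

```r
library(groupdata2)

set.seed(1)  # fold and partition assignment is random

# Hypothetical data: 6 participants with 3 sessions each
df_demo <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  diagnosis = factor(rep(c("a", "b", "a", "b", "b", "a"), each = 3)),
  score = c(10, 24, 45, 24, 40, 67, 15, 30, 40,
            35, 50, 78, 24, 54, 62, 14, 25, 30)
)

# Create 3 folds, balancing diagnosis between folds and keeping
# all rows from the same participant in the same fold
folded <- fold(df_demo, k = 3, cat_col = "diagnosis", id_col = "participant")

# Count the number of distinct folds each participant appears in
folds_per_id <- tapply(
  as.character(folded$.folds),
  folded$participant,
  function(f) length(unique(f))
)

# Split into two partitions with the same balancing and ID constraints
parts <- partition(df_demo, p = 0.5, cat_col = "diagnosis", id_col = "participant")

# Participants that appear in both partitions (there should be none)
shared_ids <- intersect(
  as.character(parts[[1]]$participant),
  as.character(parts[[2]]$participant)
)
```

Keeping a participant's rows together prevents the same individual from appearing in both the training and the test data, which would otherwise leak information during cross-validation.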

Other tools

| Function | Description |
|:------------------------|:------------|
| all_groups_identical() | Checks whether two grouping factors contain the same groups, memberwise. |
| differs_from_previous() | Finds values, or indices of values, that differ from the previous value by some threshold(s). |
| find_starts() | Finds values or indices of values that are not the same as the previous value. |
| find_missing_starts() | Finds missing starts for the l_starts method. |
| summarize_group_cols() | Calculates summary statistics about group columns (i.e. factors). |
| summarize_balances() | Summarizes the balances of numeric, categorical, and ID columns in and between groups in one or more group columns. |
| ranked_balances() | Extracts the standard deviations from the Summary data frame from the output of summarize_balances() |
| %primes% | Finds remainder for the primes method. |
| %staircase% | Finds remainder for the staircase method. |
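Two of the helpers in the table above can be tried on plain vectors. This is a small sketch with made-up input; it assumes `find_starts()` accepts a vector and a `return_index` argument, and that `all_groups_identical()` takes two grouping vectors, as described in the package documentation.

```r
library(groupdata2)

# Hypothetical vector with runs of repeated values
v <- c("a", "a", "b", "b", "c", "a")

# Values that differ from the value before them (the first value always counts)
starts <- find_starts(v)
# "a" "b" "c" "a"

# The same, but returning positions instead of values
start_idx <- find_starts(v, return_index = TRUE)
# 1 3 5 6

# Same members per group, only the group labels differ
same <- all_groups_identical(c(1, 1, 2, 2), c("x", "x", "y", "y"))

# The second element is grouped differently here
different <- all_groups_identical(c(1, 1, 2, 2), c("x", "y", "y", "y"))
```

The values returned by `find_starts()` are in the form that the `l_starts` method of `group()` takes as group starts.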

Installation

CRAN version:

install.packages("groupdata2")

Development version:

install.packages("devtools")
devtools::install_github("LudvigOlsen/groupdata2")

Vignettes

groupdata2 contains a number of vignettes with relevant use cases and descriptions:

vignette(package = "groupdata2") # for an overview
vignette("introduction_to_groupdata2") # begin here

Data for examples

# Attach packages
library(groupdata2)
library(dplyr)       # %>% filter() arrange() summarize()
library(knitr)       # kable()
# Create small data frame
df_small <- data.frame(
  "x" = c(1:12),
  "species" = rep(c('cat', 'pig', 'human'), 4),
  "age" = sample(c(1:100), 12),
  stringsAsFactors = FALSE
)
# Create medium data frame
df_medium <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(c(20, 33, 27, 21, 32, 25), 3),
  "diagnosis" = factor(rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3)),
  "diagnosis2" = factor(sample(c('x','z','y'), 18, replace = TRUE)),
  "score" = c(10, 24, 15, 35, 24, 14, 24, 40, 30, 
              50, 54, 25, 45, 67, 40, 78, 62, 30))
df_medium <- df_medium %>% arrange(participant)
df_medium$session <- rep(c('1','2', '3'), 6)

Functions

group_factor()

Returns a factor with group numbers, e.g. factor(c(1,1,1,2,2,2,3,3,3)).

This can be used to subset, aggregate, group_by, etc.

Create equally sized groups by setting force_equal = TRUE.

Randomize the grouping factor by setting randomize = TRUE.

# Create grouping factor
group_factor(
  data = df_small, 
  n = 5, 
  method = "n_dist"
)
#>  [1] 1 1 2 2 3 3 3 4 4 5 5 5
#> Levels: 1 2 3 4 5
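The `force_equal` and `randomize` arguments mentioned above can be compared side by side. This is only a sketch on a made-up 12-row data frame (`df_demo`). Exact group assignments depend on the method and the random seed, so the code inspects group sizes rather than assuming specific output.

```r
library(groupdata2)

set.seed(1)  # randomize = TRUE shuffles the grouping factor

# Hypothetical 12-row data frame
df_demo <- data.frame(x = 1:12)

# Default: 12 rows distributed over 5 groups, so sizes are unequal
g_default <- group_factor(df_demo, n = 5, method = "n_dist")

# force_equal = TRUE discards excess rows so that all groups have the same size
g_equal <- group_factor(df_demo, n = 5, method = "n_dist", force_equal = TRUE)

# randomize = TRUE keeps the group sizes but shuffles which row gets which group
g_random <- group_factor(df_demo, n = 5, method = "n_dist", randomize = TRUE)

# Compare group sizes
sizes_default <- as.vector(table(g_default))
sizes_equal <- as.vector(table(droplevels(g_equal)))
sizes_random <- as.vector(table(g_random))
```

Because `force_equal = TRUE` drops data points, the resulting factor can be shorter than the input, which matters if the factor is attached back onto the original data frame by hand.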

group()

Creates a grouping factor and adds it to the given data frame. The data frame is grouped by the grouping factor for easy use in magrittr (%>%) pipelines.

# Use group()
group(data = df_small, n = 5, method = 'n_dist') %>%
  kable()

|   x | species | age | .groups |
|----:|:--------|----:|:--------|
|   1 | cat     |  68 | 1       |
|   2 | pig     |  39 | 1       |
|   3 | human   |   1 | 2       |
|   4 | cat     |  34 | 2       |
|   5 | pig     |  87 | 3       |
|   6 | human   |  43 | 3       |
|   7 | cat     |  14 | 3       |
|   8 | pig     |  82 | 4       |
|   9 | human   |  59 | 4       |
|  10 | cat     |  51 | 5       |
|  11 | pig     |  85 | 5       |
|  12 | human   |  21 | 5       |

# Use group() in a pipeline 
# Get average age per group
df_small %>%
  group(n = 5, method = 'n_dist') %>% 
  dplyr::summarise(mean_age = mean(age)) %>%
  kable()

| .groups | mean_age |
|:--------|---------:|
| 1       |     53.5 |
| 2       |     17.5 |
| 3       |     48.0 |
| 4       |     70.5 |
| 5       |     52.3 |

# Using group() with 'l_starts' method
# Starts group at the first 'cat', 
# then skips to the second appearance of "pig" after "cat",
# then starts at the following "cat".
df_small %>%
  group(n = list("cat", c("pig", 2), "cat"),
        method = 'l_starts',
        starts_col = "species") %>%
  kable()

|   x | species | age | .groups |
|----:|:--------|----:|:--------|
|   1 | cat     |  68 | 1       |
|   2 | pig     |  39 | 1       |
|   3 | human   |   1 | 1       |
|   4 | cat     |  34 | 1       |
|   5 | pig     |  87 | 2       |
|   6 | human   |  43 | 2       |
|   7 | cat     |  14 | 3       |
|   8 | pig     |  82 | 3       |
|   9 | human   |  59 | 3       |
|  10 | cat     |  51 | 3       |
|  11 | pig     |  85 | 3       |
|  12 | human   |  21 | 3       |

splt()

Creates the specified groups
