Unheadr
Handle Data With Embedded Subheaders
Install / Use
/learn @luisDVA/UnheadrREADME
unheadr <img src="man/figures/logo.png" align="right" height="139" alt="" />
<!-- badges: start --> <!-- badges: end -->The goal of unheadr is to help wrangle data when it has embedded
subheaders, or when values are wrapped across several rows.
https://unheadr.liomys.mx/
Installation
You can install the CRAN release or the development version with:
# Install unheadr from CRAN:
install.packages("unheadr")
# Or install the development version from GitHub with:
# install.packages("remotes")
remotes::install_github("luisDVA/unheadr")
The reasoning behind the package and some of the possible uses of
unheadr are described in this publication:
Verde Arregoitia, L. D., Cooper, N., D’Elía, G. (2018). Good practices for sharing analysis-ready data in mammalogy and biodiversity research. Hystrix, the Italian Journal of Mammalogy, 29(2), 155-161. Open Access, DOI 10.4404/hystrix-00133-2018
Usage
Load the package first.
library(unheadr)
Main functions
untangle2()
untangle2() puts embedded subheaders into their own variable, using
regular expressions to identify them.
In the data below (a subset of a bundled dataset which can be loaded
with data(primates2017)), there are rows that correspond to values in
grouping variables. These should be in their own column. Instead, they
are embedded within the data rectangle. This is a common practice in
many disciplines. This data presentation looks OK and is easy to read,
but hard to work with (for example: calculating group-wise summaries).
In this example, values for an implicit “geographic region” variable and an implicit “taxonomic family” variable are embedded in the column that contains the observational units (the scientific names of various primates).
| scientific_name | common_name | red_list_status | mass_kg | |:---|:---|:---|---:| | Asia | NA | NA | NA | | CERCOPITHECIDAE | NA | NA | NA | | Trachypithecus obscurus | Dusky Langur | NT | 7.13 | | Presbytis sumatra | Black Sumatran Langur | EN | 6.00 | | Rhinopithecus roxellana | Golden Snub-nosed Monkey | EN | NA | | HYLOBATIDAE | NA | NA | NA | | Hylobates funereus | East Bornean Gray Gibbon | EN | NA | | Hylobates klossii | Kloss’s Gibbon | EN | 5.80 | | Nomascus concolor | Western Black Crested Gibbon | CR | 7.71 |
For a tidier structure, the subheaders embedded in the scientific_name
column need to be plucked out and placed in their own variable. This was
initially the main objective of unheadr and what untangle2() was
made for. The function can be used with magrittr pipes as a
dplyr-type verb.
If these subheaders can be matched in bulk with a regular expression because they share a prefix, suffix, or anything in common, we can save a lot of time. Otherwise, they can be matched by name. For more details, see the examples and vignette.
The ‘untangled’ version of the data:
| scientific_name | common_name | red_list_status | mass_kg | family | region | |:---|:---|:---|---:|:---|:---| | Trachypithecus obscurus | Dusky Langur | NT | 7.13 | CERCOPITHECIDAE | Asia | | Presbytis sumatra | Black Sumatran Langur | EN | 6.00 | CERCOPITHECIDAE | Asia | | Rhinopithecus roxellana | Golden Snub-nosed Monkey | EN | NA | CERCOPITHECIDAE | Asia | | Hylobates funereus | East Bornean Gray Gibbon | EN | NA | HYLOBATIDAE | Asia | | Hylobates klossii | Kloss’s Gibbon | EN | 5.80 | HYLOBATIDAE | Asia | | Nomascus concolor | Western Black Crested Gibbon | CR | 7.71 | HYLOBATIDAE | Asia |
Now we can easily perform grouping operations and summarize the data (e.g. calculating average body mass by Family).
unbreak_vals()
This function uses regex to fix values that are broken across two rows. This usually happens when we are formatting a table and we need to fit it on a page.
# Set up a toy dataset
dogsDesc <-
data.frame(
stringsAsFactors = FALSE,
dogs = c(
"Retriever", "(Golden)",
"Retriever", "(Labrador)", "Bulldog", "(French)"
),
coat = c("long", NA, "short", NA, "short", NA)
)
dogsDesc
#> dogs coat
#> 1 Retriever long
#> 2 (Golden) <NA>
#> 3 Retriever short
#> 4 (Labrador) <NA>
#> 5 Bulldog short
#> 6 (French) <NA>
We can match the opening brackets with regex.
unbreak_vals(df = dogsDesc, regex = "^\\(", ogcol = dogs, newcol = dogs_unbroken)
#> dogs_unbroken coat
#> 1 Retriever (Golden) long
#> 2 Retriever (Labrador) short
#> 3 Bulldog (French) short
unwrap_cols()
Use this function to unwrap and glue values that have been wrapped
across multiple rows for presentation purposes, with an inconsistent
number of empty or NA values padding out the columns.
# Set up the data
nyk <-
data.frame(
stringsAsFactors = FALSE,
player = c(
"Marcus Camby", NA, NA,
NA, NA, NA, NA, "Allan Houston", NA,
"Latrell Sprewell", NA, NA
),
listed_height_m. = c(
2.11, NA, NA, NA, NA, NA,
NA, 1.98, NA, 1.96, NA, NA
),
teams_chronological = c(
"Raptors", "Knicks",
"Nuggets", "Clippers", "Trail Blazers",
"Rockets", "Knicks", "Pistons",
"Knicks", "Warriors", "Knicks",
"Timberwolves"
),
position = c(
"Power forward", "Center",
NA, NA, NA, NA, NA,
"Shooting guard", NA, "Small forward", NA, NA
)
)
nyk
#> player listed_height_m. teams_chronological position
#> 1 Marcus Camby 2.11 Raptors Power forward
#> 2 <NA> NA Knicks Center
#> 3 <NA> NA Nuggets <NA>
#> 4 <NA> NA Clippers <NA>
#> 5 <NA> NA Trail Blazers <NA>
#> 6 <NA> NA Rockets <NA>
#> 7 <NA> NA Knicks <NA>
#> 8 Allan Houston 1.98 Pistons Shooting guard
#> 9 <NA> NA Knicks <NA>
#> 10 Latrell Sprewell 1.96 Warriors Small forward
#> 11 <NA> NA Knicks <NA>
#> 12 <NA> NA Timberwolves <NA>
Unwrap the elements in the variable that defines the groups, separating with commas.
unwrap_cols(nyk, groupingVar = player, separator = ", ")
#> # A tibble: 3 × 4
#> player listed_height_m. teams_chronological position
#> <chr> <chr> <chr> <chr>
#> 1 Marcus Camby 2.11 Raptors, Knicks, Nuggets, Clippers… Power f…
#> 2 Allan Houston 1.98 Pistons, Knicks Shootin…
#> 3 Latrell Sprewell 1.96 Warriors, Knicks, Timberwolves Small f…
unbreak_rows()
This function merges sets of two contiguous rows upwards by pasting the values of the lagging row to the values of the leading row (identified using regular expressions).
The following table of basketball records has two sets of header rows with values broken across two contiguous rows.
bball <- data.frame(
stringsAsFactors = FALSE,
v1 = c(
"Player", NA, "Sleve McDichael", "Dean Wesrey",
"Karl Dandleton", "Player",
NA,
"Mike Sernandez",
"Glenallen Mixon",
"Rey McSriff"
),
v2 = c(
"Most points", "in a game", "55", "43", "41", "Most varsity",
"games played", "111", "109",
"104"
),
v3 = c(
"Season", "(year ending)", "2001", "2000", "2010", "Season",
"(year ending)", "2005",
"2004", "2002"
)
)
unbreak_rows() merges these rows if we can match them with a common
pattern.
# Match with regex on variable v2
unbreak_rows(bball, regex = "^Most", ogcol = v2)
#> 2 matches
#> v1 v2 v3
#> 1 Player Most points in a game Season (year ending)
#> 2 Sleve McDichael 55 2001
#> 3 Dean Wesrey 43 2000
#> 4 Karl Dandleton 41 2010
#> 5 Player Most varsity games played Season (year ending)
#> 6 Mike Sernandez 111 2005
#> 7 Glenallen Mixon 109 2004
#> 8 Rey McSriff 104 2002
mash_colnames()
When column names are broken up across the top n rows of a data frame
or tibble, mash_colnames() makes many header rows into column names.
Existing names can be kept or ignored.
# Data with broken headers
babies <-
data.frame(
stringsAsFactors = FALSE,
Baby = c(NA, NA, "Angie", "Yean", "Pierre"),
Age = c("in", "months", "11", "9", "7"),
Weight = c("kg", NA, "2", "3", "4"),
Ward = c(NA, NA, "A", "B", "C")
)
babies
#> Baby Age Weight Ward
#> 1 <NA> in kg <NA>
#> 2 <NA> months <NA> <NA>
#> 3 Angie 11 2 A
#> 4 Yean 9 3 B
#> 5 Pierre 7 4 C
# Mash, including the object names
mash_colnames(babies, n_name_rows = 2, keep_
Related Skills
node-connect
351.8kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
110.9kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
351.8kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
351.8kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
