Tidyods
Read cells from ODS files in R
Install / Use
/learn @mattkerlogue/TidyodsREADME
tidyods <img src="man/figures/tidyods_hex.png" align="right" alt="tidyods package logo" width="120" />
<!-- badges: start --> <!-- badges: end -->{tidyods} imports cells from an OpenDocument Spreadsheet (ODS) file,
and provides not just the cell’s value but also information about the
cell’s position, value types and formulas in a tidy format to allow
further programmatic analysis, investigation and manipulation. It also
provides methods to “rectify” cells back to a 2-dimensional data.frame.
Installation
You can install the development version of {tidyods} like so:
remotes::install_github("mattkerlogue/tidyods")
Usage
You can read in cells via the read_ods_cells() function. There is a
helper function ods_sheets() that can list the sheets in an ODS file.
The package includes an example ODS with two sheets: penguins and
types.
library(tidyods)
example_file <- system.file("extdata", "basic_example.ods", package = "tidyods")
ods_sheets(example_file)
#> [1] "penguins" "types" "merges"
read_ods_cells(example_file, sheet = "penguins", quiet = TRUE)
#> # A tibble: 28 × 28
#> sheet address row col cell_type is_empty value_type cell_content
#> <chr> <chr> <dbl> <dbl> <chr> <lgl> <chr> <chr>
#> 1 penguins A1 1 1 cell FALSE string species
#> 2 penguins B1 1 2 cell FALSE string female
#> 3 penguins C1 1 3 cell FALSE string bill_length_mm
#> 4 penguins D1 1 4 cell FALSE string body_mass_g
#> 5 penguins A2 2 1 cell FALSE string Adelie
#> 6 penguins B2 2 2 cell FALSE boolean FALSE
#> 7 penguins C2 2 3 cell FALSE float 40.4
#> 8 penguins D2 2 4 cell FALSE float 4043
#> 9 penguins A3 3 1 cell FALSE string Adelie
#> 10 penguins B3 3 2 cell FALSE boolean TRUE
#> # ℹ 18 more rows
#> # ℹ 20 more variables: base_value <chr>, numeric_value <dbl>,
#> # currency_symbol <chr>, boolean_value <lgl>, date_value <chr>,
#> # time_value <chr>, has_formula <lgl>, formula <chr>, has_error <lgl>,
#> # error_type <dbl>, has_annotation <lgl>, annotation <chr>, is_merged <lgl>,
#> # merge_colspan <dbl>, merge_rowspan <dbl>, merge_shape <chr>,
#> # cell_style <chr>, row_style <chr>, col_style <chr>, …
The penguins sheet is a simple 6 rows by 4 columns sheet that stores
the output of the following code:
palmerpenguins::penguins |>
tidyr::drop_na() |>
dplyr::group_by(species, female = sex == "female") |>
dplyr::summarise(
dplyr::across(c(bill_length_mm, body_mass_g), ~mean(.x, na.rm = TRUE)),
.groups = "drop"
)
#> # A tibble: 6 × 4
#> species female bill_length_mm body_mass_g
#> <fct> <lgl> <dbl> <dbl>
#> 1 Adelie FALSE 40.4 4043.
#> 2 Adelie TRUE 37.3 3369.
#> 3 Chinstrap FALSE 51.1 3939.
#> 4 Chinstrap TRUE 46.6 3527.
#> 5 Gentoo FALSE 49.5 5485.
#> 6 Gentoo TRUE 45.6 4680.
There are functions to “rectify” a {tidyods} data.frame back to a
traditional 2-dimensional array, the function read_ods_sheet()
combines read_ods_cells() and the rectify functions to easily import a
dataset.
penguin_sheet <- read_ods_sheet(example_file, "penguins", quick = TRUE, quiet = TRUE)
penguin_sheet
#> # A tibble: 7 × 4
#> x1 x2 x3 x4
#> <chr> <chr> <chr> <chr>
#> 1 species female bill_length_mm body_mass_g
#> 2 Adelie FALSE 40.3904109589041 4043.49315068493
#> 3 Adelie TRUE 37.2575342465753 3368.83561643836
#> 4 Chinstrap FALSE 51.0941176470588 3938.97058823529
#> 5 Chinstrap TRUE 46.5735294117647 3527.20588235294
#> 6 Gentoo FALSE 49.4737704918033 5484.83606557377
#> 7 Gentoo TRUE 45.5637931034483 4679.74137931035
The types sheet shows examples of the different ODS data types:
types_cells <- read_ods_cells(example_file, "types", quiet = TRUE)
types_cells |>
dplyr::filter(row > 1) |>
dplyr::group_by(col) |>
dplyr::glimpse()
#> Rows: 150
#> Columns: 28
#> Groups: col [10]
#> $ sheet <chr> "types", "types", "types", "types", "types", "t…
#> $ address <chr> "A2", "B2", "C2", "D2", "E2", "F2", "G2", "H2",…
#> $ row <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,…
#> $ col <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6…
#> $ cell_type <chr> "cell", "cell", "cell", "cell", "cell", "cell",…
#> $ is_empty <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ value_type <chr> "string", "boolean", "currency", "date", "time"…
#> $ cell_content <chr> "Cell", "TRUE", "£1.20", "15/06/22", "13:24:56"…
#> $ base_value <chr> "Cell", "true", "1.2", "2022-06-15", "PT13H24M5…
#> $ numeric_value <dbl> NA, NA, 1.2000, NA, NA, NA, 12034.5679, 0.5467,…
#> $ currency_symbol <chr> NA, NA, "GBP", NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ boolean_value <lgl> NA, TRUE, NA, NA, NA, NA, NA, NA, NA, NA, NA, F…
#> $ date_value <chr> NA, NA, NA, "2022-06-15", NA, "2022-06-15T13:24…
#> $ time_value <chr> NA, NA, NA, NA, "PT13H24M56S", NA, NA, NA, NA, …
#> $ has_formula <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ formula <chr> NA, NA, NA, NA, NA, NA, NA, NA, "of:=[.G2]*[.H2…
#> $ has_error <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ error_type <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 7, NA, NA, …
#> $ has_annotation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ annotation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Test c…
#> $ is_merged <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ merge_colspan <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ merge_rowspan <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ merge_shape <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ cell_style <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ row_style <chr> "ro1", "ro1", "ro1", "ro1", "ro1", "ro1", "ro1"…
#> $ col_style <chr> "co6", "co7", "co8", "co9", "co10", "co11", "co…
#> $ col_default_cell_style <chr> "Default", "ce15", "ce4", "ce11", "ce21", "ce31…
types_cells |>
dplyr::filter(row > 1) |>
dplyr::group_by(col) |>
dplyr::slice_head(n = 2) |>
dplyr::select(-cell_type)
#> # A tibble: 20 × 27
#> # Groups: col [10]
#> sheet address row col is_empty value_type cell_content base_value
#> <chr> <chr> <dbl> <dbl> <lgl> <chr> <chr> <chr>
#> 1 types A2 2 1 FALSE string Cell Cell
#> 2 types A3 3 1 FALSE string Cell with comment Cell with …
#> 3 types B2 2 2 FALSE boolean TRUE true
#> 4 types B3 3 2 FALSE boolean FALSE false
#> 5 types C2 2 3 FALSE currency £1.20 1.2
#> 6 types C3 3 3 FALSE currency £1.20 1.2
#> 7 types D2 2 4 FALSE date 15/06/22 2022-06-15
#> 8 types D3 3 4 FALSE date 06/15/22 2022-06-15
#> 9 types E2 2 5 FALSE time 13:24:56 PT13H24M56S
#> 10 types E3 3 5 FALSE time 13:24 PT13H24M56S
#> 11 types F2 2 6 FALSE date 15/06/2022 13:24:56 2022-06-15…
#> 12 types F3 3 6 FALSE date 15/06/22 13:24 2022-06-15…
#> 13 types G2 2 7 FALSE float 12035 12034.56789
#> 14 types G3 3 7 FALSE float 12034.57 12034.56789
#> 15 types H2 2 8 FALSE float 0.5467 0.5467
#> 16 types H3 3 8 FALSE percentage 55% 0.5467
#> 17 types I2 2 9 FALSE float 6579.3 6579.29826…
#> 18 types I3 3 9 FALSE float 6579.3 6579.29826…
#> 19 types J2 2 10 FALSE string #N/A #N/A
#> 20 types J3 3 10 FALSE string #DIV/0! #DIV/0!
#> # ℹ 19 more variables: numeric_value <dbl>, currency_symbol <chr>,
#> # boolean_value <lgl>, date_value <chr>, time_value <chr>, has_formula <lgl>,
#> # formula <chr>, has_error <lgl>, error_type <dbl>, has_annotation <lgl>,
#> # annotation <chr>, is_merged <lgl>, merge_colspan <dbl>,
#> # merge_rowspan <dbl>, merge_shape <chr>, cell_style <chr>, row_style <chr>,
#> # col_style <chr>, col_default_cell_style <chr>
Performance
An ODS file is a zipped collection of XML files and associated files.
{tidyods}, like the {readODS} package, uses the
{xml2} package to process this file.
The main aim of {tidyods} is to extract a large set of information
about cells, not just just their location and value, and while slower
than {readODS} for small and medium sized files users are unlikely to
see noticeable differences when using `read_od
