Computation of Spatial Data by Hierarchical and Objective Partitioning of Inputs for Parallel Processing <a href="https://docs.ropensci.org/chopin/"><img src="man/figures/logo.svg" align="right" height="210" alt="overlapping irregular grid polygons filled with orange, green, and teal" /></a>

Objective

This package automates parallelization in spatial operations with chopin functions as well as sf/terra functions. With GDAL-compatible files and database tables, chopin functions help to calculate spatial variables from vector and raster data with no external software requirements. All who need to perform geospatial operations with large datasets may find this package useful to accelerate the covariate calculation process for further analysis and modeling. We assume that users have basic knowledge of geographic information system data models, coordinate systems and transformations, spatial operations, and raster-vector overlay.

Overview

chopin encapsulates the parallel processing of spatial computation into three steps. First, users define the parallelization strategy, which is one of the many supported by the future and future.mirai packages. Users must always register parallel workers with future before running the par_*() functions introduced below.

future::plan(future.mirai::mirai_multisession, workers = 4L)
# future::multisession and future::cluster are also available.
# See future.batchtools and future.callr for other options.
# The number of workers is up to the user.
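
As a quick check that workers are registered before calling any par_*() function, the following minimal sketch sets a plan, queries the worker count, and then resets to sequential processing. The future::multisession backend and the two workers are assumptions chosen for illustration, not recommendations.

```r
library(future)

# Register two background R sessions (illustrative worker count)
plan(multisession, workers = 2L)

# Number of workers provided by the current plan
n_workers <- nbrOfWorkers()
print(n_workers)

# Return to sequential processing and release the workers when done
plan(sequential)
```

Resetting the plan at the end avoids leaving idle R sessions behind after the parallel work is finished.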

Second, users choose the proper data parallelization configuration by creating a grid partition of the processing extent, defining the field name with values that are hierarchically coded, or entering multiple raster file paths into par_multirasters(). Finally, users run a par_*() function with the configuration set above to compute spatial variables from input data in parallel:

par_grid: parallelize over artificial grid polygons that are generated from the maximum extent of inputs. par_pad_grid is used to generate the grid polygons before running this function.
par_hierarchy: parallelize over hierarchy coded in identifier fields (for example, census blocks in each county in the US)
par_multirasters: parallelize over multiple raster files
Each of the par_* functions introduced above has a mirai version with the suffix _mirai appended to the function name: par_grid_mirai, par_hierarchy_mirai, and par_multirasters_mirai. These functions work properly after creating daemons with mirai::daemons.

mirai::daemons(4L)
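
The daemon lifecycle can be sketched as below: start daemons, send one trivial task to confirm they respond, and shut them down. The daemon count and the toy expression are assumptions for illustration only; the *_mirai functions of chopin are not called here.

```r
library(mirai)

# Start two local daemons (illustrative count)
daemons(2L)

# Send a trivial task to a daemon and wait for it to finish
m <- mirai(1 + 1)
call_mirai(m)
result <- m$data
print(result)

# Shut the daemons down when the work is done
daemons(0L)
```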

For grid partitioning, the entire study area is divided into partly overlapping grids. Two flowcharts help users choose which function to use for parallel processing. The upper flowchart is raster-oriented and the lower is vector-oriented, and they complement each other: a user who follows the raster-oriented flowchart may continue to the vector-oriented flowchart from each of its end points.

From version 0.9.5, chopin supports H3 and DGGRID in par_pad_grid(). Users can utilize each grid system with a proper resolution to improve the efficiency of spatial operations.
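
The idea of partly overlapping grids can be illustrated with sf alone. The sketch below is conceptual and does not reproduce what par_pad_grid() does internally; the 2 x 2 layout and the 10 km padding distance are assumptions chosen for illustration.

```r
library(sf)

# North Carolina counties shipped with sf, projected to a metric CRS
nc <- sf::st_transform(
  sf::read_sf(system.file("shape/nc.shp", package = "sf")),
  "EPSG:5070"
)

# 2 x 2 non-overlapping grid over the bounding box of the input
grid_original <- sf::st_make_grid(nc, n = c(2, 2))

# Pad every cell by 10 km so that neighbouring cells overlap
grid_padded <- sf::st_buffer(grid_original, dist = 1e4)

# Compare cell areas before and after padding (square kilometers)
data.frame(
  original_km2 = as.numeric(sf::st_area(grid_original)) / 1e6,
  padded_km2 = as.numeric(sf::st_area(grid_padded)) / 1e6
)
```

The overlap matters for buffer-based calculations: a point near a cell edge still finds all of the data within its buffer inside the padded cell.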

Processing functions accept terra/sf classes for spatial data. Raster-vector overlay is done with exactextractr. Three helper functions encapsulate multiple geospatial data calculation steps over multiple CPU threads.

extract_at: extract raster values with point buffers or polygons with or without kernel weights
summarize_sedc: calculate sums of exponentially decaying contributions
summarize_aw: area-weighted covariates based on target and reference polygons
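
To make the idea behind summarize_sedc concrete, the base R sketch below computes a sum of exponentially decaying contributions for a single target location. The decay form exp(-distance / bandwidth), the helper name sedc_sketch, and the toy values are all assumptions for illustration; the exact weighting used by summarize_sedc is not stated in this README.

```r
# Hypothetical helper: sum of source values weighted by exp(-distance / bandwidth)
sedc_sketch <- function(values, distances, bandwidth) {
  stopifnot(length(values) == length(distances), bandwidth > 0)
  sum(values * exp(-distances / bandwidth))
}

# Three sources at 0, 1,000, and 5,000 meters from the target location
values <- c(10, 10, 10)
distances <- c(0, 1000, 5000)

sedc_value <- sedc_sketch(values, distances, bandwidth = 1000)
print(sedc_value)
```

Sources far from the target contribute little to the sum, which is why a search radius of a few bandwidths is usually enough in this kind of calculation.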

Function selection guide

We provide two flowcharts to help users choose the right function for parallel processing. The raster-oriented flowchart is for users who want to start with raster data, and the vector-oriented flowchart is for users with large vector data.

In raster-oriented selection, we suggest four factors to consider:

Number of raster files: for multiple files, par_multirasters is recommended. When there are multiple rasters that share the same extent and resolution, consider stacking the rasters into a multilayer SpatRaster object by calling terra::rast(filenames).
Raster resolution: We suggest 100 meters as a threshold. Rasters with a resolution coarser than 100 meters and only a few layers are better handled by calling exactextractr::exact_extract() directly.
Raster extent: Using SpatRaster in exactextractr::exact_extract() is often minimally affected by the raster extent.
Memory size: max_cells_in_memory argument value of exactextractr::exact_extract(), raster resolution, and the number of layers in SpatRaster are multiplicatively related to the memory usage.
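
The stacking and memory considerations above can be sketched with terra on toy data. The raster dimensions, file names, and the assumption of 8 bytes per cell are illustrative only; actual memory use depends on the data type and on how exactextractr::exact_extract() reads the data.

```r
library(terra)

# Two small rasters with identical extent and resolution (toy data)
r1 <- terra::rast(
  nrows = 100, ncols = 100,
  xmin = 0, xmax = 1000, ymin = 0, ymax = 1000,
  crs = "EPSG:5070"
)
r1 <- terra::setValues(r1, runif(terra::ncell(r1)))
r2 <- r1 * 2

# Write them to temporary GeoTIFF files to mimic a multi-file input
files <- file.path(tempdir(), c("layer1.tif", "layer2.tif"))
terra::writeRaster(r1, files[1], overwrite = TRUE)
terra::writeRaster(r2, files[2], overwrite = TRUE)

# Stack the files into one multilayer SpatRaster
stacked <- terra::rast(files)
n_layers <- terra::nlyr(stacked)
print(n_layers)

# Rough in-memory size if every cell were read as an 8-byte double
approx_mb <- terra::ncell(stacked) * n_layers * 8 / 1024^2
print(approx_mb)
```

Because the cell count and the layer count multiply, halving the cell size of a raster roughly quadruples this estimate for the same extent.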

For vector-oriented selection, we suggest three factors to consider:

Number of features: When the number of features is over 100,000, consider using par_grid or par_hierarchy to split the data into smaller chunks.
Hierarchical structure: If the data has a hierarchical structure, consider using par_hierarchy to parallelize the operation.
Data grouping: If the data needs to be grouped in similar sizes, consider using par_pad_balanced or par_pad_grid with mode = "grid_quantile".
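
The hierarchical case can be pictured with base R alone. The sketch below splits a toy table by the prefix of a hierarchically coded identifier; the column names, the identifier values, and the five-character county prefix are hypothetical, and the code only illustrates the grouping idea rather than what par_hierarchy does internally.

```r
# Toy table of block-like units with hierarchical identifiers:
# the first five characters stand for the county, the rest for the block
blocks <- data.frame(
  geoid = c("370010001", "370010002", "370030001", "370030002", "370030003"),
  value = c(1.2, 3.4, 2.2, 0.7, 5.1)
)

# Derive the upper level of the hierarchy from the identifier prefix
blocks$county <- substr(blocks$geoid, 1, 5)

# One chunk per county; each chunk could be sent to a separate worker
chunks <- split(blocks, blocks$county)
chunk_sizes <- vapply(chunks, nrow, integer(1))
print(chunk_sizes)
```

Chunks produced this way can differ widely in size, which is the situation where the balanced grouping options mentioned above become relevant.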

Installation

From version 0.9.4, chopin is available on CRAN.

install.packages("chopin")

chopin can also be installed from GitHub using remotes::install_github (pak::pak or devtools::install_github work as well).

rlang::check_installed("remotes")
remotes::install_github("ropensci/chopin")

Alternatively, set repos in install.packages() to the rOpenSci R-universe repository:

# A more recent version is available from the rOpenSci R-universe
install.packages("chopin", repos = "https://ropensci.r-universe.dev")
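
After installing, it can be useful to confirm which version is present, since H3 and DGGRID grids in par_pad_grid() are described above as available from version 0.9.5. The following is a minimal sketch using base R utilities only; the messages are illustrative.

```r
# Report whether chopin is installed and which documented features apply
chopin_installed <- requireNamespace("chopin", quietly = TRUE)

if (chopin_installed) {
  chopin_version <- utils::packageVersion("chopin")
  message("chopin ", chopin_version, " is installed")
  # H3 and DGGRID grids are documented from version 0.9.5
  message("H3/DGGRID grid support: ", chopin_version >= "0.9.5")
} else {
  message("chopin is not installed; see the installation options above")
}
```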

Examples

The examples walk through the par_grid, par_hierarchy, and par_multirasters functions in chopin to parallelize geospatial operations.

# check and install packages to run examples
pkgs <- c("chopin", "dplyr", "sf", "terra", "future", "future.mirai", "mirai", "h3r", "dggridR")
# install packages if anything is unavailable
rlang::check_installed(pkgs)

library(chopin)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(sf)
#> Linking to GEOS 3.12.2, GDAL 3.11.3, PROJ 9.4.1; sf_use_s2() is TRUE
library(terra)
#> terra 1.8.60
library(future)
library(future.mirai)
library(mirai)
library(h3r)
#> Loading required package: h3lib
#> 
#> Attaching package: 'h3r'
#> The following object is masked from 'package:terra':
#> 
#>     gridDistance
library(dggridR)

# disable spherical geometries
sf::sf_use_s2(FALSE)
#> Spherical geometry (s2) switched off

# parallelization-safe random number generator
set.seed(2024, kind = "L'Ecuyer-CMRG")

`par_grid`: parallelize over artificial grid polygons

The small example below extracts mean altitude values at circular point buffers and census tracts in North Carolina. Before running the code chunks below, set the cloned chopin repository as your working directory with setwd().

ncpoly <- system.file("shape/nc.shp", package = "sf")
ncsf <- sf::read_sf(ncpoly)
ncsf <- sf::st_transform(ncsf, "EPSG:5070")
plot(sf::st_geometry(ncsf))

Generate random points in NC

Ten thousand random point locations are generated inside the counties of North Carolina.

ncpoints <- sf::st_sample(ncsf, 1e4)
ncpoints <- sf::st_as_sf(ncpoints)
ncpoints$pid <- sprintf("PID-%05d", seq(1, 1e4))
plot(sf::st_geometry(ncpoints))

Target raster dataset: Shuttle Radar Topography Mission

We use an elevation dataset with a moderate spatial resolution (approximately 400 meters or 0.25 miles).

# da
