Chopin
Computation of Spatial Data by Hierarchical and Objective Partitioning of Inputs for Parallel Processing http://doi.org/10.1016/j.softx.2025.102167
Computation of Spatial Data by Hierarchical and Objective Partitioning of Inputs for Parallel Processing <a href="https://docs.ropensci.org/chopin/"><img src="man/figures/logo.svg" align="right" height="210" alt="overlapping irregular grid polygons filled with orange, green, and teal" /></a>
Objective
This package automates parallelization in spatial operations with
chopin functions as well as sf/terra functions. With GDAL-compatible
files and database tables, chopin functions help to calculate spatial
variables from vector and raster data with no external software
requirements. Anyone who performs geospatial operations on large
datasets may find this package useful for accelerating covariate
calculation ahead of further analysis and modeling. We assume that
users have basic knowledge of geographic information system data
models, coordinate systems and transformations, spatial operations, and
raster-vector overlay.
Overview
chopin encapsulates the parallel processing of spatial computation
into three steps. First, users define the parallelization strategy,
which is one of the many supported by the future and future.mirai
packages. Users always need to register parallel workers with future
before running the par_*() functions that will be introduced below.
future::plan(future.mirai::mirai_multisession, workers = 4L)
# future::multisession, future::cluster are available,
# See future.batchtools and future.callr for other options
# the number of workers is up to the user
Second, users choose the proper data parallelization configuration
by creating a grid partition of the processing extent, defining the
field name with values that are hierarchically coded, or entering
multiple raster file paths into par_multirasters(). Finally, users
run a par_*() function with the configurations set above to compute
spatial variables from input data in parallel:
- par_grid: parallelize over artificial grid polygons that are generated from the maximum extent of inputs. par_pad_grid is used to generate the grid polygons before running this function.
- par_hierarchy: parallelize over the hierarchy coded in identifier fields (for example, census blocks in each county in the US)
- par_multirasters: parallelize over multiple raster files

Each of the par_* functions introduced above has a mirai version with the suffix _mirai after the function name: par_grid_mirai, par_hierarchy_mirai, and par_multirasters_mirai. These functions work properly after daemons are created with mirai::daemons.
mirai::daemons(4L)
For grid partitioning, the entire study area is divided into partly overlapping grids. Two flowcharts, given in the function selection guide below, help users decide which function to use for parallel processing. The upper flowchart is raster-oriented and the lower is vector-oriented. They complement each other: a user who follows the raster-oriented flowchart may need to consult the vector-oriented one at each of its end points.
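The idea of partly overlapping grids can be sketched in base R without chopin. The helper make_padded_grid below is hypothetical and is not part of chopin; its name, arguments, and the toy extent are assumptions for illustration. It splits a bounding box into nx by ny cells and expands each cell by a padding distance, so that a buffer drawn around a point near a cell edge still falls inside the padded cell.

```r
# Hypothetical helper (not a chopin function): build a regular grid over an
# extent, plus a padded copy of each cell that overlaps its neighbors.
make_padded_grid <- function(xmin, ymin, xmax, ymax, nx, ny, padding) {
  xs <- seq(xmin, xmax, length.out = nx + 1L)
  ys <- seq(ymin, ymax, length.out = ny + 1L)
  # one row per cell: column index i, row index j
  idx <- expand.grid(i = seq_len(nx), j = seq_len(ny))
  original <- data.frame(
    cell_id = seq_len(nrow(idx)),
    xmin = xs[idx$i], xmax = xs[idx$i + 1L],
    ymin = ys[idx$j], ymax = ys[idx$j + 1L]
  )
  # expand every cell outward by the padding distance on all four sides
  padded <- original
  padded$xmin <- original$xmin - padding
  padded$xmax <- original$xmax + padding
  padded$ymin <- original$ymin - padding
  padded$ymax <- original$ymax + padding
  list(original = original, padded = padded)
}

# toy extent in arbitrary projected units
grids <- make_padded_grid(0, 0, 100, 60, nx = 4, ny = 3, padding = 5)
nrow(grids$original)
head(grids$padded, 3)
```

In this sketch, points would be assigned to the original cells, while the data they are compared against would be cropped with the padded cells, so that results near cell borders are not truncated.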
From version 0.9.5, chopin supports H3 and
DGGRID in par_pad_grid(). Users can
utilize each grid system with a proper resolution to improve the
efficiency of spatial operations.
Processing functions accept terra/sf classes for spatial data.
Raster-vector overlay is done with exactextractr. Three helper
functions encapsulate multiple geospatial data calculation steps over
multiple CPU threads.
- extract_at: extract raster values with point buffers or polygons with or without kernel weights
- summarize_sedc: calculate sums of exponentially decaying contributions
- summarize_aw: calculate area-weighted covariates based on target and reference polygons
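As a rough illustration of what a sum of exponentially decaying contributions means, the base R sketch below weights each source value by exp(-3 * d / r), where d is the distance from a target point and r is a decay range. This weighting form, the function name sedc_sketch, and the toy coordinates are assumptions for illustration only; summarize_sedc may differ in details such as distance thresholds.

```r
# Hypothetical helper (not a chopin function): SEDC for a single target point.
# Assumes planar coordinates and the weight exp(-3 * distance / decay_range).
sedc_sketch <- function(target_xy, source_xy, source_value, decay_range) {
  dx <- source_xy[, 1] - target_xy[1]
  dy <- source_xy[, 2] - target_xy[2]
  d <- sqrt(dx^2 + dy^2)  # Euclidean distance to each source
  sum(source_value * exp(-3 * d / decay_range))
}

# three made-up sources at distances 0, 5, and 10 from the target
sources <- cbind(x = c(0, 3, 10), y = c(0, 4, 0))
values <- c(2, 1, 5)
sedc_value <- sedc_sketch(c(0, 0), sources, values, decay_range = 15)
sedc_value
```

Under this weighting, a source located exactly one decay range away contributes exp(-3), or about 5 percent, of its value.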
Function selection guide
We provide two flowcharts to help users choose the right function for parallel processing. The raster-oriented flowchart is for users who want to start with raster data, and the vector-oriented flowchart is for users with large vector data.
In raster-oriented selection, we suggest four factors to consider:
- Number of raster files: for multiple files, par_multirasters is recommended. When there are multiple rasters that share the same extent and resolution, consider stacking the rasters into a multilayer SpatRaster object by calling terra::rast(filenames).
- Raster resolution: We suggest 100 meters as a threshold. Rasters with a resolution coarser than 100 meters and only a few layers are better handled by a direct call to exactextractr::exact_extract().
- Raster extent: The performance of exactextractr::exact_extract() with a SpatRaster is often minimally affected by the raster extent.
- Memory size: The max_cells_in_memory argument value of exactextractr::exact_extract(), the raster resolution, and the number of layers in the SpatRaster are multiplicatively related to memory usage.
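The multiplicative relationship can be made concrete with back-of-the-envelope arithmetic. The sketch below is an assumption, not a measurement: it supposes that every cell held in memory is stored as an 8-byte double and ignores all overhead. The function name estimate_gib and the example values are hypothetical.

```r
# Rough upper bound under a stated assumption: cells are 8-byte doubles.
# max_cells_in_memory caps the cells read at once; layers multiply that cap.
estimate_gib <- function(max_cells_in_memory, n_layers, bytes_per_cell = 8) {
  max_cells_in_memory * n_layers * bytes_per_cell / 1024^3
}

one_layer <- estimate_gib(3e7, n_layers = 1)
twelve_layers <- estimate_gib(3e7, n_layers = 12)
c(one_layer = one_layer, twelve_layers = twelve_layers)
```

Raster resolution enters through the number of cells covered by each polygon: at finer resolution the same polygon spans more cells, so the cap is reached more often. When several parallel workers run at once, each worker holds its own share of memory.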

For vector-oriented selection, we suggest three factors to consider:
- Number of features: When the number of features is over 100,000, consider using par_grid or par_hierarchy to split the data into smaller chunks.
- Hierarchical structure: If the data has a hierarchical structure, consider using par_hierarchy to parallelize the operation.
- Data grouping: If the data needs to be grouped in similar sizes, consider using par_pad_balanced or par_pad_grid with mode = "grid_quantile".
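Splitting by a hierarchically coded field can be sketched in base R. The example assumes 11-character identifiers whose first five characters identify the parent unit, as in US census tract GEOIDs, where the leading five digits are the state and county codes. The identifiers themselves are made up, and this is only a picture of the grouping step, not of how par_hierarchy is implemented.

```r
# Made-up tract-like identifiers: characters 1-5 encode the parent (county) unit.
ids <- c(
  "37001020100", "37001020200",
  "37003040100",
  "37005090100", "37005090200"
)

# derive the higher-level code and group features by it
parent <- substr(ids, 1, 5)
chunks <- split(ids, parent)

# each chunk is a unit of work that could be sent to one parallel worker
lengths(chunks)
```

Chunks built this way follow real administrative boundaries, so their sizes can be quite uneven; when similar sizes matter, the data grouping factor in the list above becomes the more relevant one.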

Installation
From version 0.9.4, chopin is available on CRAN.
install.packages("chopin")
chopin can also be installed from GitHub using remotes::install_github
(pak::pak or devtools::install_github work as well).
rlang::check_installed("remotes")
remotes::install_github("ropensci/chopin")
Alternatively, set repos in install.packages() to the rOpenSci
repository:
# A more recent version is available from the rOpenSci universe
install.packages("chopin", repos = "https://ropensci.r-universe.dev")
Examples
The examples walk through the par_grid, par_hierarchy, and
par_multirasters functions in chopin to parallelize geospatial
operations.
# check and install packages to run examples
pkgs <- c("chopin", "dplyr", "sf", "terra", "future", "future.mirai", "mirai", "h3r", "dggridR")
# install packages if anything is unavailable
rlang::check_installed(pkgs)
library(chopin)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(sf)
#> Linking to GEOS 3.12.2, GDAL 3.11.3, PROJ 9.4.1; sf_use_s2() is TRUE
library(terra)
#> terra 1.8.60
library(future)
library(future.mirai)
library(mirai)
library(h3r)
#> Loading required package: h3lib
#>
#> Attaching package: 'h3r'
#> The following object is masked from 'package:terra':
#>
#> gridDistance
library(dggridR)
# disable spherical geometries
sf::sf_use_s2(FALSE)
#> Spherical geometry (s2) switched off
# parallelization-safe random number generator
set.seed(2024, kind = "L'Ecuyer-CMRG")
par_grid: parallelize over artificial grid polygons
The small example below extracts mean altitude values at circular
point buffers and census tracts in North Carolina. Before running the
code chunks below, set the cloned chopin repository as your working
directory with setwd().
ncpoly <- system.file("shape/nc.shp", package = "sf")
ncsf <- sf::read_sf(ncpoly)
ncsf <- sf::st_transform(ncsf, "EPSG:5070")
plot(sf::st_geometry(ncsf))
<img src="man/figures/README-read-nc-1.png" width="100%" />
Generate random points in NC
Ten thousand random point locations are generated inside the counties of North Carolina.
ncpoints <- sf::st_sample(ncsf, 1e4)
ncpoints <- sf::st_as_sf(ncpoints)
ncpoints$pid <- sprintf("PID-%05d", seq(1, 1e4))
plot(sf::st_geometry(ncpoints))
<img src="man/figures/README-gen-ncpoints-1.png" width="100%" />
Target raster dataset: Shuttle Radar Topography Mission
We use an elevation dataset with a moderate spatial resolution (approximately 400 meters or 0.25 miles).
# da
