klrfome - Kernel Logistic Regression on Focal Mean Embeddings
<p align="center"> <img width="326" height="134" src="https://github.com/mrecos/klrfome/blob/master/klrfome_logo/KLR-black.png?raw=true"> </p>
Check out the documentation site
The purpose of this package is to solve the Distribution Regression problem for noncontiguous geospatial features. The use case documented here is modeling archaeological site locations. The aim of Distribution Regression is to map a single scalar outcome (e.g. presence/absence; 0/1) to a distribution of features. This is opposed to typical regression, where each observation maps a single outcome to a single set of features/predictors. For example, an archaeological site is singularly defined as either present or absent; however, the area within the site's boundary is not singularly defined by any one measurement. The area within an archaeological site is defined by an infinite distribution of measurements. Modeling this in traditional terms means either collapsing that distribution to a single measurement or pretending that a site is actually a series of adjacent but independent measurements. The methods developed for this package instead take a different view by modeling the distribution of measurements from within a single site on a scale of similarity to the distributions of measurements in other sites and in the environmental background in general. This method avoids collapsing measurements and promotes the assumption of independence from within a site to between sites. By doing so, this approach models a richer description of the landscape in a more intuitive sense of similarity.
To achieve this goal, the package fits a Kernel Logistic Regression (KLR) model onto a mean embedding similarity matrix and predicts as a roving focal function of varying window size. The name of the package is derived from this approach: Kernel Logistic Regression on FOcal Mean Embeddings (klrfome), pronounced "clear foam".
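The mean-embedding idea can be sketched in a few lines of R: the similarity between two "bags" of observations is the average pairwise kernel value between every observation in one bag and every observation in the other. This is an illustration only; the `rbf` and `bag_similarity` helpers are hypothetical names, a Gaussian RBF kernel is assumed, and the package's `build_K` function is what actually computes these similarities for whole datasets.

```r
## Illustration only (hypothetical helpers, Gaussian RBF assumed):
## the similarity of two bags is the mean pairwise kernel value
## between every row of A and every row of B.
rbf <- function(d, sigma) exp(-d^2 / (2 * sigma^2))

bag_similarity <- function(A, B, sigma) {
  D <- as.matrix(dist(rbind(A, B)))         # all pairwise euclidean distances
  cross <- D[seq_len(nrow(A)),              # rows of A ...
             nrow(A) + seq_len(nrow(B)),    # ... against rows of B
             drop = FALSE]
  mean(rbf(cross, sigma))                   # mean embedding similarity
}
```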
<p align="center"> <img width="800" height="600" src="https://github.com/mrecos/klrfome/blob/master/SAA_2018_poster/SAA_2018_poster_small.jpg?raw=true"> </p>(High-res versions of research poster are in the /SAA_2018_poster folder)
Citation
Please cite this package as:
Harris, Matthew D., (2017). klrfome - Kernel Logistic Regression on Focal Mean Embeddings. Accessed 10 Sep 2017. Online at https://doi.org/10.5281/zenodo.1218403
Special Thanks
This model is inspired by and borrows from Zoltán Szabó’s work on mean
embeddings Szabó et al. (2015) and Ji Zhu & Trevor Hastie’s Kernel
Logistic Regression algorithm (Zhu and Hastie 2005). I extend a hearty
thank you to Zoltán for his correspondence during the development of
this approach; it would not have been created without his help.
Further, a big thank you to Ben Markwick for his moral support and for
the rrtools package used to create this package. That being said, all
errors, oversights, and omissions are my own.
Installation
You can install klrfome from GitHub with:
# install.packages("devtools")
devtools::install_github("mrecos/klrfome")
Example workflow on simulated data (Try me!)
<p align="left"> <img src="https://github.com/mrecos/klrfome/blob/master/README_images/KLRfome_dataflow.png?raw=true"> </p>

In brief, the process below is to:

1. take a table of observations of two or more environmental variables within known sites and across the background of the study area;
2. use format_site_data() to convert that table to a list and under-sample the background data to a desired ratio (each group of observations within a site or background area is referred to in the ML literature as a "bag");
3. use the build_K() function with the sigma hyperparameter and a distance metric (default euclidean) to create a similarity matrix between all site and background bags;
4. fit the kernel logistic regression model KLR() to that similarity matrix.

Steps 3 and 4 are where this method departs most from traditional regression, but they are also what set it apart: unlike most regression, which fits a model to a table of measurements, this approach fits a model to a matrix of similarities between all units of analysis (sites and background areas).
Libraries
library("ggplot2") # for plotting results
library("NLMR") # for creating simulated landscapes
library("rasterVis") # for plotting simulated landscapes
library("pROC") # for evaluation of model AUC metric
library("dplyr") # for data manipulation
library("knitr") # for printing tables in this document
library("klrfome") # for modeling
library("sp") # for plotting raster prediction in document
Set hyperparameters and load simulated site location data
In this block, the random seed, the sigma and lambda hyperparameters,
and the dist_metric are all set. The sigma hyperparameter controls how
"close" observations must be in order to be considered similar.
Closeness in this context is defined in the feature space, not in
geographic or measurement space. At a higher sigma, more distant
observations can still be considered similar. The lambda
hyperparameter controls the regularization in the KLR model by
penalizing large coefficients; it must be greater than zero. The
higher the lambda penalty, the closer the model will shrink its alpha
parameters toward zero, thereby reducing the influence of any one
observation, or group of observations, on the overall model. These two
hyperparameters are the most critical, as they govern the flexibility
and scope of the model. Ideally, they will be set via k-fold
cross-validation, leave-one-out cross-validation, grid search, or
trial and error. Exploring these hyperparameters will likely take most
of the time and attention within the modeling process.
#Parameters
set.seed(232)
sigma = 0.5
lambda = 0.1
dist_metric = "euclidean"
Simulate site data and format
Archaeological site data is protected information that often cannot be
shared. This is a major impediment to open archaeological research.
For this package I created a function to simulate archaeological site
data so that people can test the package and see how to format their
site data. The get_sim_data function simulates N_site_bags sites drawn
from site_samples observations for two environmental variables. The
variables are normally distributed with a mean and standard deviation;
well-discriminated defaults are provided. The function simulates both
site-present and environmental-background classes, with different
means and SDs for the two variables. The returned object is a list
that contains the data in a list format for use in the KLR functions,
but also in a table format for use in most other modeling functions,
so that you can compare model results on the same data.
### Simulate Training Data
sim_data <- get_sim_data(site_samples = 800, N_site_bags = 75)
formatted_data <- format_site_data(sim_data, N_sites=10, train_test_split=0.8,
sample_fraction = 0.9, background_site_balance=1)
train_data <- formatted_data[["train_data"]]
train_presence <- formatted_data[["train_presence"]]
test_data <- formatted_data[["test_data"]]
test_presence <- formatted_data[["test_presence"]]
Build Similarity kernel, fit KLR model, and predict on training set
The first step in modeling these data is to build the similarity
kernel with build_K. The output is a pairwise similarity matrix for
each element of the list you give it, in this case train_data. The
object K is the NxN similarity matrix of the mean similarity between
the multivariate distances of each site and background list element.
These elements are often referred to as labelled bags because they are
collections of measurements with a presence or absence label. The
second step is to fit the KLR model with the KLR function. The KLR fit
function is a key component of this package; it fits a KLR model using
Iteratively Re-weighted Least Squares (IRLS). Setting verbose = 2
shows the convergence of the algorithm. The output of this function is
a list of the alpha parameters (for prediction) and the predicted
probability of site-presence for the train_data set. Finally, the
KLR_predict function uses the train_data and alphas to predict the
probability of site-presence for new test_data. The similarity matrix
can be visualized with corrplot, the predicted site-presence
probabilities for simulated site-present and background test data can
be viewed as a ggplot, and finally the parameters of the model are
saved to a list to be used for predicting on a study area raster
stack.
##### Logistic Mean Embedding KRR Model
#### Build Kernel Matrix
K <- build_K(train_data, sigma = sigma, dist_metric = dist_metric, progress = FALSE)
#### Train KLR model
train_log_pred <- KLR(K, train_presence, lambda, 100, 0.001, verbose = 2)
#> Step 1. Absolute Relative Approximate Error = 120.2567
#> Step 2. Absolute Relative Approximate Error = 9.6064
#> Step 3. Absolute Relative Approximate Error = 0.6178
#> Step 4. Absolute Relative Approximate Error = 0.0477
#> Step 5. Absolute Relative Approximate Error = 0
#> Found solution in 5 steps.
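The convergence steps printed above come from the IRLS iteration. A single step of that iteration can be sketched as a regularized Newton update; this is an illustrative reconstruction after Zhu and Hastie (2005), not the package's actual internals, and klr_irls_step is a hypothetical name.

```r
## Hedged sketch of one Newton/IRLS step for kernel logistic regression
## (after Zhu & Hastie 2005). Illustration only; KLR()'s internals may
## differ in detail.
klr_irls_step <- function(K, y, alpha, lambda) {
  p <- as.numeric(1 / (1 + exp(-K %*% alpha)))  # current presence probabilities
  W <- p * (1 - p)                              # IRLS weights
  z <- K %*% alpha + (y - p) / W                # working response
  solve(K + lambda * diag(1 / W), z)            # regularized weighted solve -> new alpha
}
```

Iterating this update until the change in alpha is below a tolerance reproduces the shrinking "Absolute Relative Approximate Error" shown above.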
#### Predict KLR model on test data
test_log_pred <- KLR_predict(test_data, train_data, dist_metric = dist_metric,
train_log_pred[["alphas"]], sigma, progress = FALSE)
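As noted earlier, sigma and lambda are best chosen by cross-validation or grid search. A minimal sketch, reusing the objects and functions from this example, scores each candidate pair by AUC on the held-out test split; the grid values and the verbose level here are assumptions, and proper k-fold CV would repeat this over multiple splits.

```r
## Hedged sketch: simple grid search over sigma and lambda, scored by
## test-set AUC via pROC. Grid values are arbitrary examples.
grid <- expand.grid(sigma = c(0.25, 0.5, 1, 2), lambda = c(0.01, 0.1, 1))
grid$auc <- NA
for (i in seq_len(nrow(grid))) {
  K_i   <- build_K(train_data, sigma = grid$sigma[i],
                   dist_metric = dist_metric, progress = FALSE)
  fit_i <- KLR(K_i, train_presence, grid$lambda[i], 100, 0.001,
               verbose = 1)  # verbose level assumed; 2 prints each step
  pred_i <- KLR_predict(test_data, train_data, dist_metric = dist_metric,
                        fit_i[["alphas"]], grid$sigma[i], progress = FALSE)
  grid$auc[i] <- as.numeric(pROC::auc(test_presence, pred_i))
}
grid[which.max(grid$auc), ]  # best-scoring hyperparameter pair
```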
### Plot K Matrix
K_corrplot(K,train_data,clusters=
