<img src="man/figures/logo.svg" align="right" height="139" /> R package dbscan - Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms


Introduction

This R package (Hahsler, Piekenbrock, and Doran 2019) provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:

Clustering

DBSCAN: Density-based spatial clustering of applications with noise (Ester et al. 1996).
Jarvis-Patrick Clustering: Clustering using a similarity measure based on shared near neighbors (Jarvis and Patrick 1973).
SNN Clustering: Shared nearest neighbor clustering (Ertöz, Steinbach, and Kumar 2003).
HDBSCAN: Hierarchical DBSCAN with simplified hierarchy extraction (Campello et al. 2015).
FOSC: Framework for optimal selection of clusters for unsupervised and semisupervised clustering of a hierarchical cluster tree (Campello, Moulavi, and Sander 2013).
OPTICS/OPTICSXi: Ordering points to identify the clustering structure and cluster extraction methods (Ankerst et al. 1999).

Outlier Detection

LOF: Local outlier factor algorithm (Breunig et al. 2000).
GLOSH: Global-Local Outlier Score from Hierarchies algorithm (Campello et al. 2015).

Cluster Evaluation

DBCV: Density-based clustering validation (Moulavi et al. 2014).

Fast Nearest-Neighbor Search (using kd-trees)

kNN search
Fixed-radius NN search

The implementations use the kd-tree data structure (from the ANN library) for faster k-nearest neighbor search. For Euclidean distance, they are typically faster than the native R implementations (e.g., dbscan in package fpc) and the implementations in WEKA, ELKI, and Python’s scikit-learn.
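The nearest-neighbor search can also be called directly. The following is a minimal sketch, assuming the package's `kNN()` and `frNN()` functions and the iris data used later in this README; the values `k = 5` and `eps = 0.5` are arbitrary choices for illustration.

```r
library("dbscan")

# Numeric columns of iris as a matrix (same data as in the Usage section)
x <- as.matrix(iris[, 1:4])

# kNN search: ids and distances of the 5 nearest neighbors of each point
nn <- kNN(x, k = 5)
head(nn$id, 3)
head(nn$dist, 3)

# Fixed-radius NN search: all neighbors within eps of each point
fr <- frNN(x, eps = 0.5)
head(lengths(fr$id))  # neighborhood sizes differ from point to point
```

`kNN()` returns matrices with one row per point, while `frNN()` returns lists because the number of neighbors within the radius varies.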

The following R packages use dbscan: AnimalSequences, bioregion, clayringsmiletus, CLONETv2, clusterWebApp, cordillera, CPC, crosshap, crownsegmentr, CspStandSegmentation, daltoolbox, DataSimilarity, diceR, dobin, doc2vec, dPCP, emcAdr, eventstream, evprof, fastml, FCPS, flowcluster, funtimes, FuzzyDBScan, HaploVar, immunaut, karyotapR, ksharp, LLMing, LOMAR, maotai, MapperAlgo, metaCluster, metasnf, mlr3cluster, neuroim2, oclust, omicsTools, openSkies, opticskxi, OTclust, outlierensembles, outlierMBC, pagoda2, parameters, ParBayesianOptimization, performance, PiC, rcrisp, rMultiNet, seriation, sfdep, sfnetworks, sharp, smotefamily, snap, spdep, spNetwork, ssMRCD, stream, SuperCell, synr, tidySEM, VBphenoR, VIProDesign, weird

To cite package ‘dbscan’ in publications use:

Hahsler M, Piekenbrock M, Doran D (2019). “dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software, 91(1), 1-30. doi:10.18637/jss.v091.i01 https://doi.org/10.18637/jss.v091.i01.

@Article{,
  title = {{dbscan}: Fast Density-Based Clustering with {R}},
  author = {Michael Hahsler and Matthew Piekenbrock and Derek Doran},
  journal = {Journal of Statistical Software},
  year = {2019},
  volume = {91},
  number = {1},
  pages = {1--30},
  doi = {10.18637/jss.v091.i01},
}

Installation

Stable CRAN version: Install from within R with

install.packages("dbscan")

Current development version: Install from r-universe.

install.packages("dbscan",
    repos = c("https://mhahsler.r-universe.dev",
              "https://cloud.r-project.org/"))

Usage

Load the package and use the numeric variables in the iris dataset.

library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])

DBSCAN

db <- dbscan(x, eps = 0.42, minPts = 5)
db

## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.42, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 3 cluster(s) and 29 noise points.
## 
##  0  1  2  3 
## 29 48 37 36 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints

Visualize the resulting clustering (noise points are shown in black).

pairs(x, col = db$cluster + 1L)
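A common heuristic for choosing eps is to look for a knee in the sorted k-nearest-neighbor distances. The sketch below assumes the package's `kNNdist()` function and uses `k = minPts - 1`, since the point itself counts toward minPts. The dashed line at 0.42 only marks the eps value used above; it is not a computed optimum.

```r
library("dbscan")

x <- as.matrix(iris[, 1:4])
minPts <- 5

# Distance from every point to its (minPts - 1)-th nearest neighbor
d <- kNNdist(x, k = minPts - 1)

# Sorted distances: a knee in this curve suggests a candidate eps
plot(sort(d), type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(minPts - 1, "-NN distance"))
abline(h = 0.42, lty = 2)  # eps used in the DBSCAN example above
```

Points far above the knee have sparse neighborhoods and are likely to be labeled as noise for that eps.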

OPTICS

opt <- optics(x, eps = 1, minPts = 4)
opt

## OPTICS ordering/clustering for 150 objects.
## Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = NA
## Available fields: order, reachdist, coredist, predecessor, minPts, eps,
##                   eps_cl, xi

Extract a DBSCAN-like clustering from the OPTICS ordering and create a reachability plot (the DBSCAN clusters extracted at eps_cl = 0.4 are colored).

opt <- extractDBSCAN(opt, eps_cl = 0.4)
plot(opt)
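The same OPTICS ordering can also be cut with the Xi method listed in the introduction. This is a minimal sketch, assuming the package's `extractXi()` function; the steepness threshold `xi = 0.05` is an arbitrary value chosen for illustration and would need tuning for other data.

```r
library("dbscan")

x <- as.matrix(iris[, 1:4])
opt <- optics(x, eps = 1, minPts = 4)

# Extract clusters from steep areas of the reachability plot
opt_xi <- extractXi(opt, xi = 0.05)

opt_xi$clusters_xi     # the (possibly nested) clusters found by the Xi method
table(opt_xi$cluster)  # flat cluster assignment; 0 denotes noise

plot(opt_xi)           # reachability plot with the extracted clusters
```

Unlike `extractDBSCAN()`, which applies a single global threshold, the Xi method can return clusters of different densities.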

HDBSCAN

hdb <- hdbscan(x, minPts = 4)
hdb

## HDBSCAN clustering for 150 objects.
## Parameters: minPts = 4
## The clustering contains 2 cluster(s) and 0 noise points.
## 
##   1   2 
## 100  50 
## 
## Available fields: cluster, minPts, coredist, cluster_scores,
##                   membership_prob, outlier_scores, hc

Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.

plot(hdb, show_flat = TRUE)
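The fields listed in the HDBSCAN output can be inspected directly. The sketch below lists the points with the highest GLOSH scores (stored in `outlier_scores`) and, for comparison, computes LOF scores with the package's `lof()` function. The `minPts` argument of `lof()` is assumed from recent package versions, and `minPts = 5` is an arbitrary choice.

```r
library("dbscan")

x <- as.matrix(iris[, 1:4])
hdb <- hdbscan(x, minPts = 4)

# The five points with the highest GLOSH outlier scores
top <- head(order(hdb$outlier_scores, decreasing = TRUE), 5)
data.frame(
  row             = top,
  cluster         = hdb$cluster[top],
  membership_prob = hdb$membership_prob[top],
  glosh           = hdb$outlier_scores[top]
)

# LOF scores for the same data; values well above 1 suggest outliers
lof_scores <- lof(x, minPts = 5)
head(sort(lof_scores, decreasing = TRUE), 5)
```

The two scores rest on different definitions of local density, so their rankings need not agree.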

Using dbscan with tidyverse

dbscan provides tidy(), augment(), and glance() for all clustering algorithms, so they can be used easily with tidyverse, ggplot2, and tidymodels.

library(tidyverse)
db <- x %>%
    dbscan(eps = 0.42, minPts = 5)

Get cluster statistics as a tibble.

tidy(db)

## # A tibble: 4 × 3
##   cluster  size noise
##   <fct>   <int> <lgl>
## 1 0          29 TRUE 
## 2 1          48 FALSE
## 3 2          37 FALSE
## 4 3          36 FALSE
