C++ Parallelized Grid-based DBSCAN
An optimized implementation of the DBSCAN algorithm in C++. The grid-based approach is accelerated with parallelism, including OpenMP, the C++ concurrency API, and a work-stealing technique, and AVX instructions are used to speed up floating-point arithmetic. The program achieves a 35.89x runtime improvement with 8 threads.
This project provides an optimized implementation of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm in C++. DBSCAN is a popular unsupervised machine learning algorithm used for clustering and outlier detection. The project includes a naive approach and a grid-based approach, and utilizes parallelism to improve the algorithm's performance.
Table of Contents
- Introduction
- Usage
- Example
- Milestones
- Parallelism
Introduction
The aim of this project is to provide an efficient and scalable implementation of the DBSCAN algorithm in C++. The grid-based approach is optimized by using a 2D grid to index data points and speeding up the neighbor search process. Parallelism is utilized to accelerate the algorithm's performance, including OpenMP and C++ Concurrency API. Additionally, we have used AVX instructions to optimize floating-point arithmetic operations in the program.
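To make the grid idea concrete, here is a minimal sketch of how 2D points can be bucketed into a grid whose cell side equals eps, so that all eps-neighbors of a point lie in its own cell or the 8 adjacent cells. The `Point` struct, `cellKey`, and `buildGrid` names are illustrative, not the project's actual API.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point { float x, y; };

// Key a 2D cell by its integer coordinates packed into one 64-bit value.
static std::uint64_t cellKey(int cx, int cy) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32) |
           static_cast<std::uint32_t>(cy);
}

// Bucket every point index into the grid; a neighbor query for point i then
// only scans the 3x3 block of cells around i instead of the whole dataset.
std::unordered_map<std::uint64_t, std::vector<int>>
buildGrid(const std::vector<Point>& pts, float eps) {
    std::unordered_map<std::uint64_t, std::vector<int>> grid;
    for (int i = 0; i < static_cast<int>(pts.size()); ++i) {
        int cx = static_cast<int>(std::floor(pts[i].x / eps));
        int cy = static_cast<int>(std::floor(pts[i].y / eps));
        grid[cellKey(cx, cy)].push_back(i);
    }
    return grid;
}
```

This turns the naive O(n) neighbor scan into a lookup over a constant number of cells, which is where the grid-based speedup comes from.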
Usage
To use this project, include the DBSCAN header (#include "dbscan.h") in your C++ project and create instances of the DBSCAN object types listed below. Each takes your data as input and returns a vector containing a cluster label for every point in your dataset.
- NaiveDBSCAN
- SerialGridDBSCAN
- OMPGridDBSCAN
- ConcurrencyGridDBSCAN
- ConcurrencyStealingGridDBSCAN
- ConcurrencyStealingAVX2GridDBSCAN
To build
mkdir -p build
cd build
cmake ..
make
Example
The main.cpp file serves as the primary example of usage, while the test.cpp file exhibits the results of the unit tests.
// Data generation: DataGenerator(dimensions, num_points)
DataGenerator generator(2, 100000);
auto cluster_data = generator.cluster_distribution(3);

// DBSCAN parameters (example values)
double eps = 0.2;
int minPts = 2;

// Run each variant on the same data
SerialGridDBSCAN serial_grid_dbscan(eps, minPts);
serial_grid_dbscan.run(cluster_data);

OMPGridDBSCAN omp_grid_dbscan(eps, minPts);
omp_grid_dbscan.run(cluster_data);

ConcurrencyGridDBSCAN concurrency_grid_dbscan(eps, minPts);
concurrency_grid_dbscan.run(cluster_data);

ConcurrencyStealingGridDBSCAN concurrency_stealing_grid_dbscan(eps, minPts);
concurrency_stealing_grid_dbscan.run(cluster_data);
Milestones
This project was developed in four milestones:
- [x] Naive Approach: Implement DBSCAN using a naive approach and benchmark its performance.
- [x] Grid-Based Approach: Optimize the algorithm by using a grid-based approach for neighbor search and benchmark its performance.
- [x] Parallelization: Parallelize the algorithm using OpenMP to speed up computation and benchmark its performance.
- [x] Parallelization: Parallelize the algorithm using C++ Concurrency API to speed up computation and benchmark its performance.
Parallelism
Optimizing Performance: Addressing the 99% Execution Time in 1% of the Code
| Procedure             | Runtime    | Percentage |
| --------------------- | ---------- | ---------- |
| assignPoints          | 0.011691   | 0.9538%    |
| mark_ingrid_corecell  | 1.68E-06   | 0.0001%    |
| mark_outgrid_corecell | 0.00318535 | 0.2599%    |
| expand                | 1.21085    | 98.7861%   |
The "expand" procedure dominates the runtime at 98.79%, making it the primary target for optimization. The other procedures are comparatively cheap: "assignPoints" at 0.95%, "mark_outgrid_corecell" at 0.26%, and "mark_ingrid_corecell" at a negligible 0.0001%. Optimizing "expand" is therefore where the most significant overall gains lie.
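For context, here is a hypothetical sketch of what an "expand" step typically looks like in DBSCAN: breadth-first growth of one cluster from a seed core point. The `neighbors(i)` adjacency and the label convention are assumptions for illustration; in the grid version these neighbors would be gathered from the 3x3 block of cells around each point, which is why this step dominates the runtime.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Grow cluster `clusterId` outward from `seed` via breadth-first search.
// labels[i] == 0 means point i is still unvisited.
void expand(int seed, int clusterId,
            std::vector<int>& labels,
            const std::vector<std::vector<int>>& neighbors,
            std::size_t minPts) {
    std::queue<int> frontier;
    frontier.push(seed);
    labels[seed] = clusterId;
    while (!frontier.empty()) {
        int p = frontier.front();
        frontier.pop();
        // Only core points (enough neighbors, counting themselves) propagate
        // the cluster further; border points are absorbed but do not expand.
        if (neighbors[p].size() + 1 < minPts) continue;
        for (int q : neighbors[p]) {
            if (labels[q] == 0) {          // unvisited: claim and enqueue
                labels[q] = clusterId;
                frontier.push(q);
            }
        }
    }
}
```

Every point visited here triggers a neighbor enumeration, so the bulk of distance computations and allocations in the program happen inside this loop.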

OpenMP and Analysis of Parallelism Bottleneck
Our initial attempt to parallelize the grid-based DBSCAN algorithm (n: 50000, clusters: 3, eps: 0.2, minPts: 2) achieved close to 2x speedup with 2 threads, but adding more threads brought no further benefit. Parallelizing grid-based DBSCAN is challenging due to load balancing, communication overhead, and the need to maintain correctness, but it can be done with careful design.
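As a rough illustration of the OpenMP approach, here is a minimal sketch of one naturally parallel step: marking core points. Each iteration writes a disjoint element, so the loop needs no synchronization; `schedule(dynamic)` helps when per-point work is uneven. This is an illustrative sketch, not the project's actual code.

```cpp
#include <cstddef>
#include <vector>

struct Pt { float x, y; };

// Count eps-neighbors of every point in parallel and flag core points.
// Without -fopenmp the pragma is ignored and the loop runs serially.
std::vector<char> markCore(const std::vector<Pt>& pts, float eps, int minPts) {
    std::vector<char> isCore(pts.size(), 0);
    const float eps2 = eps * eps;            // compare squared distances, no sqrt
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(pts.size()); ++i) {
        int count = 0;
        for (std::size_t j = 0; j < pts.size(); ++j) {
            float dx = pts[i].x - pts[j].x, dy = pts[i].y - pts[j].y;
            if (dx * dx + dy * dy <= eps2) ++count;   // includes the point itself
        }
        isCore[i] = (count >= minPts);
    }
    return isCore;
}
```

The harder part, as the profile below shows, is the expand phase, whose cluster-merging dependencies do not split into independent iterations this cleanly.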

This perf report shows how CPU cycles were distributed across functions during a dbscan run. The malloc function in libc alone accounts for 29.23% of cycles, with the dist function in the dbscan binary at 27.50%, followed by _int_free and isConnect. The operator new function in libstdc++ and cfree in libc also take a significant share. Together this suggests that memory allocation and deallocation are a major bottleneck in the program's performance.
Samples: 347K of event 'cycles:u', Event count (approx.): 282010497999
Overhead Command Shared Object Symbol
29.23% dbscan libc-2.28.so [.] malloc
27.50% dbscan dbscan [.] dist
16.92% dbscan libc-2.28.so [.] _int_free
9.06% dbscan dbscan [.] isConnect
5.84% dbscan libstdc++.so.6.0.28 [.] operator new
3.78% dbscan libc-2.28.so [.] cfree@GLIBC_2.2.5
3.09% dbscan libc-2.28.so [.] __memmove_avx_unaligned_erms
1.53% dbscan libstdc++.so.6.0.28 [.] operator delete@plt
1.12% dbscan dbscan [.] memmove@plt
0.89% dbscan libstdc++.so.6.0.28 [.] malloc@plt
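Given that malloc, _int_free, operator new, and cfree together account for roughly half the cycles, one standard remedy is to stop allocating a fresh result vector per neighbor query and reuse a scratch buffer instead. The sketch below is one illustrative way to do this; the `NeighborQuery` type and its interface are assumptions, not the project's actual API.

```cpp
#include <vector>

struct NeighborQuery {
    std::vector<int> scratch;   // grows once, then is reused across queries

    // Fill `scratch` with the indices within eps of point i. clear() keeps
    // the vector's capacity, so steady-state queries allocate nothing; the
    // caller must read the result before the next query overwrites it.
    const std::vector<int>& find(const std::vector<float>& xs,
                                 const std::vector<float>& ys,
                                 int i, float eps) {
        scratch.clear();
        const float eps2 = eps * eps;
        for (int j = 0; j < static_cast<int>(xs.size()); ++j) {
            float dx = xs[i] - xs[j], dy = ys[i] - ys[j];
            if (j != i && dx * dx + dy * dy <= eps2) scratch.push_back(j);
        }
        return scratch;
    }
};
```

In a multithreaded setting each worker would own its own `NeighborQuery` so the scratch buffers are never shared.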
The perf stat counters look healthy overall: 3.25 instructions per cycle, negligible branch misses and stalled cycles, and exactly 1.000 CPUs utilized, confirming that this baseline run is serial. The report covers time elapsed, CPU utilization, context-switches, page-faults, cycles, frontend and backend stalls, instructions, branches, and branch-misses.
Performance counter stats for './dbscan 0.2 2':
87,602.85 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
11,303 page-faults:u # 0.129 K/sec
284,992,059,313 cycles:u # 3.253 GHz (50.00%)
13,553,401 stalled-cycles-frontend:u # 0.00% frontend cycles idle (50.00%)
4,873,430,201 stalled-cycles-backend:u # 1.71% backend cycles idle (50.00%)
925,491,792,090 instructions:u # 3.25 insn per cycle
# 0.01 stalled cycles per insn (50.00%)
226,426,881,223 branches:u # 2584.698 M/sec (50.00%)
1,155,082 branch-misses:u # 0.00% of all branches (50.00%)
87.603927362 seconds time elapsed
87.006668000 seconds user
0.022818000 seconds sys
C++ Concurrency API
C++ concurrency API is a set of tools and libraries added to the C++ Standard Library to support writing concurrent and parallel code in C++. It was introduced in the C++11 standard, which was released in 2011. The API includes features like threads, mutexes, condition variables, futures, promises, and atomics, which allow developers to create and manage threads, protect shared resources, coordinate task execution, and synchronize asynchronous operations. The API is designed to be portable and efficient, but requires careful attention to detail and a good understanding of concurrency concepts.
Here are a few things the C++ concurrency API enables that OpenMP does not:
- Fine-grained control over thread creation and management: C++ concurrency allows developers to create and manage threads at a lower level of abstraction than OpenMP. This is useful when a project needs fine-tuned control over the number of threads created, the affinity of threads to specific cores, or the lifecycle of threads.
- Custom synchronization primitives: C++ concurrency provides a rich set of synchronization primitives, such as mutexes, condition variables, and atomic operations, which can be used to protect shared resources and coordinate task execution. These offer more flexibility and control than OpenMP's synchronization constructs, which are based on atomic operations and barriers.
- Custom memory management: C++ concurrency lets developers customize the memory allocation and deallocation strategies used by their concurrent code. This is useful when a project needs to optimize memory usage or avoid contention on shared memory pools.
- Custom scheduling policies: C++ concurrency provides a flexible framework for implementing custom thread scheduling policies, which can be tuned to specific applications or workloads. OpenMP, in contrast, offers a fixed set of scheduling policies that may not be optimal for every situation.
Overall, the ability to finely control thread creation, synchronization, memory management, and scheduling is a key advantage of C++ concurrency over OpenMP, especially for projects that require high-performance and low-level optimization.
As a result, we chose to implement the C++ concurrency API versions to see whether performance could be improved further, testing them in the same environment setting.