
Dbscan

An optimized implementation of the DBSCAN algorithm in C++. The grid-based approach is parallelized with OpenMP and the C++ concurrency API, using work stealing for load balancing, and AVX instructions optimize the floating-point arithmetic. The program achieves a 35.89x runtime improvement with 8 threads.

Install / Use

/learn @weilun-chiu/Dbscan

README

C++ Parallelized Grid-based DBSCAN

This project provides an optimized implementation of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm in C++. DBSCAN is a popular unsupervised machine learning algorithm used for clustering and outlier detection. The project includes a naive approach and a grid-based approach, and utilizes parallelism to improve the algorithm's performance.

Introduction

The aim of this project is to provide an efficient and scalable implementation of the DBSCAN algorithm in C++. The grid-based approach is optimized by using a 2D grid to index data points and speeding up the neighbor search process. Parallelism is utilized to accelerate the algorithm's performance, including OpenMP and C++ Concurrency API. Additionally, we have used AVX instructions to optimize floating-point arithmetic operations in the program.
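The grid-indexing idea above can be sketched as follows. This is a hedged illustration, not the project's actual code: point structs, the cell-key scheme, and all names here are hypothetical. With cells of side eps, any point within eps of a query must lie in the 3x3 block of cells around it, so a radius query scans only those cells instead of the whole dataset.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Point { double x, y; };

// Hypothetical sketch: bucket 2D points into eps-sized cells so a radius
// query only has to scan the 3x3 block of cells around the query point.
struct Grid {
    double eps;
    // cell key -> indices of points falling in that cell
    std::unordered_map<long long, std::vector<std::size_t>> cells;

    static long long key(long long cx, long long cy) {
        return cx * 1000003LL + cy;  // simple pairing; assumes moderate extents
    }
    long long cellOf(double v) const {
        return static_cast<long long>(std::floor(v / eps));
    }
    void build(const std::vector<Point>& pts) {
        for (std::size_t i = 0; i < pts.size(); ++i)
            cells[key(cellOf(pts[i].x), cellOf(pts[i].y))].push_back(i);
    }
    // Indices of all points within eps of p (including p itself).
    std::vector<std::size_t> neighbors(const std::vector<Point>& pts,
                                       const Point& p) const {
        std::vector<std::size_t> out;
        long long cx = cellOf(p.x), cy = cellOf(p.y);
        for (long long dx = -1; dx <= 1; ++dx)
            for (long long dy = -1; dy <= 1; ++dy) {
                auto it = cells.find(key(cx + dx, cy + dy));
                if (it == cells.end()) continue;
                for (std::size_t i : it->second) {
                    double ddx = pts[i].x - p.x, ddy = pts[i].y - p.y;
                    if (ddx * ddx + ddy * ddy <= eps * eps) out.push_back(i);
                }
            }
        return out;
    }
};
```

This turns each neighbor query from O(n) into work proportional to the occupancy of nine cells, which is what makes the grid-based approach faster than the naive one.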

Usage

To use this project, add the DBSCAN header (#include "dbscan.h") to your C++ project and create instances of the DBSCAN class types listed below. Each takes your data as input, and the result is a vector containing a cluster label for every point in your dataset.

  • NaiveDBSCAN
  • SerialGridDBSCAN
  • OMPGridDBSCAN
  • ConcurrencyGridDBSCAN
  • ConcurrencyStealingGridDBSCAN
  • ConcurrencyStealingAVX2GridDBSCAN
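The AVX2 variant in the list above presumably vectorizes the hot distance computation. A minimal sketch of the idea, not the project's actual kernel (the function name and signature are hypothetical): compute four squared 2-D distances per instruction using 256-bit double lanes. The target attribute lets GCC/Clang compile this without a global -mavx2 flag, but the CPU must still support AVX2 and FMA at runtime.

```cpp
#include <immintrin.h>

// Squared distances from (qx, qy) to four points xs[0..3], ys[0..3],
// written to out[0..3]. One fused multiply-add per lane for dx*dx + dy*dy.
__attribute__((target("avx2,fma")))
void dist2_x4(double qx, double qy,
              const double* xs, const double* ys, double* out) {
    __m256d dx = _mm256_sub_pd(_mm256_loadu_pd(xs), _mm256_set1_pd(qx));
    __m256d dy = _mm256_sub_pd(_mm256_loadu_pd(ys), _mm256_set1_pd(qy));
    __m256d d2 = _mm256_fmadd_pd(dy, dy, _mm256_mul_pd(dx, dx));
    _mm256_storeu_pd(out, d2);
}
```

Comparing squared distances against eps*eps also avoids a square root per candidate, which is the usual companion trick to vectorization here.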

To build

mkdir -p build
cd build
cmake ..
make

Example

The main.cpp file serves as the primary example of usage, while the test.cpp file exhibits the results of the unit tests.

    // Data generation: 100,000 two-dimensional points in 3 clusters
    DataGenerator generator(2, 100000);
    auto cluster_data = generator.cluster_distribution(3);

    // DBSCAN parameters (the values used in the benchmarks below)
    double eps = 0.2;
    int minPts = 2;

    SerialGridDBSCAN serial_grid_dbscan(eps, minPts);
    serial_grid_dbscan.run(cluster_data);

    OMPGridDBSCAN omp_grid_dbscan(eps, minPts);
    omp_grid_dbscan.run(cluster_data);

    ConcurrencyGridDBSCAN concurrency_grid_dbscan(eps, minPts);
    concurrency_grid_dbscan.run(cluster_data);

    ConcurrencyStealingGridDBSCAN concurrency_stealing_grid_dbscan(eps, minPts);
    concurrency_stealing_grid_dbscan.run(cluster_data);

Milestones

This project has four milestones:

  • [x] Naive Approach: Implement DBSCAN using a naive approach and benchmark its performance.
  • [x] Grid-Based Approach: Optimize the algorithm by using a grid-based approach for neighbor search and benchmark its performance.
  • [x] Parallelization: Parallelize the algorithm using OpenMP to speed up computation and benchmark its performance.
  • [x] Parallelization: Parallelize the algorithm using C++ Concurrency API to speed up computation and benchmark its performance.

Parallelism

Optimizing Performance: Addressing the 99% Execution Time in 1% of the Code

| Procedure             | Runtime    | Percentage |
| --------------------- | ---------- | ---------- |
| assignPoints          | 0.011691   | 0.9538%    |
| mark_ingrid_corecell  | 1.68E-06   | 0.0001%    |
| mark_outgrid_corecell | 0.00318535 | 0.2599%    |
| expand                | 1.21085    | 98.7861%   |

The profile shows that the expand procedure dominates, accounting for 98.7861% of the runtime, which makes it the primary target for optimization. The remaining procedures are comparatively cheap: assignPoints takes 0.9538%, mark_outgrid_corecell 0.2599%, and mark_ingrid_corecell a negligible 0.0001%. Optimizing expand is therefore where the most significant overall improvement can be found.
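It helps to see why expand dominates. The sketch below is a hedged illustration of what a DBSCAN expand step typically does (the signature and names are hypothetical, not the project's): a breadth-first growth of the cluster from a core point, where every dequeued point triggers a full neighbor query.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Grow cluster `cluster` from core point `seed`. `label` holds -1 for
// unvisited points; `neighborsOf` is the radius query.
void expand(std::size_t seed, int cluster, std::vector<int>& label,
            const std::function<std::vector<std::size_t>(std::size_t)>& neighborsOf,
            std::size_t minPts) {
    std::queue<std::size_t> frontier;
    frontier.push(seed);
    label[seed] = cluster;
    while (!frontier.empty()) {
        std::size_t p = frontier.front();
        frontier.pop();
        auto nbrs = neighborsOf(p);          // the hot call in the profile
        if (nbrs.size() < minPts) continue;  // border point: claim but don't grow
        for (std::size_t n : nbrs)
            if (label[n] < 0) {              // unvisited
                label[n] = cluster;
                frontier.push(n);
            }
    }
}
```

Because the loop performs one neighbor query per cluster member, expand inherits the full cost of neighbor search for nearly every point in the dataset, which matches the 98.8% share seen in the table.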


OpenMP and Analysis of Parallelism Bottleneck

Our initial attempt to parallelize the grid-based DBSCAN algorithm (n: 50000, clusters: 3, eps: 0.2, minPts: 2) achieved close to 2x speedup with 2 threads, but adding more threads brought no further benefit. Parallelizing grid-based DBSCAN is challenging due to load balancing, communication overhead, and maintaining correctness, but it can be done with careful consideration.


This perf report shows the percentage of CPU cycles spent in different functions during execution of the dbscan program. malloc in libc tops the list at 29.23% of the cycles, followed by the program's own dist function at 27.50%, then _int_free and isConnect. operator new in libstdc++ and cfree in libc also account for a significant share. Together, this suggests that memory allocation and deallocation are a bottleneck in the program's performance.

Samples: 347K of event 'cycles:u', Event count (approx.): 282010497999
Overhead  Command  Shared Object        Symbol
  29.23%  dbscan   libc-2.28.so         [.] malloc     
  27.50%  dbscan   dbscan               [.] dist     
  16.92%  dbscan   libc-2.28.so         [.] _int_free   
   9.06%  dbscan   dbscan               [.] isConnect     
   5.84%  dbscan   libstdc++.so.6.0.28  [.] operator new   
   3.78%  dbscan   libc-2.28.so         [.] cfree@GLIBC_2.2.5   
   3.09%  dbscan   libc-2.28.so         [.] __memmove_avx_unaligned_erms  
   1.53%  dbscan   libstdc++.so.6.0.28  [.] operator delete@plt      
   1.12%  dbscan   dbscan               [.] memmove@plt          
   0.89%  dbscan   libstdc++.so.6.0.28  [.] malloc@plt        
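A common remedy for malloc/_int_free hot spots like these (a sketch of the general technique, not necessarily the project's actual change; all names here are hypothetical) is to have the neighbor query append into a caller-owned buffer that is cleared and reused, instead of returning a freshly allocated std::vector on every call. After warm-up the buffer's capacity stabilizes and the inner loop stops allocating.

```cpp
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

// Write indices of points within sqrt(eps2) of q into `out`.
// clear() keeps existing capacity, so repeated calls reuse the same storage.
void neighborsInto(const std::vector<Pt>& pts, const Pt& q, double eps2,
                   std::vector<std::size_t>& out) {
    out.clear();
    for (std::size_t i = 0; i < pts.size(); ++i) {
        double dx = pts[i].x - q.x, dy = pts[i].y - q.y;
        if (dx * dx + dy * dy <= eps2) out.push_back(i);
    }
}
```

Giving each thread its own buffer (reserved once up front) also avoids contention on the shared allocator, which matters once the scan is parallelized.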

The perf stat output looks normal: the performance counters report elapsed time, CPU utilization, context switches, page faults, cycles, frontend and backend stalled cycles, instructions, branches, and branch misses, with nothing unusual standing out.

 Performance counter stats for './dbscan 0.2 2':

         87,602.85 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            11,303      page-faults:u             #    0.129 K/sec                  
   284,992,059,313      cycles:u                  #    3.253 GHz                      (50.00%)
        13,553,401      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (50.00%)
     4,873,430,201      stalled-cycles-backend:u  #    1.71% backend cycles idle      (50.00%)
   925,491,792,090      instructions:u            #    3.25  insn per cycle         
                                                  #    0.01  stalled cycles per insn  (50.00%)
   226,426,881,223      branches:u                # 2584.698 M/sec                    (50.00%)
         1,155,082      branch-misses:u           #    0.00% of all branches          (50.00%)

      87.603927362 seconds time elapsed

      87.006668000 seconds user
       0.022818000 seconds sys

C++ Concurrency API

The C++ concurrency API is a set of tools added to the C++ Standard Library to support writing concurrent and parallel code. Introduced in the C++11 standard (released in 2011), it includes threads, mutexes, condition variables, futures, promises, and atomics, which let developers create and manage threads, protect shared resources, coordinate task execution, and synchronize asynchronous operations. The API is designed to be portable and efficient, but requires careful attention to detail and a solid understanding of concurrency concepts.

Here are a few examples of what the C++ concurrency API can do that OpenMP cannot:

  • Fine-grained control over thread creation and management: C++ concurrency allows developers to create and manage threads at a lower level of abstraction than OpenMP. This can be useful when a project requires fine-tuned control over the number of threads created, the affinity of threads to specific cores, or the lifecycle of threads.

  • Custom synchronization primitives: C++ concurrency provides a rich set of synchronization primitives, such as mutexes, condition variables, and semaphores, which can be used to protect shared resources and coordinate the execution of tasks. These primitives offer more flexibility and control than the OpenMP synchronization constructs, which are based on atomic operations and barriers.

  • Custom memory management: C++ concurrency allows developers to customize the memory allocation and deallocation strategies used by their concurrent code. This can be useful when a project needs to optimize memory usage or avoid contention on shared memory pools.

  • Custom scheduling policies: C++ concurrency provides a flexible framework for implementing custom thread scheduling policies, which can be used to optimize the performance of specific applications or workloads. OpenMP, in contrast, provides a fixed set of scheduling policies that may not be optimal for all situations.

Overall, the ability to finely control thread creation, synchronization, memory management, and scheduling is a key advantage of C++ concurrency over OpenMP, especially for projects that require high-performance and low-level optimization.
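The futures-and-tasks style described above can be illustrated with a minimal, generic sketch (not this project's actual scheduler): split a range across std::async tasks and join the partial results through futures.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sum a vector by dividing it into nTasks chunks, one std::async task each.
long long parallelSum(const std::vector<int>& v, unsigned nTasks) {
    if (nTasks == 0) nTasks = 1;
    std::size_t chunk = (v.size() + nTasks - 1) / nTasks;
    std::vector<std::future<long long>> parts;
    for (unsigned t = 0; t < nTasks; ++t) {
        std::size_t lo = t * chunk, hi = std::min(v.size(), lo + chunk);
        if (lo >= hi) break;
        parts.push_back(std::async(std::launch::async, [&v, lo, hi] {
            return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
        }));
    }
    long long total = 0;
    for (auto& f : parts) total += f.get();  // blocks until each task is done
    return total;
}
```

The same partition-and-join shape applies to cluster expansion: each task handles a slice of grid cells and the futures synchronize the merge, with none of OpenMP's pragma machinery involved.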

As a result, we chose to implement the algorithm with the C++ concurrency API to check whether performance could be improved further. We tested the performance on this environment setting
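The Stealing variants in the class list suggest a work-stealing scheduler for balancing the uneven cluster-expansion work across threads. Below is a hedged, simplified sketch of the idea (the structure and names are hypothetical; it uses one mutex per deque, whereas production schedulers use lock-free Chase-Lev deques): each worker pops tasks from the front of its own deque and, when that runs dry, steals from the back of a peer's deque.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct StealPool {
    struct Q { std::mutex m; std::deque<int> d; };
    std::vector<Q> qs;  // one task deque per worker
    explicit StealPool(unsigned n) : qs(n) {}

    // Pop from the hot (front) end of this worker's own deque.
    bool popLocal(unsigned self, int& task) {
        std::lock_guard<std::mutex> g(qs[self].m);
        if (qs[self].d.empty()) return false;
        task = qs[self].d.front();
        qs[self].d.pop_front();
        return true;
    }
    // Steal from the cold (back) end of some other worker's deque.
    bool steal(unsigned self, int& task) {
        for (unsigned i = 0; i < qs.size(); ++i) {
            if (i == self) continue;
            std::lock_guard<std::mutex> g(qs[i].m);
            if (qs[i].d.empty()) continue;
            task = qs[i].d.back();
            qs[i].d.pop_back();
            return true;
        }
        return false;
    }
};
```

Stealing from the opposite end of the victim's deque reduces contention with the victim's own pops and tends to move larger, colder chunks of work, which is why the technique handles DBSCAN's irregular cluster sizes well.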
