PPQSort
PPQSort - Parallel Pattern Quicksort.
Install / Use
/learn @GabTux/PPQSortREADME
PPQSort (Parallel Pattern QuickSort)
Parallel Pattern Quicksort (PPQSort) is a efficient implementation of parallel quicksort algorithm, written by using only the C++20 features without using third party libraries (such as Intel TBB). PPQSort draws inspiration from pdqsort, BlockQuicksort and cpp11sort and adds some further optimizations.
- Focus on ease of use: Using only C++20 features, header only implementation, user-friendly API.
- Comprehensive test suite: Ensures correctness and robustness through extensive testing.
- Benchmarks shows great performance: Achieves impressive sorting times on various machines.
Integration
PPQSort is header only implementation. All the files needed are in include directory.
Add to existing CMake project using CPM.cmake:
include(cmake/CPM.cmake)
CPMAddPackage(
NAME PPQSort
GITHUB_REPOSITORY GabTux/PPQSort
VERSION 1.0.6 # change this to latest commit or release tag
)
target_link_libraries(YOUR_TARGET PPQSort::PPQSort)
Alternatively use FetchContent or just checkout the repository and add the include directory to the linker flags.
Usage
PPQSort has similiar API as std::sort, you can use ppqsort::execution::<policy> policies to specify how the sort should run.
// run parallel
ppqsort::sort(ppqsort::execution::par, input.begin(), input.end());
// Specify number of threads
ppqsort::sort(ppqsort::execution::par, input.begin(), input.end(), 16);
// provide custom comparator
ppqsort::sort(ppqsort::execution::par, input.begin(), input.end(), cmp);
// force branchless variant
ppqsort::sort(ppqsort::execution::par_force_branchless, input_str.begin(), input_str.end(), cmp);
PPQSort will by default use C++ threads, but if you prefer, you can link it with OpenMP and it will use OpenMP as a parallel backend. However you can still enforce C++ threads parallel backend even if linked with OpenMP:
#define FORCE_CPP
#include <ppqsort.h>
// ... rest of the code ...
Benchmark
We compared PPQSort with various parallel sorts. Benchmarks shows, that the PPQSort is one of the fastest parallel sorting algorithms across various input data and different machines.
| Name | Algorithm | Memory usage | External dependencies | Highlight | |---------------------------------|-------------------|--------------|----------------------------------------------------------------------|-----------------------------------------------------| | PPQSort | Quicksort | in-place | None | parallel pattern quicksort algorithm | | GCC BQS | Quicksort | in-place | OpenMP | allocating threads proportionally to subtask sizes | | cpp11sort | Quicksort | in-place | None | Header-only, C++11 compliant | | oneTBB parallel_sort | quicksort | out-place | oneTBB | Splits input to small tasks | | poolSTL sort | Quicksort | in-place | None | Header-only, C++17 compliant | | Boost block_indirect_sort | merging algorithm | out-place | Boost | Upper bounded small memory usage | | AQsort | Quicksort | in-place | OpenMP | Allows the sorting of multiple datasets at once | | MPQsort | Quicksort | in-place | OpenMP | Multiway Parallel Quicksort | | IPS4o | Samplesort | in-place | oneTBB | Divides data into buckets and sort them recursively |
Running on ARM cluster
- Fujitsu A64FX CPU
- NUMA architecture, 48 cores (4CPUs x 12cores)
Results for INT, input size was 2e9 (2 billions):
| Algorithm | Random | Ascending | Descending | Rotated | OrganPipe | Heap | Total | Rank | |-------------|----------------------------------|-------------------------------------|--------------------------------------|-----------------------------------|-------------------------------------|--------------------------------|---------------------------------|--------------------------------| | PPQSort C++ | 5.84s | 1.84s | 4.55s | 1.38s | 2.96s | 5.58s | 22.15s | 1 | | GCC BQS | 13.72s | 4.18s | 19.11s | 49.89s | 8.24s | 13.78s | 108.92s | 6 | | oneTBB | 43.66s | 0.09s | 8.62s | 13.84s | 8.12s | 43.9s | 118.23s | 9 | | poolSTL | 34.63s | 5.61s | 7.23s | 14.78s | 7.81s | 46.88s | 116.94s | 7 | | MPQsort | 13.35s | 5.74s | 5.77s | 4.67s | 7.71s | 12.87s | 50.11s | 5 | | cpp11sort | 9.58s | 2.47s | 2.66s | 5.47s | 3.42s | 9.9s | 33.5s | 3 | | AQsort | 24.72s | 3.66s | 23.14s | 21.83s | 22.6s | 25.31s | 121.26s | 8 | | Boost | 8.2s | 3.0s | 4.26s | 13.96s | 6.97s | 7.92s | 44.31s | 4 | | IPS$^4$o | 4.8s | 0.19s | 5.97s | 5.21s | 5.59s | 4.91s | 26.67s | 2 |
Summary
Extended benchmarks (detailed in forthcoming paper) shows that IPS4o (https://github.com/ips4o) often surpasses PPQSort in raw speed. However, IPS4o relies on the external library oneTBB (https://github.com/oneapi-src/oneTBB) introducing integration complexities. PPQSort steps up as a compelling alternative due to its:
- Competitive Speed: Delivers performance comparable to IPS4o on most machines.
- Hardware Agnostic: Maintains strong performance across various hardware, potentially surpassing IPS4o on specific systems, especially ARM platforms.
- Dependency-Free: No external libraries are required, simplifying integration. For applications demanding a fast, dependency-free parallel sorting solution, PPQSort is an excellent choice.
PPQSort is also used in tracy project.
Running Tests and Benchmarks
Bash script for running or building specific components:
$ scripts/build.sh all
...
$ scripts/run.sh standalone
...
Note that the benchmark's CMake file will by default download sparse matrices (around 26GB).
Im
Related Skills
node-connect
352.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
111.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
352.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
352.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
