Miniselect
Selection and partial sorting algorithms
Install / Use
/learn @danlark1/MiniselectREADME
miniselect: Generic selection and partial ordering algorithms
miniselect is a C++ header-only library that contains various generic selection
and partial sorting algorithms with the ease of use, testing, advice on usage and
benchmarking.
Sorting is everywhere and there are many outstanding sorting algorithms that compete in speed, comparison count and cache friendliness. However, selection algorithms are always a bit outside of the competition scope, they are pretty important, for example, in databases ORDER BY LIMIT N is used extremely often which can benefit from more optimal selection and partial sorting algorithms. This library tries to solve this problem with Modern C++.
- Easy: First-class, easy to use dependency and carefully documented APIs and algorithm properties.
- Fast: We do care about speed of the algorithms and provide reasonable implementations.
- Standard compliant: We provide C++11 compatible APIs that are compliant to the standard
std::nth_elementandstd::partial_sortfunctions including custom comparators and order guarantees. Just replace the names of the functions in your project and it should work! - Well tested: We test all algorithms with a unified framework, under sanitizers and fuzzing.
- Benchmarked: We gather benchmarks for all implementations to better understand good and bad spots.
Table of Contents
- Quick Start
- Testing
- Documentation
- Performance results
- Real-world usage
- Contributing
- Motivation
- License
Quick Start
You can either include this project as a cmake dependency and then use the headers that are provided in the include folder or just pass the include folder to your compiler.
#include <iostream>
#include <vector>
#include "miniselect/median_of_ninthers.h"
int main() {
std::vector<int> v = {1, 8, 4, 3, 2, 9, 0, 7, 6, 5};
miniselect::median_of_ninthers_select(v.begin(), v.begin() + 5, v.end());
for (const int i : v) {
std::cout << i << ' ';
}
return 0;
}
// Compile it `clang++/g++ -I$DIRECTORY/miniselect/include/ example.cpp -std=c++11 -O3 -o example
// Possible output: 0 1 4 3 2 5 8 7 6 9
// ^ on the right place
Examples can be found in examples.
We support all compilers starting from GCC 7 and Clang 6. We are also planning to support Windows, for now it is best effort but no issues are known so far.
More on which algorithms are available, see documentation. For overview of this work you can read the article in the author's blog.
Testing
To test and benchmark, we use Google benchmark library. Simply do in the root directory:
# Check out the libraries.
$ git clone https://github.com/google/benchmark.git
$ git clone https://github.com/google/googletest.git
$ mkdir build && cd build
$ cmake -DMINISELECT_TESTING=on -DBENCHMARK_ENABLE_GTEST_TESTS=off -DBENCHMARK_ENABLE_TESTING=off ..
$ make -j
$ ctest -j4 --output-on-failure
It will create two tests and two benchmarks test_sort, test_select,
benchmark_sort, benchmark_select. Use them to validate or contribute. You
can also use ctest.
Documentation
There are several selection algorithms available, further $n$ is the number of elements in the array, $k$ is the selection element that is needed to be found (all algorithms are deterministic and not stable unless otherwise is specified):
| Name | Average | Best Case | Worst Case | Comparisons | Memory |
|------------------------- |--------------------------------------------------------------------------------------------------------- |--------------------------------------------------------------------------------------------------------- |----------------------------------------------------------------------------------------------------------------------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |--------------------------------------------------------------------------------------------------------------------------------- |
| pdqselect | $O(n)$ | $O(n)$ | $O(n\log n)$ | At least $2n$. Random data $2.5n$ | $O(1)$ |
| Floyd-Rivest | $O(n)$ | $O(n)$ | $O(n^2)$ | Avg: $n + \min(k, n - k) + O(\sqrt{n \log n})$ | $O(\log\log n)$ |
| Median Of Medians | $O(n)$ | $O(n)$ | $O(n)$ | Between $2n$ and $22n$. Random data $2.5n$ | $O(\log n)$ |
| Median Of Ninthers | $O(n)$ | $O(n)$ | $O(n)$ | Between $2n$ and $21n$. Random data $2n$ | $O(\log n)$ |
| Median Of 3 Random | $O(n)$ | $O(n)$ | $O(n^2)$ | At least $2n$. Random data $3n$ | $O(\log n)$ |
| HeapSelect | $O(n\log k)$ | $O(n)$ | $O(n\log k)$ | $n\log k$ on average, for some data patterns might be better | $O(1)$ |
| libstdc++ (introselect) | $O(n)$ | $O(n)$ | $O(n\log n)$ | At least $2n$. Random data $3n$ | $O(1)$ |
| libc++ (median of 3) | $O(n)$ | $O(n)$ | $O(n^2)$ | At least $2n$. Random data $3n$ | $O(1)$ |
For sorting the situation is similar except every line adds $O(k\log k)$ comparisons and pdqselect is using $O(\log n)$ memory.
API
All functions end either in select, either in partial_sort and
their behavior is exactly the same as for
std::nth_element
and std::partial_sort
respectively, i.e. they accept 3 arguments as first, middle, end iterators
and an optional comparator. Several notes:
- You should not throw exceptions from
Comparefunction. Standard library also does not specify the behavior in that matter. - We don't support ParallelSTL for now.
- C++20 constexpr specifiers might be added but currently we don't have them because of some floating point math in several algorithms.
- All functions are in the
miniselectnamespace. See the example for that.
- pdqselect
- This algorithm is based on
pdqsortwhich is acknowledged as one of the fastest generic sort algorithms. - Location:
miniselect/pdqselect.h. - Functions:
pdqselect,pdqselect_branchless,pdqpartial_sort,pdqpartial_sort_branchless. Branchless version uses branchless partition algorithm provided bypdqsort. Use it if your comparison function is branchless, it might give performance for very big ranges. - Performance advice: Use it when you need to sort a big chunk so that $k$ is close to $n$.
- This algorithm is based on
- Floyd-Rivest
- This algorithm is based on Floyd-Rivest algorithm.
- Location:
miniselect/floyd_rivest_select.h. - Functions:
floyd_rivest_select,floyd_rivest_partial_sort. - Performance advice: Given that this algorithm perfor
