SkillAgentSearch skills...

Quadsort

Quadsort is a branchless stable adaptive mergesort faster than quicksort.

Install / Use

/learn @scandum/Quadsort
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Intro

This document describes a stable bottom-up adaptive branchless merge sort named quadsort. A visualisation and benchmarks are available at the bottom.

The quad swap analyzer

Quadsort starts out with an analyzer that has the following tasks:

  1. Detect ordered data with minimal comparisons.
  2. Detect reverse order data with minimal comparisons.
  3. Do the above without impacting performance on random data.
  4. Exit the quad swap analyzer with sorted blocks of 8 elements.

Ordered data handling

Quadsort's analyzer examines the array 8 elements at a time. It performs 4 comparisons on elements (1,2), (3,4), (5,6), and (7,8) of which the result is stored and a bitmask is created with a value between 0 and 15 for all the possible combinations. If the result is 0 it means the 4 comparisons were all in order.

What remains is 3 more comparisons on elements (2,3), (4,5), and (6,7) to determine if all 8 elements are in order. Traditional sorts would do this with 7 branched individual comparisons, which should result in 3.2 branch mispredictions on random data on average. Using quadsort's method should result in 0.2 branch mispredictions on random data on average.

If the data is in order quadsort moves on to the next 8 elements. If the data turns out to be neither in order or in reverse order, 4 branchless swaps are performed using the stored comparison results, followed by a branchless parity merge. More on that later.

Reverse order handling

Reverse order data is typically moved using a simple reversal function, as following.

int reverse(int array[], int start, int end, int swap)
{
    while (start < end)
    {
        swap = array[start];
        array[start++] = array[end];
        array[end--] = swap;
    }
}

While random data can only be sorted using n log n comparisons and n log n moves, reverse-order data can be sorted using n comparisons and n moves through run detection and a reversal routine. Without run detection the best you can do is sort it in n comparisons and n log n moves.

Run detection, as the name implies, comes with a detection cost. Thanks to the laws of probability a quad swap can cheat however. The chance of 4 separate pairs of elements being in reverse order is 1 in 16. So there's a 6.25% chance quadsort makes a wasteful run check.

What about run detection for in-order data? While we're turning n log n moves into n moves with reverse order run detection, we'd be turning 0 moves into 0 moves with forward run detection. So there's no point in doing so.

The next optimization is to write the quad swap analyzer in such a way that we can perform a simple check to see if the entire array was in reverse order, if so, the sort is finished.

At the end of the loop the array has been turned into a series of ordered blocks of 8 elements.

Ping-Pong Quad Merge

Most textbook mergesort examples merge two blocks to swap memory, then copy them back to main memory.

main memory ┌────────┐┌────────┐
            └────────┘└────────┘
                  ↓ merge ↓
swap memory ┌──────────────────┐
            └──────────────────┘
                  ↓ copy ↓
main memory ┌──────────────────┐
            └──────────────────┘

This doubles the amount of moves and we can fix this by merging 4 blocks at once using a quad merge / ping-pong merge like so:

main memory ┌────────┐┌────────┐┌────────┐┌────────┐
            └────────┘└────────┘└────────┘└────────┘
                  ↓ merge ↓           ↓ merge ↓
swap memory ┌──────────────────┐┌──────────────────┐
            └──────────────────┘└──────────────────┘
                            ↓ merge ↓
main memory ┌──────────────────────────────────────┐
            └──────────────────────────────────────┘

It is possible to interleave the two merges to swap memory for increased memory-level parallelism, but this can both increase and decrease performance.

Skipping

Just like with the quad swap it is beneficial to check whether the 4 blocks are in-order.

In the case of the 4 blocks being in-order the merge operation is skipped, as this would be pointless. Because reverse order data is handled in the quad swap there is no need to check for reverse order blocks.

This allows quadsort to sort in-order sequences using n comparisons instead of n * log n comparisons.

Parity merge

A parity merge takes advantage of the fact that if you have two n length arrays, you can fully merge the two arrays by performing n merge operations on the start of each array, and n merge operations on the end of each array. The arrays must be of equal length. Another way to describe the parity merge would be as a bidirectional unguarded merge.

The main advantage of a parity merge over a traditional merge is that the loop of a parity merge can be fully unrolled.

If the arrays are not of equal length a hybrid parity merge can be performed. One way to do so is using n parity merges where n is the size of the smaller array, before switching to a traditional merge.

Branchless parity merge

Since the parity merge can be unrolled it's very suitable for branchless optimizations to speed up the sorting of random data. Another advantage is that two separate memory regions are accessed in the same loop, allowing memory-level parallelism. This makes the routine up to 2.5 times faster for random data on most hardware.

Increasing the memory regions from two to four can result in both performance gains and performance losses.

The following is a visualization of an array with 256 random elements getting turned into sorted blocks of 32 elements using ping-pong parity merges.

quadsort visualization

Cross merge

While a branchless parity merge sorts random data faster, it sorts ordered data slower. One way to solve this problem is by using a method with a resemblance to the galloping merge concept first introduced by timsort.

The cross merge works in a similar way to the quad swap. Instead of merging two arrays two items at a time, it merges four items at a time.

┌───┐┌───┐┌───┐    ┌───┐┌───┐┌───┐            ╭───╮  ┌───┐┌───┐┌───┐
│ A ││ B ││ C │    │ X ││ Y ││ Z │        ┌───│B<X├──┤ A ││ B ││C/X│
└─┬─┘└─┬─┘└───┘    └─┬─┘└─┬─┘└───┘        │   ╰─┬─╯  └───┘└───┘└───┘
  └────┴─────────────┴────┴───────────────┘     │  ╭───╮  ┌───┐┌───┐┌───┐
                                                └──│A>Y├──┤ X ││ Y ││A/Z│
                                                   ╰─┬─╯  └───┘└───┘└───┘
                                                     │    ┌───┐┌───┐┌───┐
                                                     └────│A/X││X/A││B/Y│
                                                          └───┘└───┘└───┘

When merging ABC and XYZ it first checks if B is smaller or equal to X. If that's the case A and B are copied to swap. If not, it checks if A is greater than Y. If that's the case X and Y are copied to swap.

If either check is false it's known that the two remaining distributions are A X and X A. This allows performing an optimal branchless merge. Since it's known each block still has at least 1 item remaining (B and Y) and there is a high probability of the data to be random, another (sub-optimal) branchless merge can be performed.

In C this looks as following:

while (l < l_size - 1 && r < r_size - 1)
{
    if (left[l + 1] <= right[r])
    {
        swap[s++] = left[l++];
        swap[s++] = left[l++];
    }
    else if (left[l] > right[r + 1])
    {
        swap[s++] = right[r++];
        swap[s++] = right[r++];
    }
    else
    {
        a = left[l] > right[r];
        x = !a;
        swap[s + a] = left[l++];
        swap[s + x] = right[r++];
        s += 2;
    }
}

Overall the cross merge gives a decent performance gain for both ordered and random data, particularly when the two arrays are of unequal length. When two arrays are of near equal length quadsort looks 8 elements ahead, and performs an 8 element parity merge if it can't skip ahead.

Merge strategy

Quadsort will merge blocks of 8 into blocks of 32, which it will merge into blocks of 128, 512, 2048, 8192, etc.

For each ping-pong merge quadsort will perform two comparisons to see if it will be faster to use a parity merge or a cross merge, and pick the best option.

Tail merge

When sorting an array of 33 elements you end up with a sorted array of 32 elements and a sorted array of 1 element in the end. If a program sorts in intervals it should pick an optimal array size (32, 128, 512, etc) to do so.

To minimalize the impact the remainder of the array is sorted using a tail merge.

Big O

                 ┌───────────────────────┐┌───────────────────────┐
                 │comparisons            ││swap memory            │
┌───────────────┐├───────┬───────┬───────┤├───────┬───────┬───────┤┌──────┐┌─────────┐┌─────────┐
│name           ││min    │avg    │max    ││min    │avg    │max    ││stable││partition││adaptive │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│mergesort      ││n log n│n log n│n log n││n      │n      │n      ││yes   ││no       ││no       │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│quadsort       ││n      │n log n│n log n││1      │n      │n      ││yes   ││no       ││yes      │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│quicksort      ││n log n│n log n│n²     ││1      │1      │1      ││no    ││yes      ││no       │
└───────────────┘└───────┴───────┴───────┘└───────┴───────┴───────┘└──────┘└─────────┘└─────────┘

Quadsort makes n comparison

Related Skills

View on GitHub
GitHub Stars2.2k
CategoryDevelopment
Updated19h ago
Forks110

Languages

C

Security Score

100/100

Audited on Mar 27, 2026

No findings