TurtleSort

Multi-heap-sort for many small arrays, quicksort with 3 pivots for one big array, CUDA acceleration, CUDA memory compression.

Generate Convert Improve

Install / Use

/learn @tugrul512bit/TurtleSort

About this skill

Quality Score

0/100

README

TurtleSort

Quicksort algorithm boosted with optional paths for different sized data chunks with different sorting algorithms while keeping the "quick" part as main coordinator. Uses CUDA for accelerating parallel algorithms such as reductions, counting and others.

How it works

Features:

Uses 3 pivots instead of 1. Each pivot has counting-sort type of optimization and filters all duplicates of pivots for a good speedup. Duplicated elements in array make the sorting faster. Pivots are chosen from start, middle and stop indices.
CUDA acceleration with dynamic parallelism enables asynchronous computation to CPU. CPU can do more work at the same time with full bandwidth of RAM.
CUDA compression improves performance for redundancy in data.(work in progress)
Supports different data types such as int, unsigned char, long, short, etc
Already-sorted arrays are 2x slow compared to random-initialized arrays. Being only 2x slower for 64M suggests that either the constant factor in worst-case O(n^2 x c) is very good or it is not O(n^2) anymore due to said optimizations.

Performance

Up to 25x faster than single-thread std::qsort
Up to 15x faster than single-thread std::sort
Up to 3x faster than 24-thread std::sort(par_unseq,a,b)
Runs faster when array elements have duplicates (but std::sort runs even faster)
Same speed with sorted array (but std::sort runs faster so theres only 2x speedup against std::sort)

Test system: RTX4070, Ryzen7900, DDR5-6000 dual-channel RAM.

Requirements

GPU: Nvidia with dynamic-parallelism + CUDA 12 support.
Video-memory: 2GB per 32M elements for index-tracked version
RAM: only value vector and index vector are used as i/o.
C++17 supporting CUDA compiler

Compiler options:

generate relocatable device code (-rdc=true)
C++ compiler's linkage for cuda.lib (this uses CUDA driver api to enable features such as compressible memory)
compute_89, sm_89 for latest RTX cards
-lineinfo for debugging
host: /Ox highest optimization level
release mode, x64 selected

Methods

sorter.MultiSort<Type,arrSize, blockSize>(numArrays, hostPtr): sorts numArrays number of arrays each with arrSize number of elements. Doesn't support index-tracking yet.

sortVal.StartSorting(&hostVector);: sorts hostVector

sortVal.StartSorting(&hostVector,&hostIndexVector);: sorts hostVector and rearranges hostIntexVector accordingly so index values can be tracked.

Sample Code

#include"turtle-sort.cuh" 

// test program
int main()
{
    using Type = int;

    // maximum array length to sort
    const int n = 1000;
    Turtle::TurtleSort<Type> sortVal(n);
    std::vector<Type> sample = { 5,4,3,9,8,1 };
    sortVal.StartSorting(&sample);


    sortVal.Sync();
    for (auto& e : sample)
        std::cout << e << " ";
    std::cout << std::endl;

    return 0;
}

output:

Asynchronous computing...
1 3 3 5 7

Sample Benchmark Code

#include"turtle-sort.cuh" 
#include <execution>
#include<algorithm>

// test program
int main()
{


    using Type = int;
    const int n = 12*1024*1024;


    // n: number of elements supported for sorting
    // compress: (if possible) enables nvidia's compressible memory to possibly increase effective bandwidth/cache capacity
    bool compress = true;

    Turtle::TurtleSort<Type> sortVal(n, compress);
    std::vector<Type> sample = { 5,4,3,9,8,1 };
    sortVal.StartSorting(&sample);
    sortVal.Sync();
    for (auto& e : sample)
        std::cout << e << " ";
    std::cout << std::endl;
    std::cout << "Memory compression successful=" << sortVal.MemoryCompressionSuccessful() << std::endl;


    // compression disabled by default
    Turtle::TurtleSort<Type> sort(n, compress);
    std::cout << "Check GPU boost frequency if performance drops." << std::endl;

    // sample
    std::vector<Type> hostData(n);
    std::vector<int> hostIndex(n);

    // sample for std::sort, std::qsort
    struct StdData
    {
        Type data;
        int index;
    };
    std::vector<StdData> backup(n), backup2(n), backup3(n);
    for (int j = 0; j < 10; j++)
    {
        std::cout << "-------------------------" << std::endl;
        for (int i = 0; i < n; i++)
        {
            hostData[i] = rand();
            hostIndex[i] = hostData[i];
            backup[i].data = hostData[i];
            backup2[i].data = hostData[i];
            backup[i].index = hostData[i];
            backup2[i].index = hostData[i];
            backup3[i].data = hostData[i];
            backup3[i].index = hostData[i];
        }

        size_t t1, t2, t3, t4;
        {
            Turtle::Bench bench(&t1);
            sort.StartSorting(&hostData, &hostIndex);
            double t = sort.Sync();

        }


        {
            Turtle::Bench bench(&t2);
            std::qsort
            (
                backup.data(),
                backup.size(),
                sizeof(decltype(backup)::value_type),
                [](const void* x, const void* y)
                {
                    const int arg1 = static_cast<const StdData*>(x)->data;
                    const int arg2 = static_cast<const StdData*>(y)->data;

                    if (arg1 < arg2)
                        return -1;
                    if (arg1 > arg2)
                        return 1;
                    return 0;
                }
            );
        }

        {
            Turtle::Bench bench(&t3);
            std::sort(backup2.begin(), backup2.end(), [](auto& e1, auto& e2) { return e1.data < e2.data; });
        }
        {
            Turtle::Bench bench(&t4);
            std::sort(std::execution::par_unseq, backup3.begin(), backup3.end(), [](auto& e1, auto& e2) { return e1.data < e2.data; });
        }
        std::cout << "gpu: " << t1 / 1000000000.0 <<
            "   std::qsort:" << t2 / 1000000000.0 <<
            "   std::sort:" << t3 / 1000000000.0 <<
            "   std::sort(par_unseq):" << t4 / 1000000000.0 <<
            std::endl;
        bool err = false, err2 = false;


        for (int i = 0; i < n - 1; i++)
            if (hostData[i] > hostData[i + 1])
            {
                std::cout << "error at: " << i << ": " << hostData[i] << " " << hostData[i + 1] << " " << hostData[i + 2] << std::endl;
                err = true;
                j = 1000000;
                return 1;
            }


        for (int i = 0; i < n; i++)
        {
            if (hostData[i] != hostIndex[i])
            {
                err2 = true;
                j = 1000000;
                std::cout << "error: index calculation wrong" << std::endl;
                return 1;
            }
        }
        if (!err && !err2)
        {
            std::cout << "quicksort (" << n << " elements) completed successfully " << std::endl;
        }


    }
    return 0;
}

output on ryzen7900 and rtx4090:

gpu: 0.036844   std::qsort:0.775426   std::sort:0.430385   std::sort(par_unseq):0.0875657
quicksort (12582912 elements) completed successfully
-------------------------
gpu: 0.0334487   std::qsort:0.766197   std::sort:0.429026   std::sort(par_unseq):0.0793185
quicksort (12582912 elements) completed successfully
-------------------------
gpu: 0.0331019   std::qsort:0.777763   std::sort:0.430848   std::sort(par_unseq):0.0780624
quicksort (12582912 elements) completed successfully
-------------------------
gpu: 0.0329223   std::qsort:0.77231   std::sort:0.424476   std::sort(par_unseq):0.0859999
quicksort (12582912 elements) completed successfully
-------------------------
gpu: 0.0323176   std::qsort:0.765149   std::sort:0.431423   std::sort(par_unseq):0.0798207
quicksort (12582912 elements) completed successfully
-------------------------
gpu: 0.148611 <---- GPU boost is disabled by driver due to staying idle (waiting for other sorters)
quicksort (12582912 elements) completed successfully
-------------------------
gpu: 0.162047   <---- GPU boost is disabled by driver due to staying idle (waiting for other sorters)
quicksort (12582912 elements) completed successfully
-------------------------
gpu: 0.0333234   std::qsort:0.770051   std::sort:0.431438   std::sort(par_unseq):0.0776
quicksort (12582912 elements) completed successfully
-------------------------
gpu: 0.151563    <---- GPU boost is disabled by driver due to staying idle (waiting for other sorters)
quicksort (12582912 elements) completed successfully
-------------------------
gpu: 0.0327161   std::qsort:0.771654   std::sort:0.43577   std::sort(par_unseq):0.0918065
quicksort (12582912 elements) completed successfully

15x std::sort performance, 2.3x against 24-thread std::sort(std::execution::par_unseq)!

CUDA Compressible Memory Test Result

because processing a sorted array is faster

Multiple Sorting Support

#include"turtle-sort.cuh"

// test program
int main()
{
    using Type = int;

    constexpr int arrSize = 10;
    std::cout << "test" << std::endl;

    // number of cuda threads per block
    constexpr int blockSize = 64;
    int numArraysToSort = 100000 * blockSize; // has to be multiple of blockSize

    int n = arrSize * numArraysToSort;

    bool compress = false;
    Turtle::TurtleSort<Type> sorter(n, compress);
    std::vector<Type> hostData(n);

    for (int k = 0; k < 10; k++)
    {
        for (int i = 0; i < n; i++)
        {
            hostData[i] = rand();
        }



        double seconds = sorter.MultiSort<Type,arrSize, blockSize>(numArraysToSort, hostData.data());
        std::cout << "Sorting " << numArraysToSort << " arrays of " << arrSize << " elements took " << seconds << " seconds" << std::endl;
        for (int i = 0; i < numArraysToSort; i++)
        {
            for (int j = 0; j < arrSize - 1; j++)
            {
                if (hostData[i * a

Related Skills

node-connect

346.8k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

107.6k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

346.8k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

346.8k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。