SimdQSort
Quicksort Algorithm implemented using the AVX2 instruction set
Install / Use
/learn @mikerabat/SimdQSortREADME
SIMD Quicksort
This project implements a SIMD Quicksort algorithm that is AVX2 based. It expects input arrays that consists either of int32, int64, single (FP32) and double (FP64) elements.
The project is working with Delphi (2010 and up) and FPC in both intel flavors - x64 and x86. It is based on the work of "WojciechMula" who implemented various approaches on the SIMD Quicksort approach for int32 arrays. Check out https://github.com/WojciechMula/simd-sort for the C - version.
The algorithm here uses a left, right bound and within that bound it has 4 read write pointers.
wh--rh------------------rt--wt-
wh = write head rh = read head rt = read tail wt = write tail
The larger distance (rh - wh) (rt - wt) is used to read the next vector - this one is compared against the pivot and permuted according to the shuffle masks - then written back if a full vector has been found.
Reading from the shuffle mask is a crucial step here so the list is packed (bytes) and read via the vpmovzxbd command.
I left out the AVX512 implementations since I do not have a cpu that implements these opcodes. Newer (mobile) cpu's like the Intel ULtra 7 155H focus more on AI extensions than AVX512.
Usage
To use the code one needs to add the AVXQuickSortxy.pas to the project and add source search path to the library path so the dependencies are loaded too. This enables AVX encoding/decoding in case the AVX2 instruction set is avaliabe.
The test directory holds two projects:
- TestSort.dpr - A small unit test tool to check if the sorting is correct
- Benchmark.dpr - Benchmarking the AVX implementation against the standard implementation.
The projects work for Delphi versions 2010 up to Delphi 12.2 (most likely 12.3 too) as well as FPC.
Stats
The files were tested against the standard Delphi implementation which can be found in Quicksort.pas I created a small benchmark app that shows a speedup for random data of about 1.5 to 3.5 than the standard implementation.
Design considerations
Unfortunately Delphi does not really support the fancy intrinsics so the code needed to be pure asm and replicated.
The project could use global constants but the disassembler could not handle that well so there is another variable passed to the en/decoding routines that hold a pointer to the constants.
I basically used the disassembly routines from https://github.com/mikerabat/mrmath to get the equivalent "db instructions" for the AVX set that are missing up until Delphi 11.3.
Related Skills
node-connect
352.5kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
111.3kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
352.5kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
352.5kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
