AcceleratedKernels.jl
Cross-architecture parallel algorithms for Julia's CPU and GPU backends: multithreaded CPUs, and GPUs via Intel oneAPI, AMD ROCm, Apple Metal, and Nvidia CUDA.
"We need more speed" - Lightning McQueen or Scarface, I don't know
Parallel algorithm building blocks for the Julia ecosystem, targeting multithreaded CPUs, and GPUs via Intel oneAPI, AMD ROCm, Apple Metal and Nvidia CUDA (and any future backends added to the JuliaGPU organisation) from a unified KernelAbstractions.jl codebase.
A Uniform API, Everywhere
Offering standard library algorithms (e.g., sort, mapreduce, accumulate), higher-order functions (e.g., sum, cumprod, any), and cross-architecture custom loops (foreachindex, foraxes), AcceleratedKernels.jl lets you write high-performance code once and run it on all supported architectures — no separate or special-cased kernels needed. It’s the classic “write once, run everywhere” principle, but supercharged for modern parallel CPU and GPU computing.
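For instance, here is a minimal sketch of that drop-in style on a plain CPU array; the same calls are intended to run unchanged on a CuArray, ROCArray, oneArray, or MtlArray (assuming the corresponding backend package is loaded):

```julia
import AcceleratedKernels as AK

x = collect(1.0:1000.0)                   # plain CPU Array

s = AK.sum(x)                             # parallel sum
m = AK.mapreduce(abs2, +, x; init=0.0)    # fused map + reduce
```

Swapping `x` for a GPU array is the only change needed to move this code onto an accelerator.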
Supported backends and tested configurations:
- CPU single- and multi-threaded: Julia LTS, Stable, and Pre-Release on x86, x64, and aarch64 (Windows, Ubuntu, MacOS)
- CUDA: Julia v1.10, v1.11
- AMDGPU: Julia v1.10, v1.11
- oneAPI: Julia v1.10, v1.11
- Metal: Julia v1.10, v1.11

- 1. What's Different?
- 2. Status
- 3. Benchmarks
- 4. Functions Implemented
- 5. API and Examples
- 6. Custom Structs
- 7. Testing
- 8. Issues and Debugging
- 9. Roadmap / Future Plans
- 10. References
- 11. Acknowledgements
- 12. License
1. What's Different?
As far as I am aware, this is the first cross-architecture parallel standard library built from a unified codebase - that is, the algorithms are written as backend-agnostic KernelAbstractions.jl kernels, which are then transpiled to each GPU backend, so we benefit from all the optimisations of the native platforms and official compiler stacks. For example, unlike open standards such as OpenCL, which require GPU vendors to implement the API for their hardware, we target the existing official compilers. And while performance-portability libraries like Kokkos and RAJA are powerful for large C++ codebases, they require US National Lab-level development and maintenance efforts to forward calls from a single API to separately-developed libraries such as OpenMP, CUDA Thrust, ROCm rocThrust, and oneAPI DPC++.
As a simple example, this is how a normal Julia for-loop can be converted to an accelerated kernel - for both multithreaded CPUs and Nvidia / AMD / Intel / Apple GPUs, with native performance - by changing a single line:
```julia
# Copy kernel testing throughput
function cpu_copy!(dst, src)
    for i in eachindex(src)
        dst[i] = src[i]
    end
end
```

```julia
import AcceleratedKernels as AK

# Same copy kernel, parallelised on CPU threads or any GPU backend
function ak_copy!(dst, src)
    AK.foreachindex(src) do i
        dst[i] = src[i]
    end
end
```
Again, this is only possible because of Julia's unique compilation model, the JuliaGPU organisation's work on reusable GPU backend infrastructure, and especially the KernelAbstractions.jl backend-agnostic kernel language. Thank you.
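For context, a minimal raw KernelAbstractions.jl kernel, the layer AcceleratedKernels.jl builds on, might look like this sketch (a hand-written equivalent of the copy kernel above, not AK internals):

```julia
using KernelAbstractions

# The same copy operation, written directly as a KernelAbstractions.jl kernel
@kernel function copy_kernel!(dst, @Const(src))
    i = @index(Global)
    dst[i] = src[i]
end

src = collect(1:8)
dst = zeros(Int, 8)
backend = get_backend(src)                        # CPU backend for plain Arrays
copy_kernel!(backend, 64)(dst, src; ndrange = length(src))
KernelAbstractions.synchronize(backend)
```

AcceleratedKernels.jl wraps this kind of kernel behind high-level functions so users rarely need to write the `@kernel` layer themselves.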
2. Status
The AcceleratedKernels.jl GPU sort and accumulate implementations were adopted as the official AMDGPU algorithms! The API is starting to stabilise; it follows the Julia standard library fairly closely, while additionally exposing all temporary arrays for memory reuse. For any new ideas / requests, please join the conversation on Julia Discourse or open an issue.
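As a sketch of the temporary-array reuse mentioned above (the `temp` keyword name is taken from the AK documentation; treat it as an assumption for your installed version):

```julia
import AcceleratedKernels as AK

v = rand(Float32, 1_000)
temp = similar(v)            # scratch buffer allocated once, up front

# Passing the scratch buffer avoids re-allocating it on every call
# (keyword name assumed from the AK docs)
AK.sort!(v; temp = temp)
```

Reusing the buffer matters most in hot loops on GPUs, where repeated allocation can dominate runtime.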
We have an extensive randomised test suite that we run on the CPU (single- and multi-threaded) backend on Windows, Ubuntu and MacOS for Julia LTS, Stable, and Pre-Release, plus the CUDA, AMDGPU, oneAPI and Metal backends on the JuliaGPU buildkite - the exact same tests are run on all architectures to ensure uniform interfaces.
AcceleratedKernels.jl is also a fundamental building block of applications developed at EvoPhase, so it will see continuous heavy use with industry backing. Long-term stability, performance improvements and support are priorities for us.
3. Benchmarks
Some arithmetic-heavy benchmarks are given below - see this repository for the code; our paper, with a full analysis, will be linked here once published.

See prototype/sort_benchmark.jl for a small-scale sorting benchmark code and prototype/thrust_sort for the Nvidia Thrust wrapper. The results below are from a system with Linux 6.6.30-2-MANJARO, Intel Core i9-10885H CPU, Nvidia Quadro RTX 4000 with Max-Q Design GPU, Thrust 1.17.1-1, Julia Version 1.10.4.

As a first implementation in AcceleratedKernels.jl, we are on the same order of magnitude as Nvidia's official sorter (x3.48 slower), and an order of magnitude faster (x10.19) than the Julia Base CPU radix sort (which is already one of the fastest).
The sorting algorithms can also be combined with MPISort.jl for multi-device sorting - indeed, you can co-operatively sort using both your CPU and GPU! Or use 200 GPUs on the 52 nodes of Baskerville HPC to sort 538-855 GB of data per second (comparable with the highest figure reported in literature of 900 GB/s on 262,144 CPU cores):

Hardware stats for nerds available here. Full analysis will be linked here once our paper is published.
4. Functions Implemented
Below is an overview of the currently-implemented algorithms, along with common names in other libraries to ease finding / understanding / porting code - click on a function family to see the corresponding Manual entry.
If you need other algorithms in your work that may be of general use, please open an issue.
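As one more hand-rolled example in the `foreachindex` family, a fused update kernel written once for CPU threads and GPUs (`axpy!` is a hypothetical helper defined here, not an AK function):

```julia
import AcceleratedKernels as AK

# Hypothetical helper: y .= a .* x .+ y, one body for all backends
function axpy!(a, x, y)
    AK.foreachindex(y) do i
        y[i] = a * x[i] + y[i]
    end
    y
end

x = ones(10)
y = fill(2.0, 10)
axpy!(3.0, x, y)        # every element becomes 3 * 1 + 2 = 5
```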