Alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:

alpaka - Abstraction Library for Parallel Kernel Acceleration

The alpaka library is a header-only C++20 abstraction library for accelerator development.

Its aim is to provide performance portability across accelerators through the abstraction (not hiding!) of the underlying levels of parallelism.

It is platform independent and supports the concurrent and cooperative use of multiple devices, such as the host's CPU (x86, ARM, RISC-V and Power 8+) and GPU accelerators from different vendors (NVIDIA, AMD and Intel). A multitude of accelerator back-end variants using CUDA, HIP, SYCL, OpenMP 2.0+, std::thread and also serial execution are provided and can be selected depending on the device. Only one implementation of each user kernel is required; kernels are represented as function objects with a special interface. There is no need to write special CUDA, HIP, SYCL, OpenMP or custom threading code. Accelerator back-ends can be mixed and synchronized via compute device queues. The decision which accelerator back-end executes which kernel can be made at runtime.
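The single-source idea can be illustrated with a minimal plain C++ sketch. It does not use the alpaka API: `SerialAcc`, `ThreadAcc`, `ScaleKernel` and `scale_on` are hypothetical names invented for this illustration, and alpaka's real kernel interface, accelerator types and queues differ. The sketch only shows how one kernel function object can be executed unchanged by two different back-ends, with the back-end chosen at runtime.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a serial back-end (NOT an alpaka type):
// runs the kernel once per work item, one after another.
struct SerialAcc {
    template <typename Kernel, typename... Args>
    static void run(std::size_t n, Kernel const& kernel, Args... args) {
        for (std::size_t i = 0; i < n; ++i) {
            kernel(i, args...);
        }
    }
};

// Hypothetical stand-in for a std::thread back-end (NOT an alpaka type):
// runs the kernel once per work item, each in its own thread.
struct ThreadAcc {
    template <typename Kernel, typename... Args>
    static void run(std::size_t n, Kernel const& kernel, Args... args) {
        std::vector<std::thread> workers;
        workers.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            workers.emplace_back([&kernel, i, args...] { kernel(i, args...); });
        }
        for (auto& worker : workers) {
            worker.join();
        }
    }
};

// The kernel is written once, as a function object. It knows nothing
// about which back-end executes it.
struct ScaleKernel {
    void operator()(std::size_t i, float factor, float* data) const {
        data[i] *= factor;
    }
};

// Picks the back-end from a runtime value and runs the same kernel on it.
std::vector<float> scale_on(std::string const& backend,
                            std::vector<float> values, float factor) {
    if (backend == "threads") {
        ThreadAcc::run(values.size(), ScaleKernel{}, factor, values.data());
    } else {
        SerialAcc::run(values.size(), ScaleKernel{}, factor, values.data());
    }
    return values;
}
```

Calling `scale_on("serial", {1.f, 2.f, 3.f}, 2.f)` and `scale_on("threads", {1.f, 2.f, 3.f}, 2.f)` yields the same result, because each work item touches only its own element and the kernel body is identical in both cases.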

The abstraction used is very similar to the CUDA grid-blocks-threads domain decomposition strategy. Algorithms that should be parallelized have to be divided into a multi-dimensional grid consisting of small uniform work items. The functions that process these work items are called kernels and are executed in parallel threads. The threads in the grid are organized in blocks. All threads in a block are executed in parallel and can interact via fast shared memory and low-level synchronization methods. Blocks are executed independently and cannot interact in any way. The block execution order is unspecified and depends on the accelerator in use. By using this abstraction, the execution can be optimally adapted to the available hardware.
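The decomposition described above can be sketched in plain C++ for the one-dimensional case. This is a serial emulation under stated assumptions, not alpaka code: `blocks_for` and `block_sums` are hypothetical helpers, and the local variable `shared` merely stands in for per-block shared memory. Each block writes only its own output slot, so the result does not depend on the order in which blocks run.

```cpp
#include <cstddef>
#include <vector>

// Number of blocks needed so that numBlocks * threadsPerBlock covers n items.
std::size_t blocks_for(std::size_t n, std::size_t threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Emulates a 1D grid serially: every block sums the elements its threads
// are responsible for and stores the result in its own output slot.
std::vector<int> block_sums(std::vector<int> const& input,
                            std::size_t threadsPerBlock) {
    std::size_t const numBlocks = blocks_for(input.size(), threadsPerBlock);
    std::vector<int> sums(numBlocks, 0);

    for (std::size_t block = 0; block < numBlocks; ++block) {
        // Stand-in for block-shared memory: visible to all threads of this
        // block, invisible to every other block.
        int shared = 0;
        for (std::size_t thread = 0; thread < threadsPerBlock; ++thread) {
            // Global work-item index derived from block and thread index.
            std::size_t const global = block * threadsPerBlock + thread;
            // The last block may be only partially filled, so guard the access.
            if (global < input.size()) {
                shared += input[global];
            }
        }
        sums[block] = shared;
    }
    return sums;
}
```

For five input elements and two threads per block, `blocks_for` returns 3, and the last block contains one idle thread that the bounds check skips.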

Software License

alpaka is licensed under MPL-2.0.

Documentation

The alpaka documentation can be found in the online manual. The documentation files in .rst (reStructuredText) format are located in the docs subfolder of this repository. The source code documentation is generated with doxygen.

Accelerator Back-ends

| Accelerator Back-end | Lib/API | Devices | Execution strategy grid-blocks | Execution strategy block-threads |
|----------------------|---------|---------|--------------------------------|----------------------------------|
| Serial | n/a | Host CPU (single core) | sequential | sequential (only 1 thread per block) |
| OpenMP 2.0+ blocks | OpenMP 2.0+ | Host CPU (multi core) | parallel (preemptive multitasking) | sequential (only 1 thread per block) |
| OpenMP 2.0+ threads | OpenMP 2.0+ | Host CPU (multi core) | sequential | parallel (preemptive multitasking) |
| std::thread | std::thread | Host CPU (multi core) | sequential | parallel (preemptive multitasking) |
| TBB | TBB 2.2+ | Host CPU (multi core) | parallel (preemptive multitasking) | sequential (only 1 thread per block) |
| CUDA | CUDA 12.0+ | NVIDIA GPUs | parallel (undefined) | parallel (lock-step within warps) |
| HIP(clang) | HIP 6.0+ | AMD GPUs | parallel (undefined) | parallel (lock-step within warps) |
| SYCL(oneAPI) | oneAPI 2024.2+ | CPUs, Intel GPUs and FPGAs | parallel (undefined) | parallel (lock-step within warps) |

Supported Compilers

This library uses C++20 (or newer when available).

| Accelerator Back-end | gcc 11.1 (Linux) | gcc 12.3 (Linux) | gcc 13.1 (Linux) | clang 14 (Linux) | clang 15 (Linux) | clang 16 (Linux) | clang 17 (Linux) | clang 18 (Linux) | clang 19 (Linux) | clang 20 (Linux) | icpx 2025.0 (Linux) | Xcode 15.4 / 16.1 (macOS) | Visual Studio 2022 (Windows) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Serial | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| OpenMP 2.0+ blocks | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: [^1] | :white_check_mark: | :white_check_mark: |
| OpenMP 2.0+ threads | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: [^1] | :white_check_mark: | :white_check_mark: |
| std::thread | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| TBB | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| CUDA (nvcc) | :white_check_mark: (CUDA 12.0) | :white_check_mark: (CUDA 12.0 - 12.5) | :white_check_mark: (CUDA 12.4 - 13.0) | :white_check_mark: (CUDA 12.2 - 12.3) | :white_check_mark: (CUDA 12.9) | :white_check_mark: (CUDA 12.6 - 12.9) | :white_check_mark: (CUDA 12.5 - 12.9) | :white_check_mark: (CUDA 12.6 - 12.9) | :white_check_mark: (CUDA 12.8 - 12.9) | :white_check_mark: (CUDA 13.0) | :x: | - | :x: |
| CUDA (clang) | - | - | - | :x: | :white_check_mark: (12.9) | :x: | :x: | :x: | :x: | :x: | :x: | - | - |
| HIP (clang) | - | - | - | :x: | :x: | :x: | :white_check_mark: (HIP 6.0 - 6.1) | :white_check_mark: (HIP 6.2 - 6.3) | :white_check_mark: (HIP 6.4) | :white_check_mar
