ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs

<h1> <img src="docs/logo/logo_ParallelStencil.png" alt="ParallelStencil.jl" width="50"> ParallelStencil.jl </h1>

ParallelStencil empowers domain scientists to write architecture-agnostic high-level code for parallel high-performance stencil computations on GPUs and CPUs. Performance similar to CUDA C / HIP can be achieved, which is typically a large improvement over the performance reached when using only [CUDA.jl] or [AMDGPU.jl] [GPU Array programming]. For example, a 2-D shallow ice solver presented at JuliaCon 2020 [[1][JuliaCon20a]] achieved nearly 20 times better performance than a corresponding [GPU Array programming] implementation; in absolute terms, it reached 70% of the theoretical upper performance bound of the Nvidia P100 GPU used, as defined by the effective throughput metric, T_eff (note that T_eff differs substantially from common throughput metrics; see section Performance metric). The GPU performance of the solver is reported in green, the CPU performance in blue:

<a id="fig_teff"> [Figure: Performance ParallelStencil Teff] </a>

ParallelStencil relies on the native kernel programming capabilities of [CUDA.jl], [AMDGPU.jl], [Metal.jl], the multi-architecture [KernelAbstractions.jl] package (enabling the runtime hardware selection workflow described in Interactive prototyping with runtime hardware selection), and on [Polyester.jl] and [Base.Threads] for high-performance computations on GPUs and CPUs, respectively. It is seamlessly interoperable with [ImplicitGlobalGrid.jl], which renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs [[1][JuliaCon20a], [2][JuliaCon20b], [3][JuliaCon19], [4][PASC19]]. Moreover, ParallelStencil enables hiding communication behind computation with a simple macro call and without any particular restrictions on the package used for communication. ParallelStencil has been designed in conjunction with [ImplicitGlobalGrid.jl] for the simplest possible usage by domain scientists, making fast and interactive development of massively scalable high-performance multi-GPU applications readily accessible to them. Furthermore, we have developed a self-contained approach for "Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia" relying on ParallelStencil and [ImplicitGlobalGrid.jl] [[1][JuliaCon20a]]. ParallelStencil's ability to hide communication behind computation was showcased when close to ideal weak scaling was demonstrated for a 3-D poro-hydro-mechanical real-world application on up to 1024 GPUs on the Piz Daint supercomputer [[1][JuliaCon20a]]:

[Figure: Parallel efficiency of ParallelStencil with CUDA C backend]

A particularity of ParallelStencil is that it enables writing a single high-level Julia code that can be deployed on either a CPU or a GPU. In conjunction with [ImplicitGlobalGrid.jl], the same Julia code can even run on a single CPU thread or on thousands of GPUs/CPUs.

Beyond traditional high-performance computing, ParallelStencil supports automatic differentiation of architecture-agnostic parallel kernels relying on [Enzyme.jl], enabling both high-level and generic syntax for maximal flexibility.

Contents

Parallelization and optimization with one macro call
Stencil computations with math-close notation
50-lines example deployable on GPU and CPU
50-lines multi-xPU example
Interactive prototyping with runtime hardware selection
Seamless interoperability with communication packages and hiding communication
Support for architecture-agnostic low level kernel programming
Support for logical arrays of small arrays / structs
Support for automatic differentiation of architecture-agnostic parallel kernels
Module documentation callable from the Julia REPL / IJulia
Concise single/multi-xPU miniapps
Dependencies
Installation
Questions, comments and discussions
Your contributions
References

Parallelization and optimization with one macro call

A simple call to @parallel is enough to parallelize and optimize a function and to launch it. The package used underneath for parallelization is defined beforehand in a call to @init_parallel_stencil. Supported backends are [CUDA.jl], [AMDGPU.jl], [Metal.jl], and the multi-architecture [KernelAbstractions.jl] for running on GPU, and [Base.Threads] and [Polyester.jl] for executing on CPU. When using KernelAbstractions, the session starts on the CPU and you can switch the hardware target mid-run via select_hardware/current_hardware, as outlined in Interactive prototyping with runtime hardware selection. The following example outlines how to run parallel computations on a GPU using the native kernel programming capabilities of [CUDA.jl] underneath (omitted lines are represented with #(...), omitted arguments with ...):

#(...)
@init_parallel_stencil(CUDA,...)
#(...)
@parallel function diffusion3D_step!(...)
    #(...)
end
#(...)
@parallel diffusion3D_step!(...)

Advanced automatic optimization of fast memory usage (shared memory and registers) can be activated with the keyword argument memopt=true:

@parallel memopt=true function diffusion3D_step!(...)
    #(...)
end
#(...)
@parallel memopt=true diffusion3D_step!(...)

Note that arrays are automatically allocated on the hardware chosen for the computations (GPU or CPU) when using the provided allocation macros:

@zeros
@ones
@rand
@falses
@trues
@fill
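As a minimal sketch of how the allocation macros and @parallel fit together, the following script runs on the CPU with the Threads backend, so no GPU is needed. The kernel name `scale!`, the factor `2.0` and the array size are hypothetical choices made for illustration only; the sketch assumes that ParallelStencil is installed.

```julia
using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Threads, Float64, 3)   # CPU backend, Float64 numbers, 3-D

# Hypothetical kernel for illustration: B = c * A, element by element.
@parallel function scale!(B, A, c)
    @all(B) = c * @all(A)
    return
end

nx, ny, nz = 8, 8, 8          # illustrative array size
A = @ones(nx, ny, nz)         # allocated on the hardware selected above (here: CPU)
B = @zeros(nx, ny, nz)
@parallel scale!(B, A, 2.0)   # launch the parallelized kernel
```

Replacing Threads with CUDA in the call to @init_parallel_stencil would, under the same assumptions, allocate the arrays on the GPU and launch the same kernel there.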

Stencil computations with math-close notation

ParallelStencil provides submodules for computing finite differences in 1-D, 2-D and 3-D with a math-close notation (FiniteDifferences1D, FiniteDifferences2D and FiniteDifferences3D). Custom macros to extend the finite differences submodules or for other stencil-based numerical methods can be readily plugged in. The following example shows a complete function for computing a time step of a 3-D heat diffusion solver using FiniteDifferences3D.

#(...)
using ParallelStencil.FiniteDifferences3D
#(...)
@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2));
    return
end
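Written out per grid point, the update performed by diffusion3D_step! can be sketched as follows. The index notation is an assumption introduced here for illustration: i, j, k denote interior grid indices in x, y and z, λ stands for lam, Δt for dt, and Δx, Δy, Δz for dx, dy, dz.

```latex
T2_{i,j,k} = T_{i,j,k} + \Delta t \, \lambda \, Ci_{i,j,k} \left(
    \frac{T_{i+1,j,k} - 2\,T_{i,j,k} + T_{i-1,j,k}}{\Delta x^2}
  + \frac{T_{i,j+1,k} - 2\,T_{i,j,k} + T_{i,j-1,k}}{\Delta y^2}
  + \frac{T_{i,j,k+1} - 2\,T_{i,j,k} + T_{i,j,k-1}}{\Delta z^2}
\right)
```

Each fraction corresponds to one of @d2_xi(T)/dx^2, @d2_yi(T)/dy^2 and @d2_zi(T)/dz^2, and the restriction to interior indices corresponds to @inn.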

The macros used in this example are described in the Module documentation callable from the Julia REPL / IJulia:

julia> using ParallelStencil.FiniteDifferences3D
julia>?
help?> @inn
  @inn(A): Select the inner elements of A. Corresponds to A[2:end-1,2:end-1,2:end-1].

help?> @d2_xi
  @d2_xi(A): Compute the 2nd order differences between adjacent elements of A along the dimension x and select the inner elements of A in the remaining dimensions. Corresponds to @inn_yz(@d2_xa(A)).

Note that @d2_yi and @d2_zi perform the analogous operations to @d2_xi along the dimensions y and z, respectively.

Type ?FiniteDifferences3D in the [Julia REPL] to explore all provided macros.
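To make the help text concrete, the following plain-Julia sketch spells out what the macros correspond to in terms of ordinary array slicing. The functions `inn`, `d2_xi`, `d2_yi`, `d2_zi` and `diffusion3D_step_ref!` are hypothetical reference helpers written for this illustration; they are not part of ParallelStencil and make no attempt at performance.

```julia
# Plain-array counterparts of the macros, following the REPL help text above.
inn(A)   = A[2:end-1, 2:end-1, 2:end-1]                                    # inner elements
d2_xi(A) = A[3:end, 2:end-1, 2:end-1] .- 2 .* inn(A) .+ A[1:end-2, 2:end-1, 2:end-1]
d2_yi(A) = A[2:end-1, 3:end, 2:end-1] .- 2 .* inn(A) .+ A[2:end-1, 1:end-2, 2:end-1]
d2_zi(A) = A[2:end-1, 2:end-1, 3:end] .- 2 .* inn(A) .+ A[2:end-1, 2:end-1, 1:end-2]

# Array-programming reference for one diffusion time step (updates inner points only).
function diffusion3D_step_ref!(T2, T, Ci, lam, dt, dx, dy, dz)
    T2[2:end-1, 2:end-1, 2:end-1] .= inn(T) .+ dt .* (lam .* inn(Ci) .*
        (d2_xi(T) ./ dx^2 .+ d2_yi(T) ./ dy^2 .+ d2_zi(T) ./ dz^2))
    return
end

# Usage on a small test field T[ix,iy,iz] = ix^2, whose 2nd difference along x is 2.
nx, ny, nz = 6, 6, 6
T  = [Float64(ix^2) for ix in 1:nx, iy in 1:ny, iz in 1:nz]
T2 = copy(T)
Ci = ones(nx, ny, nz)
diffusion3D_step_ref!(T2, T, Ci, 1.0, 0.1, 1.0, 1.0, 1.0)
```

Unlike the @parallel kernel, this version allocates a temporary array for every slice and broadcast; it is meant only as a readable reference against which to check results.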

50-lines example deployable on GPU and CPU

This concise 3-D heat diffusion solver uses ParallelStencil; a simple Boolean, USE_GPU, determines whether it runs on GPU or CPU (the environment variable [JULIA_NUM_THREADS] defines how many cores are used in the latter case):

const USE_GPU = true
using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3);
else
    @init_parallel_stencil(Threads, Float64, 3);
end

@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2));
    return
end

function diffusion3D()
# Physics
lam        = 1.0;                                        # Thermal conductivity
cp_min     = 1.0;                                        # Minimal heat capacity
lx, ly, lz = 10.0, 10.0, 10.0;                           # Length of domain in dimensions x, y and z.

# Numerics
nx, ny, nz = 256, 256, 256;                              # Number of grid points in dimensions x, y and z.
nt         = 100;                                        # Number of time steps