ParallelStencil.jl
Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
ParallelStencil empowers domain scientists to write architecture-agnostic high-level code for parallel high-performance stencil computations on GPUs and CPUs. Performance similar to CUDA C / HIP can be achieved, which is typically a large improvement over the performance reached when using only [CUDA.jl] or [AMDGPU.jl] [GPU Array programming]. For example, a 2-D shallow ice solver presented at JuliaCon 2020 [[1][JuliaCon20a]] achieved nearly 20 times better performance than a corresponding [GPU Array programming] implementation; in absolute terms, it reached 70% of the theoretical upper performance bound of the Nvidia P100 GPU used, as defined by the effective throughput metric, T_eff (note that T_eff is very different from common throughput metrics; see section Performance metric). The GPU performance of the solver is reported in green, the CPU performance in blue:
<a id="fig_teff">
</a>
ParallelStencil relies on the native kernel programming capabilities of [CUDA.jl], [AMDGPU.jl], [Metal.jl], the multi-architecture [KernelAbstractions.jl] package (enabling the runtime hardware selection workflow described in Interactive prototyping with runtime hardware selection), and on [Polyester.jl] and [Base.Threads] for high-performance computations on GPUs and CPUs, respectively. It is seamlessly interoperable with [ImplicitGlobalGrid.jl], which renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs [[1][JuliaCon20a], [2][JuliaCon20b], [3][JuliaCon19], [4][PASC19]]. Moreover, ParallelStencil enables hiding communication behind computation with a simple macro call and without any particular restrictions on the package used for communication. ParallelStencil has been designed in conjunction with [ImplicitGlobalGrid.jl] for the simplest possible usage by domain scientists, making fast and interactive development of massively scalable high-performance multi-GPU applications readily accessible to them. Furthermore, we have developed a self-contained approach for "Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia" relying on ParallelStencil and [ImplicitGlobalGrid.jl] [[1][JuliaCon20a]]. ParallelStencil's ability to hide communication behind computation was showcased when close to ideal weak scaling was demonstrated for a 3-D poro-hydro-mechanical real-world application on up to 1024 GPUs on the Piz Daint Supercomputer [[1][JuliaCon20a]]:

A particularity of ParallelStencil is that it enables writing a single high-level Julia code that can be deployed on either a CPU or a GPU. In conjunction with [ImplicitGlobalGrid.jl], the same Julia code can even run on a single CPU thread or on thousands of GPUs/CPUs.
Beyond traditional high-performance computing, ParallelStencil supports automatic differentiation of architecture-agnostic parallel kernels relying on [Enzyme.jl], enabling both high-level and generic syntax for maximal flexibility.
Contents <!-- omit in toc -->
- Parallelization and optimization with one macro call
- Stencil computations with math-close notation
- 50-lines example deployable on GPU and CPU
- 50-lines multi-xPU example
- Interactive prototyping with runtime hardware selection
- Seamless interoperability with communication packages and hiding communication
- Support for architecture-agnostic low level kernel programming
- Support for logical arrays of small arrays / structs
- Support for automatic differentiation of architecture-agnostic parallel kernels
- Module documentation callable from the Julia REPL / IJulia
- Concise single/multi-xPU miniapps
- Dependencies
- Installation
- Questions, comments and discussions
- Your contributions
- References
Parallelization and optimization with one macro call
A simple call to @parallel is enough to parallelize and optimize a function and to launch it. The package used underneath for parallelization is defined in a call to @init_parallel_stencil beforehand. Supported backends are [CUDA.jl], [AMDGPU.jl], [Metal.jl], and the multi-architecture [KernelAbstractions.jl] for running on GPU, and [Base.Threads] and [Polyester.jl] for executing on CPU. When using KernelAbstractions, the session starts on the CPU and you can switch the hardware target mid-run via select_hardware/current_hardware, as outlined in Interactive prototyping with runtime hardware selection. The following example outlines how to run parallel computations on a GPU using the native kernel programming capabilities of [CUDA.jl] underneath (omitted lines are represented with #(...), omitted arguments with ...):
#(...)
@init_parallel_stencil(CUDA,...)
#(...)
@parallel function diffusion3D_step!(...)
    #(...)
end
#(...)
@parallel diffusion3D_step!(...)
Automatic advanced fast memory usage optimization (of shared memory and registers) can be activated with the keyword argument memopt=true:
@parallel memopt=true function diffusion3D_step!(...)
    #(...)
end
#(...)
@parallel memopt=true diffusion3D_step!(...)
Note that arrays are automatically allocated on the hardware chosen for the computations (GPU or CPU) when using the provided allocation macros:
- @zeros
- @ones
- @rand
- @falses
- @trues
- @fill
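As a minimal sketch of how these allocation macros might be used, the snippet below initializes the Threads backend (so that it runs without a GPU) and allocates a few 3-D fields. The grid size and variable names are illustrative assumptions, and the argument order of @fill (value first, then dimensions) reflects our reading of the module documentation rather than anything stated above.

```julia
using ParallelStencil
# Threads backend with Float64 in 3-D, as in the CPU branch of the 50-lines example.
@init_parallel_stencil(Threads, Float64, 3)

nx, ny, nz = 8, 8, 8            # illustrative grid size (assumption)
T  = @zeros(nx, ny, nz)         # field initialized to 0.0
Ci = @ones(nx, ny, nz)          # field initialized to 1.0
R  = @rand(nx, ny, nz)          # random field
K  = @fill(2.5, nx, ny, nz)     # constant field; argument order (value, dims...) is assumed here
M  = @trues(nx, ny, nz)         # Boolean mask, all true
```

Switching the first argument of @init_parallel_stencil to a GPU package should place the same arrays on the GPU without changing the allocation lines.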
Stencil computations with math-close notation
ParallelStencil provides submodules for computing finite differences in 1-D, 2-D and 3-D with a math-close notation (FiniteDifferences1D, FiniteDifferences2D and FiniteDifferences3D). Custom macros to extend the finite differences submodules or for other stencil-based numerical methods can be readily plugged in. The following example shows a complete function for computing a time step of a 3-D heat diffusion solver using FiniteDifferences3D.
#(...)
using ParallelStencil.FiniteDifferences3D
#(...)
@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2));
    return
end
The macros used in this example are described in the Module documentation callable from the Julia REPL / IJulia:
julia> using ParallelStencil.FiniteDifferences3D
julia> ?
help?> @inn
@inn(A): Select the inner elements of A. Corresponds to A[2:end-1,2:end-1,2:end-1].
help?> @d2_xi
@d2_xi(A): Compute the 2nd order differences between adjacent elements of A along dimension x and select the inner elements of A in the remaining dimensions. Corresponds to @inn_yz(@d2_xa(A)).
Note that @d2_yi and @d2_zi perform the analogous operations to @d2_xi along dimensions y and z, respectively.
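To make the documented correspondences concrete, here is a plain-Julia sketch that mirrors them with ordinary array slicing. The function names inn, d2_xa, inn_yz and d2_xi are hypothetical helpers written for illustration, not part of ParallelStencil, and the bodies of d2_xa and inn_yz are assumptions inferred from the descriptions above.

```julia
# Plain functions mirroring the documented macro semantics (hypothetical helpers).
inn(A)    = A[2:end-1, 2:end-1, 2:end-1]    # documented equivalent of @inn(A)
d2_xa(A)  = A[3:end, :, :] .- 2 .* A[2:end-1, :, :] .+ A[1:end-2, :, :]  # assumed meaning of @d2_xa(A)
inn_yz(A) = A[:, 2:end-1, 2:end-1]          # assumed meaning of @inn_yz(A)
d2_xi(A)  = inn_yz(d2_xa(A))                # documented equivalent of @d2_xi(A)

# Test field: quadratic in x, linear in y and z, on a 4 x 5 x 6 grid.
A = [Float64(ix^2 + 10iy + 100iz) for ix in 1:4, iy in 1:5, iz in 1:6]

size(inn(A))     # (2, 3, 4)
size(d2_xi(A))   # (2, 3, 4): same shape as inn(A)
d2_xi(A)[1, 1, 1]  # 2.0, the second difference of ix^2
```

Because @inn and @d2_xi produce arrays of the same inner shape, the terms of an update such as the one in diffusion3D_step! line up element by element.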
Type ?FiniteDifferences3D in the [Julia REPL] to explore all provided macros.
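Read with the semantics documented above, the update in diffusion3D_step! can be written as an explicit loop over the inner grid points. The following is a serial plain-Julia reference sketch under that assumption; the name diffusion3D_step_ref! is hypothetical, and the sketch says nothing about how ParallelStencil generates or schedules the actual kernel.

```julia
# Serial reference for the update expressed by diffusion3D_step! (illustrative sketch).
# Assumption: @inn selects inner points and @d2_xi/@d2_yi/@d2_zi are second differences along x/y/z.
function diffusion3D_step_ref!(T2, T, Ci, lam, dt, dx, dy, dz)
    nx, ny, nz = size(T)
    for iz in 2:nz-1, iy in 2:ny-1, ix in 2:nx-1   # ix is the innermost (fastest) loop
        d2x = T[ix+1, iy, iz] - 2 * T[ix, iy, iz] + T[ix-1, iy, iz]
        d2y = T[ix, iy+1, iz] - 2 * T[ix, iy, iz] + T[ix, iy-1, iz]
        d2z = T[ix, iy, iz+1] - 2 * T[ix, iy, iz] + T[ix, iy, iz-1]
        T2[ix, iy, iz] = T[ix, iy, iz] + dt * (lam * Ci[ix, iy, iz] * (d2x / dx^2 + d2y / dy^2 + d2z / dz^2))
    end
    return nothing
end

# Usage on a tiny grid: a field quadratic in x has a constant second difference of 2 along x.
T  = [Float64(ix^2) for ix in 1:5, iy in 1:5, iz in 1:5]
T2 = copy(T)
Ci = ones(5, 5, 5)
diffusion3D_step_ref!(T2, T, Ci, 1.0, 0.1, 1.0, 1.0, 1.0)
```

Boundary values of T2 are left untouched, matching the fact that only @inn(T2) is assigned in the ParallelStencil version.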
50-lines example deployable on GPU and CPU
This concise 3-D heat diffusion solver uses ParallelStencil, and a simple boolean USE_GPU defines whether it runs on GPU or CPU (in the latter case, the environment variable [JULIA_NUM_THREADS] defines how many cores are used):
const USE_GPU = true
using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3);
else
    @init_parallel_stencil(Threads, Float64, 3);
end
@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2));
    return
end
function diffusion3D()
    # Physics
    lam        = 1.0;                 # Thermal conductivity
    cp_min     = 1.0;                 # Minimal heat capacity
    lx, ly, lz = 10.0, 10.0, 10.0;    # Length of domain in dimensions x, y and z
    # Numerics
    nx, ny, nz = 256, 256, 256;       # Number of grid points in dimensions x, y and z
    nt         = 100;                 # Number of time steps
