# Hemi: Simpler, More Portable CUDA C++
<img align="right" src="https://raw.github.com/harrism/hemi/master/hemi-logo-transparent.png" width="272" height="152"/> Hemi simplifies writing portable CUDA C/C++ code. With Hemi,
- you can write parallel kernels like you write for loops—in line in your CPU code—and run them on your GPU;
- you can easily write code that compiles and runs either on the CPU or GPU;
- you can easily launch C++ Lambda functions as GPU kernels;
- kernel launch configuration details like thread block size and grid size are optimization details, rather than requirements.
With Hemi, parallel code for the GPU can be as simple as the parallel_for loop in the following code, which can also be compiled and run on the CPU.
```cpp
void saxpy(int n, float a, const float *x, float *y)
{
  hemi::parallel_for(0, n, [=] HEMI_LAMBDA (int i) {
    y[i] = a * x[i] + y[i];
  });
}
```
## Current Version

This is version 2.0 (HEMI_VERSION == 200000).
## Hemi on GitHub

Hemi's home is http://harrism.github.io/hemi/, where you can find the latest changes and information.
## Blog Posts

Read about Hemi 2 on the NVIDIA Parallel Forall blog; an older post covers Hemi 1.0.
## Requirements
Hemi 2 requires a host compiler with support for C++11 or later. For CUDA device execution, Hemi requires CUDA 7.0 or later. To launch lambda expressions on the GPU using hemi::launch() or hemi::parallel_for(), Hemi requires CUDA 7.5 or later with experimental support for "extended lambdas" (enabled using the nvcc command line option --expt-extended-lambda).
## Installing
Once you clone Hemi from GitHub, you need to sync its submodules in order to use the range helper (range.hpp) and to run tests based on the Google Test framework. To do so, run the following Git commands in your hemi base directory:
```shell
git submodule init
git submodule update
```
Hemi is a header-only library, so there is no further installation required. Simply include the hemi headers for the features you need. See the examples for demonstration.
## Features

### GPU Lambdas and Parallel For
CUDA 7.5 provides an experimental feature, "GPU Lambdas", which enables C++11 Lambda functions with __device__ annotation to be defined in host code and passed to kernels running on the device. Hemi 2 leverages this feature to provide the hemi::parallel_for function which, when compiled for the GPU, launches a parallel kernel which executes the provided GPU lambda function as the body of a parallel loop. When compiled for the CPU, the lambda is executed as the body of a sequential CPU loop. This makes parallel functions nearly as easy to write as a for loop, as the following code shows:
```cpp
parallel_for(0, 100, [] HEMI_LAMBDA (int i) {
  printf("%d\n", i);
});
```
GPU Lambdas can also be launched directly on the GPU using hemi::launch:
```cpp
hemi::launch([=] HEMI_LAMBDA() {
  printf("Hello World from Lambda in thread %d of %d\n",
         hemi::globalThreadIndex(),
         hemi::globalThreadCount());
});
```
### Portable Parallel Execution
hemi::launch can also be used to portably launch function objects (or functors), which are objects of classes that define an operator() member. To be launched on the GPU, the operator() should be declared with HEMI_DEV_CALLABLE_MEMBER. To make this easy, Hemi 2 provides the convenience macro HEMI_KERNEL_FUNCTION(). The simple example hello.cpp demonstrates its use:
```cpp
HEMI_KERNEL_FUNCTION(hello) {
  printf("Hello World from thread %d of %d\n",
         hemi::globalThreadIndex(),
         hemi::globalThreadCount());
}

int main(void) {
  hello hi;
  hemi::launch(hi);          // launch on the GPU
  hemi::deviceSynchronize(); // make sure print flushes before exit
  hi();                      // call on CPU
  return 0;
}
```
As you can see, HEMI_KERNEL_FUNCTION() actually defines a functor which must be instantiated. Once instantiated, it can either be launched on the GPU or called from the CPU.
You can define portable CUDA kernel functions using HEMI_LAUNCHABLE, which defines the function using CUDA __global__ when compiled using nvcc, or as a normal host function otherwise. Launch these functions portably using hemi::cudaLaunch(). The example hello_global.cu demonstrates:
```cpp
HEMI_LAUNCHABLE void hello() {
  printf("Hello World from thread %d of %d\n",
         hemi::globalThreadIndex(),
         hemi::globalThreadCount());
}

int main(void) {
  hemi::cudaLaunch(hello);
  hemi::deviceSynchronize(); // make sure print flushes before exit
  return 0;
}
```
### Automatic Execution Configuration
In both of the examples in the previous section, the execution configuration (the number of thread blocks and size of each block) is automatically decided by Hemi based on the GPU it is running on. In general, when compiled for the GPU, hemi::launch(), hemi::cudaLaunch() and hemi::parallel_for() will choose a grid configuration that occupies all multiprocessors (SMs) on the GPU.
Automatic Execution Configuration is flexible, though. You can explicitly specify the entire execution configuration (grid size, thread block size, and dynamic shared memory allocation), or you can specify only part of it. For example, you might need to fix just the thread block size. Hemi makes it easy to take full control when you need it for performance tuning, but when you are getting started parallelizing your code, or for functions where ultimate performance is not crucial, you can let Hemi configure the parallelism for you.
As an example, the nbody_vec4 example provides an optimized version of its main kernel that tiles data in CUDA shared memory. For this, it needs to specify the block size and shared memory allocation explicitly.
```cpp
const int blockSize = 256;
hemi::ExecutionPolicy ep;
ep.setBlockSize(blockSize);
ep.setSharedMemBytes(blockSize * sizeof(Vec4f));
hemi::cudaLaunch(ep, allPairsForcesShared, forceVectors, bodies, N);
```
However, note that the number of blocks in the grid is left to Hemi to choose at run time.
### Simple Grid-Stride Loops
A common design pattern in writing scalable, portable parallel CUDA kernels is to use grid-stride loops. Grid-stride loops let you decouple the size of your CUDA grid from the data size it is processing, resulting in less coupling between your host and device code. This also has portability and debugging benefits.
Hemi 2 includes a grid-stride range helper, grid_stride_range(), which makes it trivial to use C++11 range-based for loops to iterate in parallel. grid_stride_range() can be used in traditional CUDA kernels, such as the following saxpy kernel, or it can be combined with other Hemi portability features (in fact it is used in the implementation of hemi::parallel_for()).
```cpp
__global__
void saxpy(int n, float a, float *x, float *y)
{
  for (auto i : grid_stride_range(0, n)) {
    y[i] = a * x[i] + y[i];
  }
}
```
## hemi/hemi.h
The hemi.h header provides simple macros that are useful for reusing code between CUDA C/C++ and C/C++ written for other platforms (e.g. CPUs).
The macros are used to decorate function prototypes and variable declarations so that they can be compiled by either NVCC or a host compiler (for example gcc or cl.exe, the MS Visual Studio compiler).
The macros can be used within .cu, .cuh, .cpp, .h, and .inl files to define code that can be compiled either for the host (e.g., CPU) or the device (e.g., GPU).
## hemi/array.h
One of the biggest challenges in writing portable CUDA code is memory management. Hemi provides the hemi::Array C++ template class as a simple data management wrapper which allows arrays of arbitrary type to be created and used with both host and device code. hemi::Array maintains a host and a device pointer for each array. It lazily transfers data between the host and device as needed when the user requests a pointer to the host or device memory. Pointer requests specify read-only, read/write, or write-only options so that valid flags can be maintained and data is only copied when the requested pointer is invalid.
For example, here is an excerpt from the nbody_vec4 example.
```cpp
hemi::Array<Vec4f> bodies(N, true);
hemi::Array<Vec4f> forceVectors(N, true);

randomizeBodies(bodies.writeOnlyHostPtr(), N);

// Call host function defined in a .cpp compilation unit
allPairsForcesHost(forceVectors.writeOnlyHostPtr(), bodies.readOnlyHostPtr(), N);

printf("CPU: Force vector 0: (%0.3f, %0.3f, %0.3f)\n",
       forceVectors.readOnlyHostPtr()[0].x,
       forceVectors.readOnlyHostPtr()[0].y,
       forceVectors.readOnlyHostPtr()[0].z);

...

// Call device function defined in a .cu compilation unit
// that uses host/device shared functions and class member functions
allPairsForcesCuda(forceVectors.wri
```
