gtensor

gtensor is a multi-dimensional array C++14 header-only library for hybrid GPU development. It was inspired by xtensor, and designed to support the GPU port of the GENE fusion code.

Features:

  • multi-dimensional arrays and array views, with easy interoperability with Fortran and thrust
  • automatically generate GPU kernels based on array operations
  • define complex reusable operations with lazy evaluation. This allows operations to be composed in different ways and evaluated once as a single kernel (see the sketch after this list)
  • easily support both CPU-only and GPU-CPU hybrid code in the same code base, with only minimal use of #ifdef.
  • multi-dimensional array slicing similar to numpy
  • GPU support for nVidia via CUDA and AMD via HIP/ROCm, and experimental Intel GPU support via SYCL.
  • [Experimental] C library cgtensor with wrappers around common GPU operations (allocate and deallocate, device management, memory copy and set)
  • [Experimental] lightweight wrappers around GPU BLAS, LAPACK, and FFT routines.

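For illustration, here is a minimal sketch of that lazy-evaluation style, using the container and shape API shown in the examples below (the variable names are illustrative):

#include <gtensor/gtensor.h>

int main() {
    gt::gtensor<double, 1> x(gt::shape(3));
    gt::gtensor<double, 1> y(gt::shape(3));
    gt::gtensor<double, 1> z(gt::shape(3));

    for (int i = 0; i < 3; i++) {
        x(i) = i + 1;
        y(i) = 2.0 * (i + 1);
    }

    auto expr = 0.5 * x + y; // unevaluated expression; no work done yet
    z = expr;                // evaluated here, in one pass (one GPU kernel
                             // when built for a device)
}
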
License

gtensor is licensed under the 3-clause BSD license. See the LICENSE file for details.

Installation (cmake)

gtensor uses cmake 3.13+ to build the tests and install:

git clone https://github.com/wdmapp/gtensor.git
cd gtensor
cmake -S . -B build -DGTENSOR_DEVICE=cuda \
  -DCMAKE_INSTALL_PREFIX=/opt/gtensor \
  -DBUILD_TESTING=OFF
cmake --build build --target install

To build for cpu/host only, use -DGTENSOR_DEVICE=host; for AMD/HIP use -DGTENSOR_DEVICE=hip -DCMAKE_CXX_COMPILER=$(which hipcc); and for Intel/SYCL use -DGTENSOR_DEVICE=sycl -DCMAKE_CXX_COMPILER=$(which dpcpp). See sections below for more device-specific requirements.

Note that gtensor can still be used by applications not using cmake - see Usage (GNU make) for an example.

To use the internal data vector implementation instead of thrust, set -DGTENSOR_USE_THRUST=OFF. This has the advantage that device array allocations will not be zero initialized, which can improve performance significantly for some workloads, particularly when temporary arrays are used.

To enable experimental C/C++ library features, set GTENSOR_BUILD_CLIB, GTENSOR_BUILD_BLAS, or GTENSOR_BUILD_FFT to ON. Note that BLAS includes some LAPACK routines for LU factorization.
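
For example, building on the configure line above:

cmake -S . -B build -DGTENSOR_DEVICE=cuda \
  -DGTENSOR_BUILD_CLIB=ON \
  -DGTENSOR_BUILD_BLAS=ON \
  -DGTENSOR_BUILD_FFT=ON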

nVidia CUDA requirements

gtensor for nVidia GPUs with CUDA requires CUDA Toolkit 10.0+.

AMD HIP requirements

gtensor for AMD GPUs with HIP requires ROCm 4.5.0+, and rocthrust and rocprim. See the ROCm installation guide for details. In Ubuntu, after setting up the ROCm repository, the required packages can be installed like this:

sudo apt install rocm-dkms rocm-dev rocthrust

The official packages install to /opt/rocm. If using a different install location, set the ROCM_PATH cmake variable. To use coarse grained managed memory, ROCm 5.0+ is required.

To use gt-fft and gt-blas, rocsolver, rocblas, and rocfft packages need to be installed as well.
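
On Ubuntu, with the ROCm repository set up as above, this is typically (package names can vary between ROCm releases):

sudo apt install rocblas rocsolver rocfft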

Intel SYCL requirements

The current SYCL implementation requires Intel OneAPI/DPC++ 2022.0 or later, with some known issues in gt-blas and gt-fft (npvt getrf/rs, 2d fft). Using the latest available release is recommended. When installing via the package manager instructions, installing the intel-oneapi-dpcpp-compiler package will pull in all required packages (the rest of the Base Toolkit is not required).

The reason for the dependence on Intel OneAPI is that the implementation uses the USM extension, which is not part of the current SYCL standard. CodePlay ComputeCpp 2.0.0 has an experimental implementation that is sufficiently different to require extra work to support.

The default device selector is always used. To control device selection, set the SYCL_DEVICE_FILTER environment variable. See the intel llvm documentation for details.
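
For example, to select the first Level Zero GPU (./myapp stands in for your application binary):

SYCL_DEVICE_FILTER=level_zero:gpu:0 ./myapp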

The port is tested with an Intel iGPU, specifically UHD Graphics 630. It may also work with the experimental CUDA backend for nVidia GPUs, but this is untested and it's recommended to use the gtensor CUDA backend instead.

Better support for other SYCL implementations like hipSYCL and ComputeCpp should be possible to add, with the possible exception of the gt-blas and gt-fft sub-libraries, which require oneMKL.

HOST CPU (no device) requirements

gtensor should build with any C++ compiler supporting C++14. It has been tested with g++ 7, 8, and 9 and clang++ 8, 9, and 10.

Advanced multi-device configuration

By default, gtensor will install support for the device specified by the GTENSOR_DEVICE variable (default cuda), and also the host (cpu only) device. This can be configured with GTENSOR_BUILD_DEVICES as a semicolon (;) separated list. For example, to build support for all four backends (assuming a machine with multi-vendor GPUs and the associated toolkits installed):

cmake -S . -B build -DGTENSOR_DEVICE=cuda \
  -DGTENSOR_BUILD_DEVICES="host;cuda;hip;sycl" \
  -DCMAKE_INSTALL_PREFIX=/opt/gtensor \
  -DBUILD_TESTING=OFF

This will cause targets to be created for each device: gtensor::gtensor_cuda, gtensor::gtensor_host, gtensor::gtensor_hip, and gtensor::gtensor_sycl. The main gtensor::gtensor target will be an alias for the default set by GTENSOR_DEVICE (the cuda target in the above example).
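
For example, a project could then build one executable per installed backend by linking each against the corresponding target (the myapp_* names here are illustrative; target_gtensor_sources is described under Usage (cmake) below):

add_executable(myapp_host)
target_gtensor_sources(myapp_host PRIVATE src/myapp.cxx)
target_link_libraries(myapp_host gtensor::gtensor_host)

add_executable(myapp_cuda)
target_gtensor_sources(myapp_cuda PRIVATE src/myapp.cxx)
target_link_libraries(myapp_cuda gtensor::gtensor_cuda)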

Usage (cmake)

Once installed, gtensor can be used by adding this to a project's CMakeLists.txt:

# if using GTENSOR_DEVICE=cuda
enable_language(CUDA)

find_package(gtensor)

# for each C++ target using gtensor
target_gtensor_sources(myapp PRIVATE src/myapp.cxx)
target_link_libraries(myapp gtensor::gtensor)

When running cmake for a project, add the gtensor install prefix to CMAKE_PREFIX_PATH. For example:

cmake -S . -B build -DCMAKE_PREFIX_PATH=/opt/gtensor

The default gtensor device, set with the GTENSOR_DEVICE cmake variable when installing gtensor, can be overridden by setting GTENSOR_DEVICE again in the client application before the call to find_package(gtensor), typically via the -D cmake command-line option. This can be useful for debugging: setting -DGTENSOR_DEVICE=host shows whether a problem is related to the hybrid device model or is algorithmic, and also allows running a host-only interactive debugger. Note that only devices specified with GTENSOR_BUILD_DEVICES at gtensor install time are available (the default device and host, if no option was specified).
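
For example, to configure a host-only debug build of an application against a gtensor installation whose default device is cuda:

cmake -S . -B build-host -DCMAKE_PREFIX_PATH=/opt/gtensor \
  -DGTENSOR_DEVICE=host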

Using gtensor as a subdirectory or git submodule

gtensor also supports usage as a subdirectory of another cmake project. This is typically done via git submodules. For example:

cd /path/to/app
git submodule add https://github.com/wdmapp/gtensor.git external/gtensor

In the application's CMakeLists.txt:

# set here or on the cmake command-line with `-DGTENSOR_DEVICE=...`.
set(GTENSOR_DEVICE "cuda" CACHE STRING "")

if (${GTENSOR_DEVICE} STREQUAL "cuda")
  enable_language(CUDA)
endif()

# after setting GTENSOR_DEVICE
add_subdirectory(external/gtensor)

# for each C++ target using gtensor
target_gtensor_sources(myapp PRIVATE src/myapp.cxx)
target_link_libraries(myapp gtensor::gtensor)

Usage (GNU make)

As a header-only library, gtensor can be integrated into an existing GNU make project as a subdirectory fairly easily for the cuda and host devices.

The subdirectory is typically managed via git submodules, for example:

cd /path/to/app
git submodule add https://github.com/wdmapp/gtensor.git external/gtensor

See examples/Makefile for a good way of organizing a project's Makefile to provide cross-device support. The examples can be built for different devices by setting the GTENSOR_DEVICE variable, e.g. cd examples; make GTENSOR_DEVICE=host.
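
As a rough sketch of the pattern (not the examples/Makefile itself; the cuda branch relies on the GTENSOR_HAVE_DEVICE define discussed below, and the exact flags should be taken from examples/Makefile):

GTENSOR_DEVICE ?= host
GTENSOR_DIR ?= external/gtensor

ifeq ($(GTENSOR_DEVICE),cuda)
  CXX = nvcc
  # treat .cxx sources as CUDA and enable the device code paths
  CXXFLAGS = -x cu -std=c++14 -DGTENSOR_HAVE_DEVICE
else
  CXX = g++
  CXXFLAGS = -std=c++14
endif

myapp: src/myapp.cxx
	$(CXX) $(CXXFLAGS) -I$(GTENSOR_DIR)/include -o $@ $<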

Getting Started

Basic Example (host CPU only)

Here is a simple example that computes a matrix with the multiplication table and prints it out row by row using array slicing:

#include <iostream>

#include <gtensor/gtensor.h>

int main(int argc, char **argv) {
    const int n = 9;
    gt::gtensor<int, 2> mult_table(gt::shape(n, n));

    for (int i=0; i<n; i++) {
        for (int j=0; j<n; j++) {
            mult_table(i,j) = (i+1)*(j+1);
        }
    }

    for (int i=0; i<n; i++) {
        std::cout << mult_table.view(i, gt::all) << std::endl;
    }
}

It can be built like this, using gcc version 5 or later:

g++ -std=c++14 -I /path/to/gtensor/include -o mult_table mult_table.cxx

and produces the following output:

{ 1 2 3 4 5 6 7 8 9 }
{ 2 4 6 8 10 12 14 16 18 }
{ 3 6 9 12 15 18 21 24 27 }
{ 4 8 12 16 20 24 28 32 36 }
{ 5 10 15 20 25 30 35 40 45 }
{ 6 12 18 24 30 36 42 48 54 }
{ 7 14 21 28 35 42 49 56 63 }
{ 8 16 24 32 40 48 56 64 72 }
{ 9 18 27 36 45 54 63 72 81 }

See the full mult_table example for different ways of performing this operation, taking advantage of more gtensor features.

GPU and CPU Example

The following program computes the vector operation a*x + y, where a is a scalar and x and y are vectors. If built with GTENSOR_HAVE_DEVICE defined and using the appropriate compiler (currently either nvcc or hipcc), it will run the computation on a GPU device.

See the full daxpy example for more detailed comments and an example of using an explicit kernel.

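A minimal host-side sketch of the same computation, using the expression syntax shown above (the full example additionally covers device allocation and host/device copies):

#include <iostream>

#include <gtensor/gtensor.h>

int main(int argc, char **argv) {
    const int n = 4;
    const double a = 0.5;

    gt::gtensor<double, 1> x(gt::shape(n));
    gt::gtensor<double, 1> y(gt::shape(n));

    for (int i = 0; i < n; i++) {
        x(i) = i + 1;
        y(i) = 10.0 * (i + 1);
    }

    // lazy elementwise expression, evaluated in a single pass on
    // assignment (a single kernel when built for a GPU device)
    y = a * x + y;

    for (int i = 0; i < n; i++) {
        std::cout << y(i) << " ";
    }
    std::cout << std::endl;
}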