GDRCopy

A low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology.

Introduction

While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, it is possible to use these same APIs to create perfectly valid CPU mappings of the GPU memory.

The advantage of a CPU-driven copy is its very small overhead, which can be useful when low latency is required.

What is inside

GDRCopy offers the infrastructure to create user-space mappings of GPU memory, which can then be manipulated as if it were plain host memory (caveats apply here).

A simple by-product of it is a copy library with the following characteristics:

Very low overhead, as it is driven by the CPU. For reference, a cudaMemcpy currently incurs a 6-7 us overhead.
An initial memory pinning phase is required, which is potentially expensive: 10 us to 1 ms depending on the buffer size.
Fast H-D, because of write-combining. H-D bandwidth is 6-8 GB/s on Ivy Bridge Xeon, but it is subject to NUMA effects.
Slow D-H, because the GPU BAR, which backs the mappings, can't be prefetched, so burst read transactions are not generated over PCIe.
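
To put the pinning cost in perspective, the following sketch estimates how many small copies it takes before the one-time pinning cost is repaid by the lower per-copy overhead. The helper name breakeven_copies is hypothetical, and the 90 ns per-copy figure is an assumption taken loosely from the gdrcopy_copylat sample output in the Tests section; measure on your own system before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: number of copies after which pinning pays off.
# All arguments are in nanoseconds so plain integer arithmetic is enough.
#   $1 = one-time pinning cost
#   $2 = per-copy overhead of the baseline (e.g. cudaMemcpy)
#   $3 = per-copy overhead of the CPU-driven copy (assumed value)
breakeven_copies() {
    pin_ns=$1
    saved_ns=$(( $2 - $3 ))
    if [ "$saved_ns" -le 0 ]; then
        # The CPU-driven copy saves nothing per call, so pinning never pays off.
        echo "never"
        return 0
    fi
    # Ceiling division: smallest n with n * saved_ns >= pin_ns.
    echo $(( (pin_ns + saved_ns - 1) / saved_ns ))
}

breakeven_copies 10000 6000 90     # 10 us pinning cost, prints 2
breakeven_copies 1000000 6000 90   # 1 ms pinning cost, prints 170
```

Under these assumed numbers, a buffer that is pinned once and reused for a few hundred small copies already comes out ahead; a buffer that is pinned, used once and released does not.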

The library comes with a few tests like:

gdrcopy_sanity, which contains unit tests for the library and the driver.
gdrcopy_copybw, a minimal application which calculates the R/W bandwidth for a specific buffer size.
gdrcopy_copylat, a benchmark application which calculates the R/W copy latency for a range of buffer sizes.
gdrcopy_apiperf, an application for benchmarking the latency of each GDRCopy API call.
gdrcopy_pplat, a benchmark application which calculates the round-trip ping-pong latency between GPU and CPU.

Requirements

GPUDirect RDMA requires an NVIDIA Data Center GPU or NVIDIA RTX GPU (formerly Tesla and Quadro) based on Kepler or a newer generation; see GPUDirect RDMA. For more general information, please refer to the official GPUDirect RDMA design document.

The device driver requires GPU display driver >= 418.40 on ppc64le and >= 331.14 on other platforms. The library and tests require CUDA >= 6.0.
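
A quick way to check the driver requirement before building is sketched below. The helper names version_ge and min_driver_for_arch are hypothetical; the script assumes GNU sort (for `sort -V`) and, for the live check, that nvidia-smi is on the PATH.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum driver version stated above for a given machine architecture.
min_driver_for_arch() {
    case "$1" in
        ppc64le) echo "418.40" ;;
        *)       echo "331.14" ;;
    esac
}

# Live check, skipped when nvidia-smi is not available.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    required=$(min_driver_for_arch "$(uname -m)")
    if version_ge "$installed" "$required"; then
        echo "driver $installed OK (need >= $required)"
    else
        echo "driver $installed too old (need >= $required)"
    fi
fi
```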

DKMS is a prerequisite for installing the GDRCopy kernel module package. On RHEL or SLE, however, users have the option to build a kmod package and install it instead of the DKMS package. See the Build and installation section for more details.

# On RHEL
# dkms can be installed from epel-release. See https://fedoraproject.org/wiki/EPEL.
$ sudo yum install dkms

# On Debian - No additional dependency

# On SLE / Leap
# On SLE dkms can be installed from PackageHub.
$ sudo zypper install dkms rpmbuild

CUDA and the GPU display driver must be installed before building and/or installing GDRCopy. The installation instructions can be found at https://developer.nvidia.com/cuda-downloads.

GPU display driver header files are also required. They are installed as part of the driver (or CUDA) installation with the runfile. If you install the driver via package management, we suggest:

On RHEL, sudo dnf module install nvidia-driver:latest-dkms.
On Debian, sudo apt install nvidia-dkms-<your-nvidia-driver-version>.
On SLE, sudo zypper install nvidia-gfx<your-nvidia-driver-version>-kmp.

The supported architectures are Linux x86_64, ppc64le, and arm64. The supported platforms are RHEL8, RHEL9, Ubuntu20_04, Ubuntu22_04, SLE-15 (any SP) and Leap 15.x.

Root privileges are necessary to load/install the kernel-mode device driver.

Build and installation

We provide three ways to build and install GDRCopy.

rpm package

# For RHEL:
$ sudo yum groupinstall 'Development Tools'
$ sudo yum install dkms rpm-build make

# For SLE:
$ sudo zypper in dkms rpmbuild

$ cd packages
$ CUDA=<cuda-install-top-dir> ./build-rpm-packages.sh
$ sudo rpm -Uvh gdrcopy-kmod-<version>dkms.noarch.<platform>.rpm
$ sudo rpm -Uvh gdrcopy-<version>.<arch>.<platform>.rpm
$ sudo rpm -Uvh gdrcopy-devel-<version>.noarch.<platform>.rpm

The DKMS package is the default kernel module package that build-rpm-packages.sh generates. To create a kmod package instead, pass the -m option to the script. Unlike the DKMS package, the kmod package contains a prebuilt GDRCopy kernel module which is specific to the NVIDIA driver version and the Linux kernel version used to build it.

deb package

$ sudo apt install build-essential devscripts debhelper fakeroot pkg-config dkms
$ cd packages
$ CUDA=<cuda-install-top-dir> ./build-deb-packages.sh
$ sudo dpkg -i gdrdrv-dkms_<version>_<arch>.<platform>.deb
$ sudo dpkg -i libgdrapi_<version>_<arch>.<platform>.deb
$ sudo dpkg -i gdrcopy-tests_<version>_<arch>.<platform>.deb
$ sudo dpkg -i gdrcopy_<version>_<arch>.<platform>.deb

from source

$ make prefix=<install-to-this-location> CUDA=<cuda-install-top-dir> all install
$ sudo ./insmod.sh
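
After loading the driver, it can be worth confirming that the module is actually present. The sketch below checks a /proc/modules-style listing; the helper name module_loaded is hypothetical, and the module name gdrdrv is taken from the driver name used in this document.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named module appears in a
# /proc/modules-style listing (the module name is the first field).
#   $1 = module name, $2 = listing file (defaults to /proc/modules)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Live check, skipped on systems without /proc/modules.
if [ -r /proc/modules ]; then
    if module_loaded gdrdrv; then
        echo "gdrdrv is loaded"
    else
        echo "gdrdrv is not loaded"
    fi
fi
```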

Notes

Compiling the gdrdrv driver requires the NVIDIA driver source code, which is typically installed at /usr/src/nvidia-<version>. Our Makefile automatically detects and picks that source code. If multiple versions are installed, you can pass the correct path by defining the NVIDIA_SRC_DIR variable, e.g. export NVIDIA_SRC_DIR=/usr/src/nvidia-520.61.05/nvidia before building the gdrdrv module.
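
When several driver source trees are installed, the selection can be scripted. This is a minimal sketch, assuming the /usr/src/nvidia-<version>/nvidia layout implied by the example path above; the helper name nvidia_src_dir_for is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the driver source directory for one version,
# or fail if that directory does not exist.
#   $1 = driver version, e.g. 520.61.05
#   $2 = root directory to search (defaults to /usr/src)
nvidia_src_dir_for() {
    dir="${2:-/usr/src}/nvidia-$1/nvidia"
    if [ -d "$dir" ]; then
        echo "$dir"
    else
        return 1
    fi
}

# Usage: select the sources for one specific version before building.
# if NVIDIA_SRC_DIR=$(nvidia_src_dir_for 520.61.05); then
#     export NVIDIA_SRC_DIR
# fi
```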

There are two major flavors of the NVIDIA driver: 1) proprietary, and 2) opensource. We detect the flavor when compiling gdrdrv based on the source code of the NVIDIA driver. Different flavors come with different features and restrictions:

gdrdrv compiled with the opensource flavor will provide functionality and high performance on all platforms. However, you will not be able to load this gdrdrv driver when the proprietary NVIDIA driver is loaded.
gdrdrv compiled with the proprietary flavor can always be loaded regardless of the flavor of NVIDIA driver you have loaded. However, it may have suboptimal performance on coherent platforms such as Grace-Hopper. Functionally, it will not work correctly on Intel CPUs with a Linux kernel built with confidential compute (CC) support, i.e. CONFIG_ARCH_HAS_CC_PLATFORM=y, when CC is enabled at runtime.
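
To see which flavor of the NVIDIA driver is installed on a machine, one possible approach is to inspect the license string of the nvidia kernel module. This sketch rests on an assumption that is not stated in this document: that the open kernel modules declare a GPL-compatible license (such as "Dual MIT/GPL") while the proprietary module declares "NVIDIA". The helper name flavor_from_license is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map a kernel-module license string to a driver flavor.
# Assumption: open kernel modules report a license containing "GPL",
# the proprietary module reports exactly "NVIDIA".
flavor_from_license() {
    case "$1" in
        *GPL*)  echo "opensource" ;;
        NVIDIA) echo "proprietary" ;;
        *)      echo "unknown" ;;
    esac
}

# Live check, skipped when modinfo is not available.
if command -v modinfo >/dev/null 2>&1; then
    license=$(modinfo -F license nvidia 2>/dev/null || true)
    echo "NVIDIA driver flavor: $(flavor_from_license "$license")"
fi
```

If the result is "unknown", fall back to checking how the driver was installed rather than trusting this heuristic.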

Tests

Run the provided tests:

$ gdrcopy_sanity 
Total: 28, Passed: 28, Failed: 0, Waived: 0

List of passed tests:
    basic_child_thread_pins_buffer_cumemalloc
    basic_child_thread_pins_buffer_vmmalloc
    basic_cumemalloc
    basic_small_buffers_mapping
    basic_unaligned_mapping
    basic_vmmalloc
    basic_with_tokens
    data_validation_cumemalloc
    data_validation_vmmalloc
    invalidation_access_after_free_cumemalloc
    invalidation_access_after_free_vmmalloc
    invalidation_access_after_gdr_close_cumemalloc
    invalidation_access_after_gdr_close_vmmalloc
    invalidation_fork_access_after_free_cumemalloc
    invalidation_fork_access_after_free_vmmalloc
    invalidation_fork_after_gdr_map_cumemalloc
    invalidation_fork_after_gdr_map_vmmalloc
    invalidation_fork_child_gdr_map_parent_cumemalloc
    invalidation_fork_child_gdr_map_parent_vmmalloc
    invalidation_fork_child_gdr_pin_parent_with_tokens
    invalidation_fork_map_and_free_cumemalloc
    invalidation_fork_map_and_free_vmmalloc
    invalidation_two_mappings_cumemalloc
    invalidation_two_mappings_vmmalloc
    invalidation_unix_sock_shared_fd_gdr_map_cumemalloc
    invalidation_unix_sock_shared_fd_gdr_map_vmmalloc
    invalidation_unix_sock_shared_fd_gdr_pin_buffer_cumemalloc
    invalidation_unix_sock_shared_fd_gdr_pin_buffer_vmmalloc


$ gdrcopy_copybw
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 7f1153a00000
map_d_ptr: 0x7f1172257000
info.va: 7f1153a00000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7f1172257000
writing test, size=131072 offset=0 num_iters=10000
write BW: 9638.54MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 530.135MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
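
The run above illustrates the fast H-D / slow D-H asymmetry described earlier. As a rough sketch, the ratio can be extracted from the tool's output as follows; the helper name bw_ratio is hypothetical, and the parsing assumes the "write BW:" and "read BW:" line format shown in this sample output.

```shell
#!/bin/sh
# Hypothetical helper: read gdrcopy_copybw output on stdin and print the
# write/read bandwidth ratio with one decimal place.
bw_ratio() {
    awk '
        /^write BW:/ { w = $3 + 0 }   # "9638.54MB/s" + 0 yields 9638.54
        /^read BW:/  { r = $3 + 0 }
        END {
            if (r > 0) printf "%.1f\n", w / r
            else       print "n/a"
        }'
}

# Using the two bandwidth lines from the sample run; prints 18.2
printf 'write BW: 9638.54MB/s\nread BW: 530.135MB/s\n' | bw_ratio
```

On a machine with the tests installed, the same helper could be fed directly with `gdrcopy_copybw | bw_ratio`.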


$ gdrcopy_copylat
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
device ptr: 0x7fa2c6000000
allocated size: 16777216
gpu alloc fn: cuMemAlloc

map_d_ptr: 0x7fa2f9af9000
info.va: 7fa2c6000000
info.mapped_size: 16777216
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer: 0x7fa2f9af9000

gdr_copy_to_mapping num iters for each size: 10000
WARNING: Measuring the API invocation overhead as observed by the CPU. Data
might not be ordered all the way to the GPU internal visibility.
Test             Size(B)     Avg.Time(us)
gdr_copy_to_mapping             1         0.0889
gdr_copy_to_mapping             2         0.0884
gdr_copy_to_mapping             4         0.0884
gdr_copy_to_mapping             8         0.0884
gdr_copy_to_mapping            16         0.0905
gdr
