MatrixTransposeCUDA
A simple CUDA-C application for Nvidia GPUs
Install / Use
Matrix transpose algorithm in CUDA-C
Simple example of a typical workflow to implement and optimize an algorithm running on a GPU, with the support of Nvidia's `nvprof` command line profiling tool.

NOTE: this project has been tested on Nvidia's Tegra X1 GPU (Maxwell architecture); you might observe different behaviour if you run the code on a different GPU.

NOTE: the Nvidia CUDA Toolkit is required to be installed on your machine.

NOTE: the Makefile also supports Doxygen documentation, so Doxygen is required on your machine. The Doxyfile is configured to automatically generate HTML and LaTeX documentation. The LaTeX documentation can be converted to PDF using another Makefile in the docs/latex/ folder, but this requires installing additional tools. See the documentation section.
Additional tools
The Makefile integrates Doxygen, the de facto standard tool for generating documentation from annotated C, C++, C#, Python and other source code (more information at https://www.doxygen.nl/index.html).
HTML documentation only requires Doxygen.
LaTeX (PDF) documentation requires additional tools (not mandatory if you do not need to generate the reference manual as a PDF).
Install Doxygen
Install Doxygen (~100MB)
sudo apt-get install doxygen
Install Pdflatex
Install TexLive base (~150 MB)
sudo apt-get install texlive-latex-base
Install additional fonts (~1300MB)
sudo apt-get install texlive-fonts-recommended
sudo apt-get install texlive-fonts-extra
Matrix transpose problem
The matrix transpose is a basic operation in linear algebra, widely used in many applications.
Matrix transpose: given a matrix `A` with `ny` rows and `nx` columns, the transposed matrix `B = Aᵀ` has `nx` rows and `ny` columns, with `B[x][y] = A[y][x]`.
The following is a host-based (serial CPU) implementation of the transpose algorithm using single-precision floating point values. Suppose the matrix is stored in a 1D array in row-major order:

```c
for (int iy = 0; iy < ny; iy++)
    for (int ix = 0; ix < nx; ix++)
        transposed[ix * ny + iy] = original[iy * nx + ix];
```
As you can see, it is a very simple algorithm: just a matter of exchanging x with y.

However, even though the algorithm is simple, optimizing it for a GPU is considerably harder. This application contains seven versions of the matrix transpose algorithm, tested on a Tegra X1. Results from other GPUs are shown only to stress the fact that different architectures may behave differently.
Setting an upper and lower bound
Before looking at the transpose kernels, note that the file common.cu contains two kernels called copyRow and copyCol. They are used to calculate rough upper and lower bounds (in terms of bandwidth) for all transpose kernels, and in general for all kernels operating on matrices. Bandwidth (GB/s or MB/s) is a widely used metric on GPUs: it measures how fast a kernel is able to process data.
- copyRow: copies a matrix by loading and storing rows; this is an upper bound because all memory operations are coalesced.
- copyCol: copies a matrix by loading and storing columns; this is the worst-case scenario for matrix operations because all memory accesses are strided.
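A minimal sketch of what these two copy kernels might look like (the actual implementations are in common.cu; the code below is illustrative, not copied from the project):

```cuda
// Upper bound: adjacent threads touch adjacent addresses (coalesced).
__global__ void copyRow(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < nx && iy < ny)
        out[iy * nx + ix] = in[iy * nx + ix]; // row-major walk, coalesced
}

// Lower bound: adjacent threads are ny elements apart (strided).
__global__ void copyCol(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < nx && iy < ny)
        out[ix * ny + iy] = in[ix * ny + iy]; // column walk, strided
}
```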
Theoretical Peak bandwidth vs Effective Bandwidth
There are two types of bandwidth: <br>
- Theoretical
- Effective
Theoretical bandwidth is the absolute maximum bandwidth achievable with the hardware, usually declared by the company that sells the GPU, and certainly one of the most important parameters of a graphics card. Examples:
- Fermi M2090: 177.6 GB/s
- Tesla M2070: 150 GB/s (Launch Price 2011: 3099 USD)
Effective bandwidth is the bandwidth a kernel actually achieves; it can be calculated with the following equation:

```
effBw = (bytes_read + bytes_written) * 10^-9 / time_elapsed    [GB/s]
```
The following measurements come from a Fermi M2090 with a 16x16 block and a 2048x2048 matrix. They provide an upper bound, which is about 70% of the theoretical peak bandwidth, and a lower bound, which is about 33% of the theoretical peak:

Matrix Transpose Kernels implementation and CUDA profiler
From now on, the results will be analyzed using the nvprof command line profiler on a Jetson Nano board.
1) transposeNaiveRow kernel
This is the first transpose kernel, a straightforward port of the host implementation (the following code can be found in matrixTranspose.cu):
```cuda
__global__ void transposeNaiveRow(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;
    out[ix * ny + iy] = in[iy * nx + ix]; // coalesced load, strided store
}
```
Here are the most interesting profiler results with an 8192x8192 matrix and a 16x16 block:
- global memory usage
| gld_efficiency | gst_efficiency | gld_transactions | gst_transactions |
| :------------: | :------------: | :--------------: | :--------------: |
| 100 % | 12.5 % | 2097154 | 4194304 |
- shared memory usage
| shared_load_transactions_per_request | shared_store_transactions_per_request | shared_efficiency |
| :----------------------------------: | :-----------------------------------: | :---------------: |
| 0 | 0 | 0 |
- occupancy and branch efficiency

| achieved_occupancy | branch_efficiency |
| :----------------: | :---------------: |
| 0.84 | 100% |
- execution time = 0.035000 s
From these results it is possible to see that memory is used poorly, because of the strided global store operations.
2) transposeNaiveCol kernel
This column-based kernel is obtained from the previous one by simply exchanging the read and write indices (the following code can be found in matrixTranspose.cu):
```cuda
__global__ void transposeNaiveCol(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;
    out[iy * nx + ix] = in[ix * ny + iy]; // strided load (L1-cached), coalesced store
}
```
Here are the most interesting profiler results with an 8192x8192 matrix and a 16x16 block:
- global memory usage
| gld_efficiency | gst_efficiency | gld_transactions | gst_transactions |
| :------------: | :------------: | :--------------: | :--------------: |
| 25 % | 100 % | 2097154 | 524288 |
- shared memory usage
| shared_load_transactions_per_request | shared_store_transactions_per_request | shared_efficiency |
| :----------------------------------: | :-----------------------------------: | :---------------: |
| 0 | 0 | 0 |
- occupancy and branch efficiency

| achieved_occupancy | branch_efficiency |
| :----------------: | :---------------: |
| 0.83 | 100% |
- execution time = 0.020000 s
It is possible to see that this kernel is clearly faster than the previous one (transposeNaiveRow). There are two important differences:
- gst_efficiency (global store efficiency) improved from 12.5% to 100%, because the store operations are now coalesced; as a consequence, gst_transactions also decreased to 1/8 of the previous value!
- the second difference concerns the global load efficiency (gld_efficiency) and the global load transactions (gld_transactions): even though the efficiency got worse (from 100% to 25%), why is the number of transactions still the same? This happens because the strided loads are cached in the fast on-chip L1 cache, so even though the strided access pattern is bad, L1 cache hits avoid transaction replays.
Note that store operations cannot be cached in L1, which is why this kernel is faster than the previous one!
3) transposeUnrolling kernels
The following two examples use a technique called "block unrolling". The goal of unrolling is to assign more independent work to each thread, in order to maximize the number of concurrent memory requests, maximize device occupancy, and avoid too many stalls in the CUDA cores (the following code can be found in matrixTranspose.cu).
Unrolled version of the transposeNaiveRow kernel:
```cuda
__global__ void transposeUnroll4Row(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x * 4 + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int ti = iy * nx + ix; // access in rows
    unsigned int to = ix * ny + iy; // access in columns

    if (ix + 3 * blockDim.x < nx && iy < ny)
    {
        out[to]                       = in[ti];
        out[to + ny * blockDim.x]     = in[ti + blockDim.x];
        out[to + ny * 2 * blockDim.x] = in[ti + 2 * blockDim.x];
        out[to + ny * 3 * blockDim.x] = in[ti + 3 * blockDim.x];
    }
}
```
- global memory usage
| gld_efficiency | gst_efficiency | gld_transactions | gst_transactions |
| :------------: | :------------: | :--------------: | :--------------: |
| 100 % | 12.5 % | 2097154 | 4194304 |
- shared memory usage
| shared_load_transactions_per_request | shared_store_transactions_per_request | shared_efficiency |
| :----------------------------------: | :-----------------------------------: | :---------------: |
| 0 | 0 | 0 |
- occupancy and branch efficiency

| achieved_occupancy | branch_efficiency |
| :----------------: | :---------------: |
| 0.86 | 100% |
- execution time = 0.075000 s
Worse than before! This is because the kernel now issues more independent operations, but they are still badly strided stores.
Unrolled version of the transposeNaiveCol kernel:
```cuda
__global__ void transposeUnroll4Col(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x * 4 + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int ti = iy * nx + ix; // access in rows
    unsigned int to = ix * ny + iy; // access in columns

    if (ix + 3 * blockDim.x < nx && iy < ny)
    {
        // read in columns (L1-cached strided loads), write in rows (coalesced stores)
        out[ti]                  = in[to];
        out[ti + blockDim.x]     = in[to + ny * blockDim.x];
        out[ti + 2 * blockDim.x] = in[to + ny * 2 * blockDim.x];
        out[ti + 3 * blockDim.x] = in[to + ny * 3 * blockDim.x];
    }
}
```