MatrixTransposeCUDA
A simple CUDA-C application for Nvidia GPUs
Install / Use
Matrix transpose algorithm in CUDA-C
Simple example of a typical workflow to implement and optimize an algorithm running on a GPU, with the support of Nvidia's `nvprof` command line profiling tool.

NOTE: this project has been tested on Nvidia's Tegra X1 GPU (Maxwell architecture); you might observe different behaviour if you run the code on a different GPU.

NOTE: the Nvidia CUDA Toolkit is required to be installed on your machine.

NOTE: the Makefile also supports Doxygen documentation, so Doxygen is required on your machine. The Doxyfile is configured to automatically generate HTML and LaTeX documentation. The LaTeX documentation can be converted to PDF using another Makefile in the docs/latex/ folder, but this requires installing additional tools. See the documentation section.
Additional tools
The Makefile integrates Doxygen, the de facto standard tool for generating documentation from annotated C, C++, C#, Python and other source code (more information at https://www.doxygen.nl/index.html).
HTML documentation only requires Doxygen.
LaTeX (PDF) documentation requires additional tools (not mandatory if you do not need to generate the reference manual as a PDF).
Install Doxygen
Install Doxygen (~100MB)
sudo apt-get install doxygen
Install Pdflatex
Install TexLive base (~150 MB)
sudo apt-get install texlive-latex-base
Install additional fonts (~1300MB)
sudo apt-get install texlive-fonts-recommended
sudo apt-get install texlive-fonts-extra
Matrix transpose problem
The matrix transpose is a basic operation in linear algebra, widely used in many applications.
Matrix transpose: given a matrix `A` with `ny` rows and `nx` columns, the transposed matrix `B = Aᵀ` has `nx` rows and `ny` columns, with `B[x][y] = A[y][x]`.
The following is a host-based (serial CPU) implementation of the transpose algorithm using single-precision floating point values. Suppose the matrix is stored in a 1D array in row-major order:

```c
for (int iy = 0; iy < ny; iy++)
    for (int ix = 0; ix < nx; ix++)
        transposed[ix * ny + iy] = original[iy * nx + ix];
```
As you can see, it is a very simple algorithm: just a matter of exchanging x with y.

However, even though the algorithm is simple, optimizing it for a GPU is considerably harder. This application contains seven versions of the matrix transpose algorithm, tested on a Tegra X1. Results from other GPUs are shown only to stress the fact that different architectures may behave differently.
Setting an upper and lower bound
Before looking at the transpose kernels, note that the file common.cu contains two kernels called copyRow and copyCol. They are used to calculate rough upper and lower bounds (in terms of bandwidth) for all transpose kernels, and in general for all kernels operating on matrices. Bandwidth (GB/s or MB/s) is a widely used metric on GPUs: it measures how fast a kernel is able to process data.
- copyRow: copies a matrix by loading and storing rows; this is an upper bound because all memory operations are coalesced.
- copyCol: copies a matrix by loading and storing columns; this is the worst-case scenario for matrix operations because all memory accesses are strided.
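A minimal sketch of what these two copy kernels might look like (the actual implementations are in common.cu; the code below is illustrative, not copied from the project):

```cuda
// Upper bound: adjacent threads touch adjacent addresses (coalesced).
__global__ void copyRow(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < nx && iy < ny)
        out[iy * nx + ix] = in[iy * nx + ix]; // row-major walk, coalesced
}

// Lower bound: adjacent threads are ny elements apart (strided).
__global__ void copyCol(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < nx && iy < ny)
        out[ix * ny + iy] = in[ix * ny + iy]; // column walk, strided
}
```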
Theoretical Peak bandwidth vs Effective Bandwidth
There are two types of bandwidth: <br>
- Theoretical
- Effective
Theoretical bandwidth is the absolute maximum bandwidth achievable with the hardware, usually declared by the company that sells the GPU, and certainly one of the most important parameters of a graphics card. Examples:
- Fermi M2090: 177.6 GB/s
- Tesla M2070: 150 GB/s (Launch Price 2011: 3099 USD)
Effective bandwidth is the bandwidth a kernel actually achieves; it can be calculated with the following equation:

```
effBw = (bytes_read + bytes_written) * 10^-9 / time_elapsed    [GB/s]
```
The following measurements come from a Fermi M2090 with a 16x16 block and a 2048x2048 matrix. They provide an upper bound, which is about 70% of the theoretical peak bandwidth, and a lower bound, which is about 33% of the theoretical peak:

Matrix Transpose Kernels implementation and CUDA profiler
From now on, the results will be analyzed using the nvprof command line profiler on a Jetson Nano board.
1) transposeNaiveRow kernel
This is the first transpose kernel, a straightforward port of the host implementation (the following code can be found in matrixTranspose.cu):
```cuda
__global__ void transposeNaiveRow(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;
    out[ix * ny + iy] = in[iy * nx + ix]; // coalesced load, strided store
}
```
Here are the most interesting profiler results with an 8192x8192 matrix and a 16x16 block:
- global memory usage
| gld_efficiency | gst_efficiency | gld_transactions | gst_transactions |
| :------------: | :------------: | :--------------: | :--------------: |
| 100 % | 12.5 % | 2097154 | 4194304 |
- shared memory usage
| shared_load_transactions_per_request | shared_store_transactions_per_request | shared_efficiency |
| :----------------------------------: | :-----------------------------------: | :---------------: |
| 0 | 0 | 0 |
- occupancy and branch efficiency

| achieved_occupancy | branch_efficiency |
| :----------------: | :---------------: |
| 0.84 | 100% |
- execution time = 0.035000 s
From these results it is possible to see that memory is used poorly, because of the strided global store operations.
2) transposeNaiveCol kernel
This column-based kernel is obtained from the previous one by simply exchanging the read and write indices (the following code can be found in matrixTranspose.cu):
```cuda
__global__ void transposeNaiveCol(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;
    out[iy * nx + ix] = in[ix * ny + iy]; // strided load (L1-cached), coalesced store
}
```
Here are the most interesting profiler results with an 8192x8192 matrix and a 16x16 block:
- global memory usage
| gld_efficiency | gst_efficiency | gld_transactions | gst_transactions |
| :------------: | :------------: | :--------------: | :--------------: |
| 25 % | 100 % | 2097154 | 524288 |
- shared memory usage
| shared_load_transactions_per_request | shared_store_transactions_per_request | shared_efficiency |
| :----------------------------------: | :-----------------------------------: | :---------------: |
| 0 | 0 | 0 |
- occupancy and branch efficiency

| achieved_occupancy | branch_efficiency |
| :----------------: | :---------------: |
| 0.83 | 100% |
- execution time = 0.020000 s
It is possible to see that this kernel is clearly faster than the previous one (transposeNaiveRow). There are two important differences:
- gst_efficiency (global store efficiency) improved from 12.5% to 100%, because the store operations are now coalesced; as a consequence, gst_transactions also decreased to 1/8 of the previous value!
- the second difference concerns the global load efficiency (gld_efficiency) and the global load transactions (gld_transactions): even though the efficiency got worse (from 100% to 25%), why is the number of transactions still the same? This happens because the strided loads are cached in the fast on-chip L1 cache, so even though the strided access pattern is bad, L1 cache hits avoid transaction replays.
Note that store operations cannot be cached in L1, which is why this kernel is faster than the previous one!
3) transposeUnrolling kernels
The following two examples use a technique called "block unrolling". The goal of unrolling is to assign more independent work to each thread, in order to maximize the number of concurrent memory requests, maximize device occupancy, and avoid too many stalls in the CUDA cores (the following code can be found in matrixTranspose.cu).
Unrolled version of the transposeNaiveRow kernel:
```cuda
__global__ void transposeUnroll4Row(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x * 4 + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int ti = iy * nx + ix; // access in rows
    unsigned int to = ix * ny + iy; // access in columns

    if (ix + 3 * blockDim.x < nx && iy < ny)
    {
        out[to]                       = in[ti];
        out[to + ny * blockDim.x]     = in[ti + blockDim.x];
        out[to + ny * 2 * blockDim.x] = in[ti + 2 * blockDim.x];
        out[to + ny * 3 * blockDim.x] = in[ti + 3 * blockDim.x];
    }
}
```
- global memory usage
| gld_efficiency | gst_efficiency | gld_transactions | gst_transactions |
| :------------: | :------------: | :--------------: | :--------------: |
| 100 % | 12.5 % | 2097154 | 4194304 |
- shared memory usage
| shared_load_transactions_per_request | shared_store_transactions_per_request | shared_efficiency |
| :----------------------------------: | :-----------------------------------: | :---------------: |
| 0 | 0 | 0 |
- occupancy and branch efficiency

| achieved_occupancy | branch_efficiency |
| :----------------: | :---------------: |
| 0.86 | 100% |
- execution time = 0.075000 s
Worse than before! This is because the kernel now issues more independent operations, but they are still badly strided stores.
Unrolled version of the transposeNaiveCol kernel:
```cuda
__global__ void transposeUnroll4Col(float *in, float *out, unsigned int nx, unsigned int ny)
{
    unsigned int ix = blockDim.x * blockIdx.x * 4 + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int ti = iy * nx + ix; // access in rows
    unsigned int to = ix * ny + iy; // access in columns

    if (ix + 3 * blockDim.x < nx && iy < ny)
    {
        // read in columns (L1-cached strided loads), write in rows (coalesced stores)
        out[ti]                  = in[to];
        out[ti + blockDim.x]     = in[to + ny * blockDim.x];
        out[ti + 2 * blockDim.x] = in[to + ny * 2 * blockDim.x];
        out[ti + 3 * blockDim.x] = in[to + ny * 3 * blockDim.x];
    }
}
```