Stanford CS149 -- Assignment 3
Assignment 3: A Simple CUDA Renderer
Due: Thursday Oct 30, 11:59PM PST
100 points total

Overview
In this assignment you will write a parallel renderer in CUDA that draws colored circles. While this renderer is very simple, parallelizing the renderer will require you to design and implement data structures that can be efficiently constructed and manipulated in parallel. This is a challenging assignment so you are advised to start early. Seriously, you are advised to start early. Good luck!
Environment Setup
- You will collect results (i.e., run performance tests) for this assignment on GPU-enabled VMs on Amazon Web Services (AWS). Please follow the instructions in `cloud_readme.md` for setting up a machine to run the assignment.
- Download the assignment starter code from the course GitHub using:

  ```
  git clone https://github.com/stanford-cs149/asst3
  ```
The CUDA C Programmer's Guide (PDF version or web version) is an excellent reference for learning how to program in CUDA. There is a wealth of CUDA tutorials and SDK examples on the web (just Google!) and on the NVIDIA developer site. In particular, you may enjoy the free Udacity course Introduction to Parallel Programming in CUDA.
Table 21 in the CUDA C Programming Guide is a handy reference for the maximum number of CUDA threads per thread block, size of thread block, shared memory, etc., for the NVIDIA T4 GPUs you will use in this assignment. NVIDIA T4 GPUs support CUDA compute capability 7.5.
For C++ questions (like what does the virtual keyword mean), the C++ Super-FAQ is a great resource that explains things in a way that's detailed yet easy to understand (unlike a lot of C++ resources), and was co-written by Bjarne Stroustrup, the creator of C++!
Part 1: CUDA Warm-Up 1: SAXPY (5 pts)
To gain a bit of practice writing CUDA programs your warm-up task is to re-implement the SAXPY function
from Assignment 1 in CUDA. Starter code for this part of the assignment is located in the /saxpy directory
of the assignment repository. You can build and run the SAXPY CUDA program by running `make` and then `./cudaSaxpy` in the `/saxpy` directory.
Please finish off the implementation of SAXPY in the function `saxpyCuda` in `saxpy.cu`. You will need to allocate device global memory arrays and copy the contents of the host input arrays `X`, `Y`, and `result` into CUDA device memory prior to performing the computation. After the CUDA computation is complete, the result must be copied back into host memory. Please see the definition of the `cudaMemcpy` function in Section 3.2.2 of the Programmer's Guide (web version), or take a look at the helpful tutorial pointed to in the assignment starter code.
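As a rough guide, the pieces involved look like the following sketch (error checking omitted; the exact variable names in the starter code may differ, and you still need to add the timing described below in this handout):

```cpp
// Kernel: one thread computes one element of the result.
__global__ void saxpy_kernel(int N, float alpha, float* x, float* y, float* result) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (index < N)  // guard: the grid may contain more threads than elements
        result[index] = alpha * x[index] + y[index];
}

void saxpyCuda(int N, float alpha, float* xarray, float* yarray, float* resultarray) {
    const int threadsPerBlock = 512;
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up

    // allocate device global memory for inputs and output
    float *device_x, *device_y, *device_result;
    cudaMalloc(&device_x, N * sizeof(float));
    cudaMalloc(&device_y, N * sizeof(float));
    cudaMalloc(&device_result, N * sizeof(float));

    // copy host input arrays to the device
    cudaMemcpy(device_x, xarray, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(device_y, yarray, N * sizeof(float), cudaMemcpyHostToDevice);

    // launch the kernel, then wait for all GPU work to finish
    saxpy_kernel<<<blocks, threadsPerBlock>>>(N, alpha, device_x, device_y, device_result);
    cudaDeviceSynchronize();

    // copy the result back to host memory and clean up
    cudaMemcpy(resultarray, device_result, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device_x);
    cudaFree(device_y);
    cudaFree(device_result);
}
```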
As part of your implementation, add timers around the CUDA kernel invocation in saxpyCuda. After your additions, your program should time two executions:
- The provided starter code contains timers that measure the entire process of copying data to the GPU, running the kernel, and copying data back to the CPU.
- You should also insert timers that measure only the time taken to run the kernel. (They should not include the time of CPU-to-GPU data transfer or the transfer of results from the GPU back to the CPU.)
When adding your timing code in the latter case, you'll need to be careful: by default, a CUDA kernel's execution on the GPU is asynchronous with the main application thread running on the CPU. For example, if you write code that looks like this:
```cpp
double startTime = CycleTimer::currentSeconds();
saxpy_kernel<<<blocks, threadsPerBlock>>>(N, alpha, device_x, device_y, device_result);
double endTime = CycleTimer::currentSeconds();
```
You'll measure a kernel execution time that seems amazingly fast! (That's because you are only timing the cost of the API call itself, not the cost of actually executing the resulting computation on the GPU.)
Therefore, you will want to place a call to `cudaDeviceSynchronize()` following the kernel call to wait for completion of all CUDA work on the GPU. This call to `cudaDeviceSynchronize()` returns when all prior CUDA work on the GPU has completed. Note that `cudaDeviceSynchronize()` is not necessary after the `cudaMemcpy()` to ensure the memory transfer to the GPU is complete, since `cudaMemcpy()` is synchronous under the conditions we are using it. (For those who wish to know more, see this documentation.)
```cpp
double startTime = CycleTimer::currentSeconds();
saxpy_kernel<<<blocks, threadsPerBlock>>>(N, alpha, device_x, device_y, device_result);
cudaDeviceSynchronize();
double endTime = CycleTimer::currentSeconds();
```
Note that in your measurements that include the time to transfer data to and from the CPU, a call to `cudaDeviceSynchronize()` is not necessary before the final timer (after your call to `cudaMemcpy()` that returns data to the CPU), because `cudaMemcpy()` will not return to the calling thread until after the copy is complete.
Question 1. What performance do you observe compared to the sequential CPU-based implementation of SAXPY (recall your results from Program 5, saxpy, of Assignment 1)?
Question 2. Compare and explain the difference between the results provided by the two sets of timers (timing only the kernel execution vs. timing the entire process of moving data to the GPU and back in addition to the kernel execution). Are the bandwidth values observed roughly consistent with the reported bandwidths available to the different components of the machine? (You should use the web to track down the memory bandwidth of an NVIDIA T4 GPU. Hint: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf. The expected bandwidth of the AWS machine's CPU-to-GPU memory bus is 5.3 GB/s, which does not match that of a 16-lane PCIe 3.0 link. Several factors can prevent the transfer from reaching peak bandwidth, including CPU motherboard chipset performance and whether or not the host memory used as the source of the transfer is "pinned"; pinned memory allows the GPU to access host memory directly, without going through virtual memory address translation. If you are interested, you can find more info here: https://kth.instructure.com/courses/12406/pages/optimizing-host-device-data-communication-i-pinned-host-memory)
Part 2: CUDA Warm-Up 2: Parallel Prefix-Sum (10 pts)
Now that you're familiar with the basic structure and layout of CUDA programs, as a second exercise you are asked to come up with a parallel implementation of the function `find_repeats` which, given a list of integers `A`, returns a list of all indices `i` for which `A[i] == A[i+1]`.
For example, given the array `{1,2,2,1,1,1,3,5,3,3}`, your program should output the array `{1,3,4,8}`.
Exclusive Prefix Sum
We want you to implement `find_repeats` by first implementing a parallel exclusive prefix-sum operation.
Exclusive prefix sum takes an array `A` and produces a new array `output` that has, at each index `i`, the sum of all elements up to but not including `A[i]`. For example, given the array `A={1,4,6,8,2}`, the output of exclusive prefix sum is `output={0,1,5,11,19}`.
The following "C-like" code is an iterative version of scan. In the pseudocode below, we use `parallel_for` to indicate potentially parallel loops. This is the same algorithm we discussed in class: https://gfxcourses.stanford.edu/cs149/fall25/lecture/dataparallel/slide_17
```cpp
void exclusive_scan_iterative(int* start, int* end, int* output) {
    int N = end - start;
    memmove(output, start, N*sizeof(int));

    // upsweep phase
    for (int two_d = 1; two_d <= N/2; two_d *= 2) {
        int two_dplus1 = 2*two_d;
        parallel_for (int i = 0; i < N; i += two_dplus1) {
            output[i+two_dplus1-1] += output[i+two_d-1];
        }
    }

    output[N-1] = 0;

    // downsweep phase
    for (int two_d = N/2; two_d >= 1; two_d /= 2) {
        int two_dplus1 = 2*two_d;
        parallel_for (int i = 0; i < N; i += two_dplus1) {
            int t = output[i+two_d-1];
            output[i+two_d-1] = output[i+two_dplus1-1];
            output[i+two_dplus1-1] += t;
        }
    }
}
```
We would like you to use this algorithm to implement a version of parallel prefix sum in CUDA. You must implement the `exclusive_scan` function in `scan/scan.cu`. Your implementation will consist of both host and device code. The implementation will require multiple CUDA kernel launches (one for each `parallel_for` loop in the pseudocode above).
Note: The reference scan implementation above assumes that the input array's length (`N`) is a power of 2. In the `cudaScan` function in the starter code, we solve this problem by rounding the input array length up to the next power of 2 when allocating the corresponding buffers on the GPU. However, the code only copies back `N` elements from the GPU buffer to the CPU buffer. This fact should simplify your CUDA implementation.
Compilation produces the binary `cudaScan`. Command-line usage is as follows:

```
Usage: ./cudaScan [options]
Program Options:
    -m  --test <TYPE>      Run specified function on input. Valid tests are: scan, find_repeats (default: scan)
    -i  --input <NAME>     Run test on given input type. Valid inputs are: ones, random (default: random)
    -n  --arraysize <INT>  Number of elements in arrays
    -t  --thrust           Use Thrust library implementation
    -?  --help             This message
```
Implementing "Find Repeats" Using Prefix Sum
Once you have written `exclu
