Assignment 1: Performance Analysis on a Quad-Core CPU

Due Mon Oct 6, 11:59pm

100 points total + 6 points extra credit

Overview

This assignment is intended to help you develop an understanding of the two primary forms of parallel execution present in a modern multi-core CPU:

SIMD execution within a single processing core
Parallel execution using multiple cores (You'll see effects of Intel Hyper-threading as well.)

You will also gain experience measuring and reasoning about the performance of parallel programs (a challenging, but important, skill you will use throughout this class). This assignment involves only a small amount of programming, but a lot of analysis!

Environment Setup

You will need to run code on the new myth machines for this assignment (Hostnames for these machines are myth[51-66].stanford.edu). If for some reason you do not have a home directory on the myth machines, submit a HelpSU ticket here.

These machines contain four-core 4.2 GHz Intel Core i7 processors (although dynamic frequency scaling can take them to 4.5 GHz when the chip decides it is useful and possible to do so). Each core in the processor supports two hardware threads (Intel calls this "Hyper-Threading") and the cores can execute AVX2 vector instructions which describe simultaneous execution of the same eight-wide SIMD operation on a vector of eight single-precision values. For the curious, a complete specification for this CPU can be found at https://www.intel.com/content/www/us/en/products/sku/97129/intel-core-i77700k-processor-8m-cache-up-to-4-50-ghz/specifications.html. Students that want to dig deeper might enjoy this writeup.

Note: For grading purposes, we expect you to report on the performance of code run on the Stanford myth machines, however for kicks, you may also want to run the programs in this assignment on your own machine. (You will first need to install the Intel SPMD Program Compiler (ISPC) available here: http://ispc.github.io/). Feel free to include your findings from running code on other machines in your report as well, just be very clear what machine you were running on.

To get started:

ISPC is needed to compile many of the programs used in this assignment. ISPC can be easily installed on the myth machines through the following steps:

From a myth machine, download the linux binary into a local directory of your choice. You can get ISPC compiler binaries for Linux from the ISPC downloads page. From myth, we recommend you use wget to directly download the binary from the downloads page. As of Fall 2025 Week 1, the wget line below works:

wget https://github.com/ispc/ispc/releases/download/v1.28.1/ispc-v1.28.1-linux.tar.gz

Untar the downloaded file: tar -xvf ispc-v1.28.1-linux.tar.gz

Add the ISPC bin directory to your system path. For example, if untarring the downloaded file produces the directory ~/Downloads/ispc-v1.28.1-linux, in bash you'd update your path variable with:

export PATH=$PATH:${HOME}/Downloads/ispc-v1.28.1-linux/bin

The above line can be added to your .bashrc file for permanence.

If you are using csh, you'll update your PATH using setenv. A quick Google search will teach you how.

The assignment starter code is available on https://github.com/stanford-cs149/asst1. Please clone the Assignment 1 starter code using:

git clone https://github.com/stanford-cs149/asst1.git

Program 1: Parallel Fractal Generation Using Threads (20 points)

Build and run the code in the prog1_mandelbrot_threads/ directory of the code base. (Type make to build, and ./mandelbrot to run it.) This program produces the image file mandelbrot-serial.ppm, which is a visualization of a famous set of complex numbers called the Mandelbrot set. Most platforms have a .ppm view. To view the resulting images remotely, first make sure that you have X server. No downloads are needed for Linux systems. However, for Mac you can use Xquartz and for Windows you can use VcXsrv. After you have SSH X-Forwarding support, make sure you ssh -Y onto a myth machine and you can then view the images using the display command. As you can see in the images below, the result is a familiar and beautiful fractal. Each pixel in the image corresponds to a value in the complex plane, and the brightness of each pixel is proportional to the computational cost of determining whether the value is contained in the Mandelbrot set. To get image 2, use the command option --view 2. (See function mandelbrotSerial() defined in mandelbrotSerial.cpp). You can learn more about the definition of the Mandelbrot set at http://en.wikipedia.org/wiki/Mandelbrot_set.

Mandelbrot Set

Your job is to parallelize the computation of the images using std::thread. Starter code that spawns one additional thread is provided in the function mandelbrotThread() located in mandelbrotThread.cpp. In this function, the main application thread creates another additional thread using the constructor std::thread(function, args...) It waits for this thread to complete by calling join on the thread object. Currently the launched thread does not do any computation and returns immediately. You should add code to workerThreadStart function to accomplish this task. You will not need to make use of any other std::thread API calls in this assignment.

What you need to do:

Modify the starter code to parallelize the Mandelbrot generation using two processors. Specifically, compute the top half of the image in thread 0, and the bottom half of the image in thread 1. This type of problem decomposition is referred to as spatial decomposition since different spatial regions of the image are computed by different processors.
Extend your code to use 2, 3, 4, 5, 6, 7, and 8 threads, partitioning the image generation work accordingly (threads should get blocks of the image). Note that the processor only has four cores but each core supports two hyper-threads, so it can execute a total of eight threads interleaved on its execution contents. In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used? In your writeup hypothesize why this is (or is not) the case? (you may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread datapoint.)
To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart(). How do your measurements explain the speedup graph you previously created?
Modify the mapping of work to threads to achieve to improve speedup to at about 7-8x on both views of the Mandelbrot set (if you're above 7x that's fine, don't sweat it). You may not use any synchronization between threads in your solution. We are expecting you to come up with a single work decomposition policy that will work well for all thread counts---hard coding a solution specific to each configuration is not allowed! (Hint: There is a very simple static assignment that will achieve this goal, and no communication/synchronization among threads is necessary.). In your writeup, describe your approach to parallelization and report the final 8-thread speedup obtained.
Now run your improved code with 16 threads. Is performance noticably greater than when running with eight threads? Why or why not?

Program 2: Vectorizing Code Using SIMD Intrinsics (20 points)

Take a look at the function clampedExpSerial in prog2_vecintrin/main.cpp of the Assignment 1 code base. The clampedExp() function raises values[i] to the power given by exponents[i] for all elements of the input array and clamps the resulting values at 9.999999. In program 2, your job is to vectorize this piece of code so it can be run on a machine with SIMD vector instructions.

However, rather than craft an implementation using SSE or AVX2 vector intrinsics that map to real SIMD vector instructions on modern CPUs, to make things a little easier, we're asking you to implement your version using CS149's "fake vector intrinsics" defined in CS149intrin.h. The CS149intrin.h library provides you with a set of vector instructions that operate on vector values and/or vector masks. (These functions don't translate to real CPU vector instructions, instead we simulate these operations for you in our library, and provide feedback that makes for easier debugging.) As an example of using the CS149 intrinsics, a vectorized version of the abs() function is given in main.cpp. This example contains some basic vector loads and stores and manipulates mask registers. Note that the abs() example is only a simple example, and in fact the code does not correctly handle all inputs! (We will let you figure out why!) You may wish to read through all the comments and function definitions in CS149intrin.h to know what operations are available to you.

Here are few hints to help you in your implementation:

Every vector instruction is subject to an optional mask parameter. The mask parameter defines which lanes whose output is "mask

Asst1

Install / Use

README

Assignment 1: Performance Analysis on a Quad-Core CPU

Overview

Environment Setup

Program 1: Parallel Fractal Generation Using Threads (20 points)

Program 2: Vectorizing Code Using SIMD Intrinsics (20 points)