Homework #2 (Gaussian Elimination)

Due: Friday, February 9, 2024

CIS-4320

All students, undergraduate and graduate, should do the following:

  1. Modify the serial version of the Gaussian elimination program to parallelize it using pthreads, following the discussion in class and the Solarium example. Use a maximum of eight threads. The zip archive GaussianC-VLA.zip is an Eclipse project containing the serial version using C99-style variable length arrays. The makefiles for the release and debug versions were generated by Eclipse, but you don't need Eclipse to run them. Use "make all" in the appropriate folder to build each configuration.

    For this assignment, create a team of threads for each iteration of the outermost loop and let that team do the row manipulations below the ith row. Assign a contiguous block of rows to each thread. Be sure to join with all the threads before proceeding to the next iteration of i. Yes, this causes threads to be created and destroyed many times. We will address that issue in later assignments.

  2. The file GaussianData.zip contains some test data in a suitable format.

    On Lemuria, measure the speed-up of your program relative to the serial version. Measure the speed-ups for problem sizes from 1000x1000 to 2000x2000 (inclusive) using the test data provided. Make a graph showing speed-up vs. problem size.

    When estimating the speed-up, be sure nobody else is making heavy use of the machine at the same time. Otherwise, the OS will multitask between users, making your results difficult to interpret. Also, it is best to run both the serial version and the parallel version three times, choosing the fastest run of the three (not the average) as your official result. The fastest run is most reflective of what the hardware is capable of doing without interference from the system.

    Be sure to use the "Release" version in your testing!

  3. As the problem size increases, you might expect the speed-up to improve since the overhead of creating and destroying the threads will be proportionally smaller as the amount of work the threads have to do increases. However, for the larger problem sizes, the matrix of coefficients will exceed the size of the processor's memory cache (12MB/CPU on Lemuria), and memory access efficiency will decline. Do you see that effect? What problem size causes the matrix of coefficients to be the same size as the processor cache? Keep in mind that Lemuria has two processors.

  4. Try the 5000x5000 problem and compute the speed-up relative to the serial version. You don't need to add this value to your graph, but does it follow the trend you are seeing?

  5. Gaussian elimination is a cubic-time algorithm. Thus, the 2000x2000 problem should take 2³ = 8 times longer than the 1000x1000 problem, and the 5000x5000 problem should take 5³ = 125 times longer than the 1000x1000 problem. Was this true for you?
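The per-iteration thread team described in step 1 can be sketched as follows. This is a minimal illustration, not the course code: the names (worker_args, eliminate_rows), the flattened row-major matrix, and the absence of pivoting are all simplifying assumptions.

```c
#include <pthread.h>

#define MAX_THREADS 8

/* Illustrative per-thread argument block; field names are assumptions. */
struct worker_args {
    int n;           /* problem size */
    int i;           /* current pivot row */
    int first_row;   /* first row (inclusive) this thread handles */
    int last_row;    /* last row (exclusive) */
    double *a;       /* n x n coefficient matrix, row-major */
    double *b;       /* right-hand side */
};

/* Eliminate column i from this thread's contiguous block of rows. */
static void *eliminate_rows(void *arg)
{
    struct worker_args *w = arg;
    int n = w->n, i = w->i;
    for (int r = w->first_row; r < w->last_row; r++) {
        double factor = w->a[r * n + i] / w->a[i * n + i];
        for (int c = i; c < n; c++)
            w->a[r * n + c] -= factor * w->a[i * n + c];
        w->b[r] -= factor * w->b[i];
    }
    return NULL;
}

/* Forward elimination: one team of threads per pivot, joined every iteration. */
static void gaussian_eliminate(int n, double *a, double *b)
{
    pthread_t threads[MAX_THREADS];
    struct worker_args args[MAX_THREADS];

    for (int i = 0; i < n - 1; i++) {
        int rows_below = n - (i + 1);
        int thread_count = rows_below < MAX_THREADS ? rows_below : MAX_THREADS;
        int chunk = (rows_below + thread_count - 1) / thread_count;

        for (int t = 0; t < thread_count; t++) {
            args[t].n = n;
            args[t].i = i;
            args[t].first_row = i + 1 + t * chunk;
            args[t].last_row = args[t].first_row + chunk;
            if (args[t].last_row > n) args[t].last_row = n;
            args[t].a = a;
            args[t].b = b;
            pthread_create(&threads[t], NULL, eliminate_rows, &args[t]);
        }
        /* Join before advancing the pivot so row i+1 is fully updated. */
        for (int t = 0; t < thread_count; t++)
            pthread_join(threads[t], NULL);
    }
}
```

Note how every iteration both creates and joins its whole team; as the assignment says, the resulting thread churn is accepted for now.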

Information about the processors in Lemuria is here. Note that Lemuria contains two of these processors (and thus two processor caches).

OPTIONAL: Consider running the experiment on a cluster node to see if and how it behaves differently. Note that the cluster node processors are these.

Keep these programs handy for future reference. In later assignments, I will ask you to make further improvements to this program, and you will want the version you write here to use as a comparison.

CIS-5230

If you are a graduate student, you should also do the following:

To improve the memory access patterns of the program, let's try a simple device. Each time the threads do a round of work, have them switch directions. For example, during the first iteration of the outermost loop, the threads should process rows from top to bottom. Then during the next iteration, the threads should process rows from bottom to top, and continue alternating directions from iteration to iteration. Thus, the data loaded into the cache at the end of iteration i will be immediately available at the beginning of iteration i+1.
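The alternation can be sketched as follows, here for the serial version and using the assignment's C99 variable length arrays; the variable names are illustrative and pivoting is omitted:

```c
/* Serial forward elimination that alternates sweep direction each pivot,
   so rows still warm in the cache at the end of iteration i are the first
   rows touched at the start of iteration i+1. A sketch, not course code. */
static void gaussian_eliminate_alternating(int n, double a[n][n], double b[n])
{
    for (int i = 0; i < n - 1; i++) {
        int top_to_bottom = (i % 2 == 0);
        int start = top_to_bottom ? i + 1 : n - 1;
        int stop  = top_to_bottom ? n     : i;    /* exclusive bound */
        int step  = top_to_bottom ? 1     : -1;

        for (int r = start; r != stop; r += step) {
            double factor = a[r][i] / a[i][i];
            for (int c = i; c < n; c++)
                a[r][c] -= factor * a[i][c];
            b[r] -= factor * b[i];
        }
    }
}
```

The same direction flag can drive the block assignment in the parallel version, with each thread walking its block in the current direction.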

Modify both the serial and parallel versions of your program to use this approach and re-measure the performance and speed-ups. Do you observe any improvement? You would expect improvement, if any, to occur at the larger problem sizes, where the matrix of coefficients is too large for the cache. Does that happen? [Note: When I did this, I saw some improvement, but it was quite modest.]

Submission

Submit your modified gaussian.c file along with the results of your timings, graphs, and associated comments in a zip archive to Canvas. If you are a graduate student, you only need to submit your final program (the one that alternates directions).


Last Revised: 2024-01-30
© Copyright 2024 by Peter Chapin <peter.chapin@vermontstate.edu>