Due: Friday, May 3, 2024
Reading: Information on block level shared memory.
Create a version of the linear equations sample program that uses CUDA instead of POSIX threads to solve the Gaussian elimination problem.
Try to organize your program so that a single GPU thread handles each row being processed. Does this seem like an effective strategy for using CUDA?
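The one-thread-per-row idea can be sketched roughly as follows. This is a minimal illustration, not a required design: it assumes the coefficient matrix `a` (n x n, row-major) and the driving vector `b` already live in device global memory, and that the host launches the kernel once per pivot row `k`.

```cuda
// Illustrative elimination kernel: one thread per row below the pivot.
// a: n x n coefficient matrix (row-major), b: driving vector, k: pivot row.
__global__ void eliminate(double *a, double *b, int n, int k)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row > k && row < n) {
        // Factor that zeroes column k of this row.
        double factor = a[row * n + k] / a[k * n + k];
        for (int col = k; col < n; col++)
            a[row * n + col] -= factor * a[k * n + col];
        b[row] -= factor * b[k];
    }
}
```

A host loop such as `for (k = 0; k < n - 1; k++) eliminate<<<blocks, threads>>>(d_a, d_b, n, k);` would drive the forward elimination, with back substitution done afterwards.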
To directly compare times with the other versions of this program we created, use the type double if possible. Be aware that some older CUDA-capable GPUs may not support double.
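You can check at runtime whether the device supports double precision; support requires compute capability 1.3 or higher. A small host-side check (illustrative only):

```cuda
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // double is supported on compute capability 1.3 and later
    if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3))
        printf("double supported (compute capability %d.%d)\n",
               prop.major, prop.minor);
    else
        printf("no double support; fall back to float\n");
    return 0;
}
```

Also be aware that older versions of nvcc silently demote double to float unless you compile for an architecture that supports it (e.g. `-arch=sm_13` or later).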
You can use global device memory (visible to all threads) for the first version of the program. However, performance might improve if you modify your program to use block-level shared memory to accelerate the computation. The "base" row read by all threads is a good candidate for caching in shared memory. Be aware, though, that the amount of shared memory available is limited (about 48 KB per block on the cards we are using).
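One way to cache the base row, sketched under the same assumptions as before (row-major matrix `a`, vector `b`, pivot row `k`; names are illustrative): the threads of each block cooperatively copy the pivot row into shared memory before doing their row updates.

```cuda
// Variant that caches the pivot ("base") row in block-level shared memory.
// Assumes n * sizeof(double) fits in the ~48 KB available per block.
__global__ void eliminate_shared(double *a, double *b, int n, int k)
{
    extern __shared__ double base[];   // pivot row, length n

    // Threads in this block cooperatively load row k into shared memory.
    for (int col = threadIdx.x; col < n; col += blockDim.x)
        base[col] = a[k * n + col];
    __syncthreads();                   // wait until the whole row is loaded

    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row > k && row < n) {
        double factor = a[row * n + k] / base[k];
        for (int col = k; col < n; col++)
            a[row * n + col] -= factor * base[col];
        b[row] -= factor * b[k];
    }
}
```

The shared-memory size is supplied as the third launch parameter, e.g. `eliminate_shared<<<blocks, threads, n * sizeof(double)>>>(d_a, d_b, n, k);`. For large n the row may exceed the shared-memory limit, in which case you would cache it in tiles instead.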
Think about copying the matrix of coefficients and the driving vector to device memory just once at the beginning, and copying the results back to host memory only at the end. You might find this tricky to do: it might entail using "nested" kernel calls, that is, a kernel that launches another kernel (CUDA dynamic parallelism). Consider this a "stretch goal."
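The copy-once pattern on the host side might look like the sketch below. Here `eliminate` stands for a hypothetical elimination kernel, and `h_a`/`h_b` are host arrays of size n*n and n; only the two transfers at the start and the two at the end touch host memory.

```cuda
// Transfer-once pattern: copy inputs to the device once, launch one
// kernel per pivot, copy results back once at the end.
double *d_a, *d_b;
cudaMalloc(&d_a, n * n * sizeof(double));
cudaMalloc(&d_b, n * sizeof(double));
cudaMemcpy(d_a, h_a, n * n * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(double), cudaMemcpyHostToDevice);

int threads = 256;
int blocks = (n + threads - 1) / threads;
for (int k = 0; k < n - 1; k++)
    eliminate<<<blocks, threads>>>(d_a, d_b, n, k);  // one launch per pivot

cudaMemcpy(h_a, d_a, n * n * sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy(h_b, d_b, n * sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_a);
cudaFree(d_b);
```

Note that this host loop already avoids intermediate transfers; the "nested kernel" stretch goal would move the loop over pivots into a parent kernel on the device, which requires compute capability 3.5 or later and compiling with `-rdc=true`.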
If you are a graduate student, do the undergraduate assignment above, and then do:
Reimplement the assignment above using either OpenCL or OpenACC instead of CUDA.