Parallel Programming (CIS-4230) Home Page
This is the home page for Peter Chapin's CIS-4230 course notes for the Spring 2024 semester.
Here you will find electronic versions of class slides, homework assignments, program samples,
and links to other references of interest. If you are a student taking Parallel Programming, you
should bookmark this page.
- The course syllabus gives an overview of the course and
its content, lists course resources, and describes the grading policy and related issues.
- The homework submission area and grade book are on Canvas but all other course resources
are here.
- There are several books in
the VSC O'Reilly ebook collection that support this course.
- I am (slowly) writing a pthread
tutorial. It is written more for the concurrent programmer, but many of the topics are
directly applicable to the parallel programming we are doing here. The source code for the tutorial, with
sample programs, is on GitHub.
- I have prepared a short document on using Git to help you get
familiar with the system. Some of the samples I will use in this course are on GitHub.
- Solarium is a solar system simulator that
we will use as an example of a parallel programming application. It is a particular instance
of the general "n-body problem."
- My Parallel "Body-of-Knowledge" repository
contains many other examples of parallel programs, in various states of completeness and
correctness.
- My home page contains other resources of potential interest.
Topics
- 2024-01-22.
Course introduction and overview.
- 2024-01-24.
Described the Sum sample program that adds all the items in an array.
- 2024-01-29.
Introduced the Gaussian elimination algorithm and its serial implementation.
- 2024-01-31.
Introduced Homework #2 on Gaussian elimination. Described
Solarium, both the serial and parallel versions.
- 2024-02-05.
Hints for handling the thread function in Homework #2. Demonstrated the effects of different
memory access patterns and caching using the Fortran version of
Gaussian elimination.
- 2024-02-07.
Discussed barriers and thread pools using the Solarium example.
- 2024-02-12.
Discussed the Sudoko board example illustrating how a recursive function can be parallelized.
- 2024-02-14.
More discussion about the parallel version of the Sudoku solver, showing a refined version of
the earlier program. Introduced a QuickSort implementation as a second example of parallel
recursion.
- 2024-02-19.
Finished discussing parallel QuickSort. Introduced OpenMP and demonstrated the OpenMP version
of Solarium.
- 2024-02-21.
Discussed Amdahl's Law and Flynn's Taxonomy. Introduced the demonstration OpenMP program.
- 2024-02-26. No class (Vacation).
- 2024-02-28. No class (Vacation).
- 2024-03-04.
Introduced MPI. Showed the MPI demonstration program and
demonstrated how to execute it on the VTSU cluster.
- 2024-03-06.
Discussed and demonstrated the MPI version of Solarium.
- 2024-03-11.
Introduced Homework #4 on using MPI and OpenMP to compute one
million digits of π.
- 2024-03-13.
Discussed a sample program that computes e using GMP and
MPFR. Demonstrated how to profile the BarnesHut version of Solarium to assess the relative
performance of octree construction vs. force computation.
- 2024-03-18.
Discussed some CUDA samples. Discussed the CUDA version of
Solarium using the "all pairs" algorithm.
- 2024-03-20.
More details about the CUDA version of Solarium. In particular, its use of block-local shared
memory, and the way it uses type float (and why). Discussed the idea of applying CUDA to the
BarnesHut version of Solarium.
- 2023-03-25.
Introduced OpenCL.
- 2024-03-27.
Discussed Homework #5 on CUDA. Reviewed the resource links on
OpenACC and SYCL, etc. Introduced the voltage field problem.
- 2024-04-01.
Discussed the pthread-barriers version of the voltage field program. Discussed some issues
associated with parallelizing the problem on shared memory systems (atomic access to shared
values).
- 2024-04-03. No class.
- 2024-04-08. No class (Vacation).
- 2024-04-10. No class (Vacation).
- 2024-04-15.
Demonstrated the Ada version of Solarium, highlighting the distinction between language-based
and library-based thread management.
- 2024-04-17.
Demonstrated the Scala version of Solarium, showing one way of doing parallel programming in a
functional context.
- 2024-04-22.
Guest presentation from Henry Course demonstrating parallel programming using Julia.
- 2024-04-24.
Introduced the ideas instruction level parallelism, memory models, and thread interference in
space and time.
Slides
Homework
- Homework #1. Introduction. In this assignment, you will set
up your environment on Lemuria and experiment with a simple parallel program that adds
elements in an array. Due: 2024-02-02
- Homework #2. Gaussian Elimination. In this assignment, you
will parallelize the Gaussian elimination program. Due: 2024-02-09
- Homework #3. Barriers and Thread Pools. In this assignment,
you will further enhance the Gaussian elimination program to use first barriers, and then
again for thread pools. This is to compare if and how those methods improve the performance of
the simple parallel version you made in the previous assignment. Due: 2024-02-23
- Homework #4. Computing π. In this assignment you will use
a combination of MPI and OpenMP to compute π to one million decimal digits of precision
with Ramanujan's formula. Due: 2024-04-19
- Homework #5 Gaussian Elimination using CUDA. In this
assignment, you will implement a parallel Gaussian elimination program using CUDA that
processes rows with individual threads. Due: 2024-05-03
Samples
Most, if not all, of these samples can be found in my Parallel "Body-of-Knowledge" repository on
GitHub. The versions in the repository are likely to be newer than the versions linked here.
Resources/Articles
General Resources
Gaussian Elimination
Fundamental Concepts
- Wikipedia's page on Amdahl's Law
puts a limit on speed up as the number of processors increases without bound.
- Wikipedia's page on Gustafson's Law explains why
Amdahl's Law isn't as much of a problem as it may appear.
- Wikipedia's page on Flynn's
Taxonomy.
Memory Access Issues
- What Every
Programmer Should Know about Memory by Ulrich Drepper is a detailed discussion of how
memory works from a software person's point of view. This article is quite technical and
assumes a significant comfort level with hardware topics. The article is fairly old, but still
relevant. Here is a Stack
Overflow thread discussing some updated topics related to the article.
- The Linux Journal article Memory Ordering in Modern Microprocessors (2005) by Paul E.
McKenney discusses some interesting issues surrounding memory access in multiprocessor
systems. Part I, Part II.
Programming Languages
OpenMP
- OpenMP is a thread abstraction for parallel programming
that we will investigate in this course.
- OpenMP Specification. A direct link to the
OpenMP specifications on the OpenMP site.
- OpenMP tutorial. A tutorial on
OpenMP.
The N-Body Problem
- Parallel N-Body Simulations. A
page discussing approaches for solving N-body simulations efficiently.
- Wikipedia's article on Octrees gives an
overview of the data structure that is central to Barnes-Hut.
MPI
- MPI Forum. MPI is a message passing abstraction that is
widely used in cluster environments. This site is the home for the official MPI standard.
- MPI Specification version 4.1. This s the latest version as of
November 2, 2023.
- The User's Guide to MPI is a tutorial on MPI programming.
It is a bit easier to read than the MPI standard itself.
- MPICH. A cross-platform implementation of MPI with a
high degree of standards conformance.
- Open MPI. Another well-respected implementation of
MPI targetting Unix-like systems.
- The Microsoft
implementation of MPI for Windows.
GPU Programming & CUDA
OpenCL
OpenACC & SYCL
Memory Models, Lock-Free Programming, Etc.
General Information
- ILP.asm is a short assembly language program that demonstrates the
effects of instruction level parallelism.
- A Wikipedia article on
non-blocking algorithms with links to articles on many related concepts.
- The ABA Problem is a problem that
arises in the design of lock-free data structures, especially those that use the CAS
primitive.
Here are some related papers.
- The Java Memory Model by Jeremy Manson, William
Pugh, and Sarita Adve. Principles of Programming Languages, January 2005, pages 378-391. This
paper describes the memory model used by Java starting with Java v5.
- C++ and the Perils of Double-Checked
Locking by Scott Meyers and Andrei Alexandrescu. Dr. Dobb's Journal July/August 2004.
In this paper, the authors describe some of the problems inherent in the pre-C++2011 standard
with regard to writing correct multithreaded programs in C++.
- Lock-free Dynamically Resizable
Arrays by Damian Dechev, Peter Pirkelbauer, and Bjarne Stroustrup. This paper talks
about a lock-free design for C++'s std::vector container and compares its performance with
traditional designs using locks. This paper predates the C++ 2011 standard, which provided a
well-defined memory model for C++.
Machine Information
- You will be using lemuria extensively in this course. This
document provides information on configuring your lemuria account for effective access to the
cluster and for running MPI programs on the cluster.
- Intel
Xeon E5620 is the processor in lemuria. It is a 64-bit quad-core processor with
hyper-threading, clocked at 2.40 GHz, with an 12 MiB L2 cache. Lemuria is a dual-processor
machine (with two E5620 processors for a total of 16 hardware threads).
- Intel
Xeon W3520 is the processor in the VTC cluster nodes. It is a 64-bit quad-core processor
with hyper-threading, clocked at 2.67 GHz, and with an 8 MiB L3 cache. The formal datasheets
for the 3500 series Xeon processors are here: Volume 1, Volume 2.
- The GPU on the cluster nodes is NVIDIA's GeForce GTX 1060. This GPU is
CUDA-capable.
- Intel64 and Intel IA-32
documentation direct from Intel.
Last Revised: 2024-04-25
© Copyright 2024 by Peter Chapin
<peter.chapin@vermontstate.edu>