Homework 1 // Due at Lecture Friday, February 18, 2011
Perform this assignment on malbec.cs.wisc.edu,
a 64-thread Sun UltraSparc-T2, where we have activated a CS account for you.
(Unless you already had an account, you should use this machine
only for CS757 homework assignments or your course project.
Unless you have a longer-term affiliation with CS, this account and its storage
will be removed after the end of the semester.)
If you wish to learn more about malbec,
run /usr/platform/sun4v/sbin/prtdiag -v on malbec.
Malbec is a chip multiprocessor, with 8 8-way multi-threaded cores.
Each core has its own L1 cache and executes the threads in approximate round-robin order. The L2 cache is shared by all 8 cores (and 64 threads).
The architecture manual and this
tuning document describe additional details of this machine.
You should do this assignment alone. No late assignments.
Purpose
The purpose of this assignment is to give you experience writing a shared memory
program with pthreads to build your understanding of shared memory programming,
in general, and pthreads, in particular.
Programming Environment: Pthreads
The methods of communication and synchronization are two characteristics that
define a parallel programming model.
In a shared-memory model,
all threads access a single
address space (hence the name "shared memory"). Thus, communication
occurs implicitly, because whenever one thread writes to an address, the update
is immediately visible to all others. On the other hand, this introduces the
need for explicit synchronization to avoid races when multiple
threads access data at the same address.
Different systems support threads in a variety of ways. For this assignment,
you will use Posix Threads, or pthreads for short. One advantage of this
particular implementation is its portability to many different systems. Do
man pthreads from malbec to see a summary of pthreads functions. You
can find details for any listed function by doing man function_name.
Note that the man page also includes information about Solaris threads. For
this assignment, you should use Posix threads, not Solaris threads.
All threaded programs begin with a single thread. When you reach a section
you wish to do in parallel, you will "fork" one or more threads to
share the work, telling each thread which function to begin executing. Keep
in mind that the original, "main" thread continues to execute the code
following the fork. Thus, the thread model also provides "join"
functionality, which lets you tell the main thread to first wait for a child to
complete, and then merge with it.
It is likely that you will have many threads execute the same routine, and
that you will want each thread to have its own private set of local variables.
With pthreads, this occurs automatically. Only global variables are truly shared.
When you have multiple threads working on the same data, you may require
synchronization for correctness. Pthreads directly supports two techniques:
mutual exclusion (locks) and condition variables. A third technique you may
find useful is barriers. A barrier makes all threads wait for each other to
arrive at a certain point in the code before any of them can continue. While
pthreads offers no direct barrier support, you should be able to build a barrier
of your own out of the mutex and condition-variable primitives.
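One way such a barrier can be sketched is with a mutex, a condition variable, and a generation counter so the barrier is reusable across rounds. (Some modern pthreads implementations also provide pthread_barrier_t directly; building your own is the instructive route here. The type and function names below are illustrative.)

```c
#include <pthread.h>

/* A reusable counting barrier: the last thread to arrive advances the
 * generation and wakes everyone; earlier arrivals sleep until the
 * generation changes. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int             count;      /* threads still expected this round */
    int             nthreads;   /* total participants */
    unsigned        generation; /* incremented once per completed round */
} barrier_t;

void barrier_init(barrier_t *b, int n) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->count = n;
    b->nthreads = n;
    b->generation = 0;
}

void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    unsigned gen = b->generation;
    if (--b->count == 0) {            /* last thread to arrive */
        b->generation++;
        b->count = b->nthreads;       /* reset for the next round */
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (gen == b->generation)  /* guards against spurious wakeups */
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}
```

Waiting on the generation counter, rather than on the count itself, is what makes the barrier safe to reuse: a thread racing ahead into the next round cannot be confused with one still leaving the previous round.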
A final issue with threads is how they are mapped to processors. Obviously,
we can have more threads than processors, but someone must then time multiplex
the threads onto the processors. Since this is probably more complexity than
you want to deal with, we recommend that you directly bind each thread to a
particular processor. More precisely, on Solaris we actually bind threads to
light-weight processes (LWPs), but all you really need to know is that by
binding threads, you take over from the operating system the decision of
which thread runs on which processor.
Note that all source files that call the pthread API must include
the pthread.h header file. Malbec has both the Sun cc/c++ compilers
and the GNU gcc/g++ compilers. You may use whichever you choose.
When compiling with cc/c++, be sure to specify the -mt compile
flag and the -lpthread link flag (note that it's ok to pass
both flags during compilation). When using gcc or g++, you
will need to use the -lpthread link flag.
For more information:
- The pthreads example program demonstrated in class is available for review
here: tar.gz or zip.
- This
tutorial from Lawrence Livermore is also an excellent reference.
Programming Task: Parallel Sorting
Sorting is one of the most common and important tasks computers
perform. Numerous algorithms have been devised and analyzed with the
goal of sorting fast and efficiently. In this assignment, we will
focus on two particular sorting algorithms: Shell Sort and Quick Sort.
Shell Sort is a constrained version of the more widely known
insertion sort. Named for its creator Donald Shell, this sorting
algorithm has the benefit that it is remarkably easy to code, and thus
serves as a great introduction into pthread programming.
Shell Sort Algorithm (Excerpt taken from Wikipedia)
The principle of Shell sort is to rearrange the file so that
looking at every hth element yields a sorted file. We call such a file
h-sorted. If the file is then k-sorted for some other integer k, then
the file remains h-sorted. For instance, if a list was 5-sorted and
then 3-sorted, the list is now not only 3-sorted, but both 5- and
3-sorted. If this were not true, the algorithm would undo work that it
had done in previous iterations, and would not achieve such a low
running time.
The algorithm draws upon a sequence of positive integers known as
the increment sequence. Any sequence will do, as long as it ends with
1, but some sequences perform better than others. The algorithm
begins by performing a gap insertion sort, with the gap being the
first number in the increment sequence. It continues to perform a gap
insertion sort for each number in the sequence, until it finishes with
a gap of 1. When the increment reaches 1, the gap insertion sort is
simply an ordinary insertion sort, guaranteeing that the final list is
sorted. Beginning with large increments allows elements in the file to
move quickly towards their final positions, and makes it easier to
subsequently sort for smaller increments.
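The gap insertion sort described above can be sketched as follows, here with the simple halving sequence n/2, n/4, ..., 1 (the assignment lets you pick any sequence ending in 1):

```c
#include <stddef.h>

/* Shell sort: repeated gap insertion sort with a shrinking gap.
 * When gap == 1 this is an ordinary insertion sort, so the final
 * array is guaranteed sorted. */
void shell_sort(int *a, size_t n) {
    for (size_t gap = n / 2; gap > 0; gap /= 2) {
        /* Insert a[i] into its gap-separated chain. */
        for (size_t i = gap; i < n; i++) {
            int tmp = a[i];
            size_t j = i;
            while (j >= gap && a[j - gap] > tmp) {
                a[j] = a[j - gap];   /* shift larger elements right by gap */
                j -= gap;
            }
            a[j] = tmp;
        }
    }
}
```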
More information on the basics of Shell Sort can be found in
this Wikipedia
article.
RadixSort Algorithm (Excerpt taken from Wikipedia)
Each key is first figuratively dropped into one level of buckets corresponding to the value of the rightmost digit. Each bucket preserves the original order of the keys as they are dropped into it. There is a one-to-one correspondence between the number of buckets and the number of values that can be represented by a digit. Then, the process repeats with the next neighbouring digit until there are no more digits to process. In other words:
- Take the least significant digit (or group of bits, both being examples of radices) of each key.
- Group the keys based on that digit, but otherwise keep the original order of keys. (This is what makes the LSD radix sort a stable sort).
- Repeat the grouping process with each more significant digit.
The sort in step 2 is usually done using bucket sort or counting sort, which are efficient in this case since there are usually only a small number of digits. When parallelizing Radix sort in problem 4, you will probably find counting sort easier to work with.
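The three steps above, using a counting sort per digit, can be sketched like this for unsigned integers with 8-bit digits (radix 256); stability of the per-digit pass is what preserves the ordering established by earlier passes:

```c
#include <stdlib.h>
#include <string.h>

/* LSD radix sort: four counting-sort passes, one per byte,
 * least significant byte first. */
void radix_sort(unsigned *a, size_t n) {
    unsigned *tmp = malloc(n * sizeof *a);
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        /* 1. Histogram the current digit (offset by one for step 2). */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;
        /* 2. Prefix sum: count[d] becomes the start offset of bucket d. */
        for (int d = 0; d < 256; d++)
            count[d + 1] += count[d];
        /* 3. Scatter in order, which keeps equal digits stable. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```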
Additional information can be found in
the Wikipedia
article or any introductory algorithms book.
Problem 1: Write Sequential ShellSort
Use the program template here
for all problems in this assignment.
For this problem you are to implement the sort() method
of the ShellSorter class. The performance of Shell Sort is affected by
the gap sequence; you are free to choose any sequence you like, and may
even dynamically choose it based on the input size (though this
is not required).
You can test your sort using the provided makefile and
the shellsort program that is created. shellsort
takes one required argument, -s, which indicates the size of
the array to be sorted. The array is populated with random values
during initialization. You can use the -r option to seed the
random number generator for repeatable results.
Problem 2: Write Sequential Radix Sort
For this problem you are to implement the sort() method of
the RadixSorter class. Like before, you should use the provided
template to build and execute the radixsort program.
Problem 3: Write Parallel ShellSort
For this problem, you will use pthreads to parallelize your program
from Problem 1. The provided Makefile will generate
a shellsort-parallel executable that takes two
arguments. Like before, the -s argument specifies the size of
the input array. The second argument, -n, specifies the
number of worker threads to be created and used by the parallel Shell
Sort implementation.
You are free to decide how the work will be split among worker
threads. However, to ensure that you have some practice with pthread
synchronization, your parallel implementation must use at least one
barrier.
Problem 4: Write Parallel RadixSort
For this problem, you will use pthreads to parallelize your program
from Problem 2. The provided Makefile will generate
a radixsort-parallel executable that takes the same arguments
as shellsort-parallel.
Again, you are free to decide how best to distribute work among
threads. This time, however, you must use at least one mutex.
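As one example of where a mutex naturally fits in a parallel counting-sort pass: each thread can histogram its own slice of the array privately, then fold its local counts into a shared histogram under a lock. The names below (merge_counts, global_count) are illustrative, not from the template.

```c
#include <pthread.h>
#include <stddef.h>

#define RADIX 256   /* one bucket per 8-bit digit value */

static size_t global_count[RADIX];
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called once per worker with that worker's private per-digit counts.
 * The mutex serializes the read-modify-write of the shared histogram. */
void merge_counts(const size_t local_count[RADIX]) {
    pthread_mutex_lock(&count_lock);
    for (int d = 0; d < RADIX; d++)
        global_count[d] += local_count[d];
    pthread_mutex_unlock(&count_lock);
}
```

After all workers have merged, a single prefix-sum over global_count gives the bucket offsets for the scatter step.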
Problem 5: Analysis
In this section, you will analyze the performance of your four
sorting implementations. The supplied program template already
measures and reports the execution time of your sorts; you should use
this time in this analysis (and not the shell's built-in time
command). So that all results in the class are on equal footing, you
should also seed your runs with the value 1234 by using
the -r runtime option.
Part A: Plot the normalized (versus the serial version)
speedups of Programs 3 and 4 on N=[1,2,4,8,16,32,64] threads for an input
of 30000000 entries. We recommend that you use multiple trials
in conjunction with some averaging strategy. Describe the procedure
you used.
Part B: Repeat Part A, but bind threads (see the processor_bind man page) to
processors in the following orders and for N=[1,2,4,8,16,32,64]:
- B-1: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
- B-2: 0, 16, 32, 48, 1, 17, 33, 49, 2, 18, 34, 50, 3, 19, 35, 51, 4, 20, 36, 52, 5, 21, 37, 53, 6, 22, 38, 54, 7, 23, 39, 55, 8, 24, 40, 56, 9, 25, 41, 57, 10, 26, 42, 58, 11, 27, 43, 59, 12, 28, 44, 60, 13, 29, 45, 61, 14, 30, 46, 62, 15, 31, 47, 63
- B-3: 0, 8, 16, 24, 32, 40, 48, 56, 1, 9, 17, 25, 33, 41, 49, 57, 2, 10, 18, 26, 34, 42, 50, 58, 3, 11, 19, 27, 35, 43, 51, 59, 4, 12, 20, 28, 36, 44, 52, 60, 5, 13, 21, 29, 37, 45, 53, 61, 6, 14, 22, 30, 38, 46, 54, 62, 7, 15, 23, 31, 39, 47, 55, 63
Obviously, you will not use all processors in all configurations. Plot the normalized speedup of B-1 through
B-3 on a single graph. Comment on the shapes of the curves.
Part C: Plot the absolute runtimes of parallel Shell Sort and parallel Radix Sort on the same graph, using N=[1,2,4,8,16,32,64].
Tips and Tricks
- Start early.
- Make use of the demo programs provided.
- You can use /usr/platform/sun4v/sbin/prtdiag -v on malbec to learn
many useful characteristics of your host machine.
- pthreads was not designed to work well with C++ classes. You can (and should) use the helper function in the Sorter class as the thread body (i.e., the argument to pthread_create). That helper function will extract the object pointer (this) from the supplied SorterArgs argument and call that object's thread_body method, which you will create. If you need to pass additional arguments to the real thread_body function, you can use the inherited SorterArg classes provided in Sorters.hh
- Set up RSA authentication on malbec to save yourself some keystrokes. HowTo
- The unique identifiers returned by pthread_self() do not necessarily range
from 0 to the total number of threads.
What to Hand In
Please turn this homework in on paper at the beginning of lecture. You
must include:
- A printout of your implementation of all four sorts (i.e., RadixSorter.cc, ShellSorter.cc, and any supporting files). You do not need to include the setup code provided to you (main.cc) unless it was substantially changed.
- A brief description of the strategies you used for your implementations
- The plots as described in Problems 5a, 5b, and 5c, including labels describing your data.
- A brief explanation of your results, including any anomalies you weren't expecting.
Important: Include your name on EVERY page.