UW-Madison
Computer Sciences Dept.

CS 757 Computer Architecture II Spring 2011 Section 1
Instructor David A. Wood
URL: http://www.cs.wisc.edu/~david/courses/cs757/Spring2011/

Homework 1 // Due at Lecture Friday, February 18, 2011

Perform this assignment on malbec.cs.wisc.edu, a 64-thread Sun UltraSparc-T2, where we have activated a CS account for you. (Unless you already had an account, you should use this machine only for CS757 homework assignments or your course project. Unless you have a longer-term affiliation with CS, this account and its storage will be removed after the end of the semester.)

If you wish to learn more about malbec, run /usr/platform/sun4v/sbin/prtdiag -v on malbec. Malbec is a chip multiprocessor with 8 8-way multi-threaded cores. Each core has its own L1 cache and executes its threads in approximate round-robin order. The L2 cache is shared by all 8 cores (and 64 threads). The architecture manual and this tuning document describe additional details of this machine.

You should do this assignment alone. No late assignments.

Purpose

The purpose of this assignment is to give you experience writing a shared memory program with pthreads to build your understanding of shared memory programming, in general, and pthreads, in particular.

Programming Environment: Pthreads

The methods of communication and synchronization are two characteristics that define a parallel programming model. In a shared-memory model, all threads access a single address space (hence the name "shared memory"). Thus, communication occurs implicitly, because whenever one thread writes to an address, the update becomes visible to all others. On the other hand, this introduces the need for explicit synchronization to avoid races among multiple threads accessing data at the same address.

Different systems support threads in a variety of ways. For this assignment, you will use Posix Threads, or pthreads for short. One advantage of this particular API is its portability across many different systems. Run man pthreads on malbec to see a summary of pthreads functions. You can find details for any listed function by doing man function_name. Note that the man page also includes information about Solaris threads. For this assignment, you should use Posix threads, not Solaris threads.

All threaded programs begin with a single thread. When you reach a section you wish to do in parallel, you will "fork" one or more threads to share the work, telling each thread in which function to begin execution. Keep in mind that the original, "main" thread continues to execute the code following the fork. Thus, the thread model also provides "join" functionality, which lets you tell the main thread to first wait for a child to complete, and then merge with it.

It is likely that you will have many threads execute the same routine, and that you will want each thread to have its own private set of local variables. With pthreads, this occurs automatically. Only global variables are truly shared.

When you have multiple threads working on the same data, you may require synchronization for correctness. Pthreads directly supports two techniques: mutual exclusion (locks) and condition variables. A third technique you may find useful is barriers. A barrier makes all threads wait for each other to arrive at a certain point in the code before any of them can continue. While pthreads offers no direct barrier support, you should be able to build a barrier of your own out of the mutual exclusion primitive.

A final issue with threads is how they are mapped to processors. Obviously, we can have more threads than processors, but someone must then time multiplex the threads onto the processors. Since this is probably more complexity than you want to deal with, we recommend that you directly bind each thread to a particular processor. More precisely, on Solaris we actually bind threads to light-weight processes (LWP's), but all you really need to know is that binding fixes each thread to a particular processor, so you need not worry about how the operating system schedules threads onto processors.
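A Solaris-specific sketch of such binding, using the processor_bind(2) system call (the function name bind_self_to is illustrative; valid processor ids on malbec are 0..63, and this fragment only compiles on Solaris):

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

/* Bind the calling thread's LWP to processor cpu.  P_LWPID with
   P_MYID selects the current LWP, i.e., the calling thread. */
void bind_self_to(processorid_t cpu) {
    if (processor_bind(P_LWPID, P_MYID, cpu, NULL) != 0)
        perror("processor_bind");
}
```

Each worker would call something like bind_self_to(my_cpu) as its first action, with my_cpu chosen from one of the orderings in Problem 5, Part B.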

Note that all source files that call the pthread API must include the pthread.h header file. Malbec has both the Sun cc/c++ compilers and the GNU gcc/g++ compilers. You may use whichever you choose. When compiling with cc/c++, be sure to specify the -mt compile flag and the -lpthread link flag (note that it's ok to pass both flags during compilation). When using gcc or g++, you will need to use the -lpthread link flag.

For more information:

  • The pthreads example program demonstrated in class is available for review here: tar.gz or zip.
  • This tutorial from Lawrence Livermore is also an excellent reference.

Programming Task: Parallel Sorting

Sorting is one of the most common and important tasks computers perform. Numerous algorithms have been devised and analyzed with the goal of sorting fast and efficiently. In this assignment, we will focus on two particular sorting algorithms: Shell Sort and Radix Sort.

Shell Sort is a generalized version of the more widely known insertion sort. Named for its creator Donald Shell, this sorting algorithm has the benefit that it is remarkably easy to code, and thus serves as a great introduction to pthread programming.

Shell Sort Algorithm (Excerpt taken from Wikipedia)

The principle of Shell sort is to rearrange the file so that looking at every hth element yields a sorted file. We call such a file h-sorted. If the file is then k-sorted for some other integer k, then the file remains h-sorted. For instance, if a list was 5-sorted and then 3-sorted, the list is now not only 3-sorted, but both 5- and 3-sorted. If this were not true, the algorithm would undo work that it had done in previous iterations, and would not achieve such a low running time.

The algorithm draws upon a sequence of positive integers known as the increment sequence. Any sequence will do, as long as it ends with 1, but some sequences perform better than others. The algorithm begins by performing a gap insertion sort, with the gap being the first number in the increment sequence. It continues to perform a gap insertion sort for each number in the sequence, until it finishes with a gap of 1. When the increment reaches 1, the gap insertion sort is simply an ordinary insertion sort, guaranteeing that the final list is sorted. Beginning with large increments allows elements in the file to move quickly towards their final positions, and makes it easier to subsequently sort for smaller increments.

More information on the basics of Shell Sort can be found in this Wikipedia article

RadixSort Algorithm (Excerpt taken from Wikipedia)

Each key is first figuratively dropped into one level of buckets corresponding to the value of the rightmost digit. Each bucket preserves the original order of the keys as they are dropped into it. There is a one-to-one correspondence between the number of buckets and the number of values that can be represented by a digit. Then, the process repeats with the next more significant digit until there are no more digits to process. In other words:

  1. Take the least significant digit (or group of bits, both being examples of radices) of each key.
  2. Group the keys based on that digit, but otherwise keep the original order of keys. (This is what makes the LSD radix sort a stable sort).
  3. Repeat the grouping process with each more significant digit.

The sort in step 2 is usually done using bucket sort or counting sort, which are efficient in this case since the number of possible digit values is small. When parallelizing Radix sort in Problem 4, you will probably find counting sort easier to work with.

Additional information can be found in the Wikipedia article or any introductory algorithms book.

Problem 1: Write Sequential ShellSort

Use the program template here for all problems in this assignment.

For this problem you are to implement the sort() method of the ShellSorter class. The performance of Shell Sort is affected by gap sequence; you are free to choose any sequence you like, and may even dynamically choose it based on the input size (though this is not required).

You can test your sort using the provided makefile and the shellsort program that it creates. shellsort takes one required argument, -s, which indicates the size of the array to be sorted. The array is populated with random values during initialization. You can use the -r option to seed the random number generator for repeatable results.

Problem 2: Write Sequential Radix Sort

For this problem you are to implement the sort() method of the RadixSorter class. Like before, you should use the provided template to build and execute the radixsort program.

Problem 3: Write Parallel ShellSort

For this problem, you will use pthreads to parallelize your program from Problem 1. The provided Makefile will generate a shellsort-parallel executable that takes two arguments. Like before, the -s argument specifies the size of the input array. The second argument, -n, specifies the number of worker threads to be created and used by the parallel Shell Sort implementation.

You are free to decide how the work will be split among worker threads. However, to ensure that you have some practice with pthread synchronization, your parallel implementation must use at least one barrier.

Problem 4: Write Parallel RadixSort

For this problem, you will use pthreads to parallelize your program from Problem 2. The provided Makefile will generate a radixsort-parallel executable that takes the same arguments as shellsort-parallel.

Again, you are free to decide how best to distribute work among threads. This time, however, you must use at least one mutex.

Problem 5: Analysis

In this section, you will analyze the performance of your four sorting implementations. The supplied program template already measures and reports the execution time of your sorts; you should use this time in your analysis (not the shell's built-in time command). So that all results in the class are on an equal footing, you should also seed your runs with the value 1234 using the -r runtime option.

Part A: Plot the normalized (versus the serial version) speedups of Problems 3 and 4 on N=[1,2,4,8,16,32,64] threads for an input of 30000000 entries. We recommend that you use multiple trials in conjunction with some averaging strategy. Describe the procedure you used.

Part B: Repeat Part A, but bind threads (see the processor_bind man page) to processors in the following orders and for N=[1,2,4,8,16,32,64]:

  • B-1: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
  • B-2: 0, 16, 32, 48, 1, 17, 33, 49, 2, 18, 34, 50, 3, 19, 35, 51, 4, 20, 36, 52, 5, 21, 37, 53, 6, 22, 38, 54, 7, 23, 39, 55, 8, 24, 40, 56, 9, 25, 41, 57, 10, 26, 42, 58, 11, 27, 43, 59, 12, 28, 44, 60, 13, 29, 45, 61, 14, 30, 46, 62, 15, 31, 47, 63
  • B-3: 0, 8, 16, 24, 32, 40, 48, 56, 1, 9, 17, 25, 33, 41, 49, 57, 2, 10, 18, 26, 34, 42, 50, 58, 3, 11, 19, 27, 35, 43, 51, 59, 4, 12, 20, 28, 36, 44, 52, 60, 5, 13, 21, 29, 37, 45, 53, 61, 6, 14, 22, 30, 38, 46, 54, 62, 7, 15, 23, 31, 39, 47, 55, 63

Obviously, you will not use all processors in all configurations. Plot the normalized speedup of B-1 through B-3 on a single graph. Comment on the shapes of the curves.

Part C: Plot the absolute runtimes of parallel Shell Sort and parallel Radix Sort on the same graph, using N=[1,2,4,8,16,32,64].

Tips and Tricks

  • Start early.
  • Make use of the demo programs provided.
  • You can use /usr/platform/sun4v/sbin/prtdiag -v on malbec to learn many useful characteristics of your host machine.
  • pthreads was not designed to work well with C++ classes. You can (and should) use the helper function in the Sorter class as the thread body (i.e., the argument to pthread_create). That helper function will extract the object pointer (this) from the supplied SorterArgs argument and call that object's thread_body method, which you will create. If you need to pass additional arguments to the real thread_body function, you can use the inherited SorterArgs classes provided in Sorters.hh.
  • Set up RSA authentication on malbec to save yourself some keystrokes. HowTo
  • The unique identifiers returned by pthread_self() do not necessarily range from 0 to the total number of threads.

What to Hand In

Please turn this homework in on paper at the beginning of lecture. You must include:

  • A printout of your implementation of all four sorts (i.e., RadixSorter.cc, ShellSorter.cc, and any supporting files). You do not need to include the setup code provided to you (main.cc) unless it was substantially changed.
  • A brief description of the strategies you used for your implementations
  • The plots as described in Problems 5a, 5b, and 5c, including labels describing your data.
  • A brief explanation of your results, including any anomalies you weren't expecting.

Important: Include your name on EVERY page.

 