UW-Madison
Computer Sciences Dept.

CS 757 Computer Architecture II Spring 2011 Section 1
Instructor David A. Wood
URL: http://www.cs.wisc.edu/~david/courses/cs757/Spring2011/

Homework 2 // Due at Lecture Friday, February 18, 2011

Perform this assignment on malbec.cs.wisc.edu, a 64-thread Sun UltraSPARC T2, on which we have activated a CS account for you. (Unless you already had an account, you should use this machine only for CS757 homework assignments or your course project. Unless you have a longer-term affiliation with CS, this account and its storage will be removed after the end of the semester.)

If you wish to learn more about malbec, run /usr/platform/sun4v/sbin/prtdiag -v on malbec. Malbec is a chip multiprocessor with eight 8-way multithreaded cores. Each core has its own L1 cache and executes its threads in approximately round-robin order. The L2 cache is shared by all 8 cores (and thus all 64 threads). The architecture manual and this tuning document describe additional details of this machine.

You should do this assignment alone. No late assignments.

Purpose

The purpose of this assignment is to give you experience writing a shared memory program with pthreads to build your understanding of shared memory programming, in general, and pthreads, in particular.

Programming Environment: Pthreads

The methods of communication and synchronization are two characteristics that define a parallel programming model. In a shared-memory model, all threads access a single address space (hence the name "shared memory"). Thus, communication occurs implicitly: whenever one thread writes to an address, the update becomes visible to all others. On the other hand, this introduces the need for explicit synchronization to avoid data races when multiple threads access the same address.

Different systems support threads in a variety of ways. For this assignment, you will use Posix Threads, or pthreads for short. One advantage of this particular implementation is its portability to many different systems. Run man pthreads on malbec to see a summary of pthreads functions. You can find details for any listed function by doing man function_name. Note that the man page also includes information about Solaris threads. For this assignment, you should use Posix threads, not Solaris threads.

All threaded programs begin with a single thread. When you reach a section you wish to do in parallel, you will "fork" one or more threads to share the work, telling each thread in which function to begin execution. Keep in mind that the original, "main" thread continues to execute the code following the fork. Thus, the thread model also provides "join" functionality, which lets you tell the main thread to first wait for a child to complete, and then merge with it.

It is likely that you will have many threads execute the same routine, and that you will want each thread to have its own private set of local variables. With pthreads, this occurs automatically. Only global variables are truly shared.

When you have multiple threads working on the same data, you may require synchronization for correctness. Pthreads directly supports two techniques: mutual exclusion (locks) and condition variables. A third technique you may find useful is barriers. A barrier makes all threads wait for each other to arrive at a certain point in the code before any of them can continue. While pthreads offers no direct barrier support, you should be able to build a barrier of your own out of the mutual exclusion primitive.

A final issue with threads is how they are mapped to processors. Obviously, we can have more threads than processors, but someone must then time multiplex the threads onto the processors. Since this is probably more complexity than you want to deal with, we recommend that you directly bind each thread to a particular processor. More precisely, on Solaris we actually bind threads to light-weight processes (LWP's), but all you really need to know is that by binding threads, you transfer responsibility for contention management of threads on processors to the operating system.

Note that all source files that call the pthread API must include the pthread.h header file. Malbec has both the Sun cc/c++ compilers and the GNU gcc/g++ compilers. You may use whichever you choose. When compiling with cc/c++, be sure to specify the -mt compile flag and the -lpthread link flag (note that it's ok to pass both flags during compilation). When using gcc or g++, you will need to use the -lpthread link flag.
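A build setup along these lines should work (the file name nbody.c is a hypothetical example, not a requirement):

```make
# Sketch of a Makefile for this assignment. With gcc/g++, only the
# -lpthread link flag is needed; with Sun cc, use -mt -lpthread instead.
CC     = gcc
CFLAGS = -O2
LDLIBS = -lpthread

nbody: nbody.c
	$(CC) $(CFLAGS) -o nbody nbody.c $(LDLIBS)
```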

For more information:

  • The pthreads example program demonstrated in class is available for review here: tar.gz or zip.
  • This tutorial from Lawrence Livermore is also an excellent reference.

Programming Task: N-Body Simulation

An n-body simulation calculates the gravitational effects of the masses of n bodies on each other's positions and velocities. The final values are generated by incrementally updating the bodies over many small time-steps. This involves calculating the pairwise force exerted on each particle by all other particles, an O(n²) operation. For simplicity, we will model the bodies in a two-dimensional space.

The physics. We review the equations governing the motion of the particles according to Newton's laws of motion and gravitation. Don't worry if your physics is a bit rusty; all of the necessary formulas are included below. We already know each particle's position (rx, ry) and velocity (vx, vy). To model the dynamics of the system, we must determine the net force exerted on each particle.

  • Pairwise force. Newton's law of universal gravitation asserts that the strength of the gravitational force between two particles is given by the product of their masses divided by the square of the distance between them, scaled by the gravitational constant G = 6.67 × 10⁻¹¹ N·m²/kg². The pull of one particle toward another acts along the line between them. Since we will be using Cartesian coordinates to represent the position of a particle, it is convenient to break up the force into its x and y components: with Δx and Δy the coordinate differences between the two particles and d = √(Δx² + Δy²) the distance, the magnitude is F = G m₁ m₂ / d², and the components are Fx = F · Δx/d and Fy = F · Δy/d.

  • Net force. The principle of superposition says that the net force acting on a particle in the x or y direction is the sum of the pairwise forces acting on the particle in that direction.

  • Acceleration. Newton's second law of motion postulates that the accelerations in the x and y directions are given by: ax = Fx / m, ay = Fy / m.

The numerics.  We use the leapfrog finite difference approximation scheme to numerically integrate the above equations: this is the basis for most astrophysical simulations of gravitational systems. In the leapfrog scheme, we discretize time, and update the time variable t in increments of the time quantum Δt. We maintain the position and velocity of each particle, but they are half a time step out of phase (which explains the name leapfrog). The steps below illustrate how to evolve the positions and velocities of the particles.

For each particle:

  1. Calculate the net force acting on it at time t using Newton's law of gravitation and the principle of superposition.

  2. Calculate its acceleration (ax, ay) at time t using its force at time t and Newton's second law of motion.

  3. Calculate its velocity at time t + Δt / 2 by using its acceleration at time t and its velocity (vx, vy) at time t - Δt / 2. Assume that the acceleration remains constant in this interval, so that the updated velocity is:   vx = vx + Δt ax,   vy = vy + Δt ay.

  4. Calculate its position at time t + Δt by using its velocity at time t + Δt / 2 and its position at time t. Assume that the velocity remains constant in the interval from t to t + Δt, so that the resulting position is given by rx = rx + Δt vx,   ry = ry + Δt vy. Note that because of the leapfrog scheme, the constant velocity we are using is the one estimated at the middle of the interval rather than either of the endpoints.

As you would expect, the simulation is more accurate when Δt is very small, but this comes at the price of more computation.

Problem 1: Write Sequential N-Body

Write a single-threaded solution to the n-body problem described above. Your program should take as input the number of particles, the length of the time-step Δt, and the number of these time-steps to execute. Your final version should randomly generate the mass, x and y Cartesian coordinates, and x and y velocity components for each particle, then simulate for the specified number of time-steps.

For the randomly generated values, you may use any range that you like. However, if you have very large masses which happen to be initialized to positions that are very close to each other, you may encounter overflow. Since the focus of this assignment is parallel programming and scalability, you will not be graded on elegantly handling or avoiding such situations.

You are required to make an argument that your n-body implementation is correct. A good way to do this is to initialize the bodies to a special-case starting condition, and then show that after a number of time steps the state of the system exhibits symmetry or some other expected property. You need not prove your implementation's correctness in the literal sense. However, please annotate any simulation outputs clearly.

Problem 2: Write Pthreads N-Body

For this problem, you will use pthreads to parallelize your program from Problem 1. This program should take an additional command-line argument: the number of threads to use in the parallel section.

It is only required that you parallelize the main portion of the simulation, but parallelizing the data generation phase of your n-body program is also worthwhile. You will not be penalized if you choose not to parallelize the initialization phase.

To ensure that you have some practice with mutual exclusion, your parallel n-body program must use at least one lock. One way to satisfy this requirement is to implement your own barrier.

For simplicity, you may assume that the number of bodies is evenly divisible by the number of threads.

Problem 3: Analysis of N-Body

In this section, you will analyze the performance of your two N-Body implementations.

Modify your programs to measure the execution time of the parallel phase of execution. If you chose to parallelize the data generation phase, you may include that as well. Either way, make it clear exactly what you are measuring. Solaris's high-resolution timer gethrtime() is recommended. Do not use the shell's built-in time command.

Part A: Plot the speedups of Program 2, normalized to the serial version of n-body, for N=[1,2,4,8,16,32,64] threads with 512 bodies and 100 time steps. The value of Δt is irrelevant to studying scalability: with the number of time steps held constant, it affects only the length of simulated time, not the running time of the simulation itself. Thus you may choose any value you like. We recommend that you use multiple trials in conjunction with some averaging strategy. Describe the procedure you used.

Part B: Repeat Part A, but bind threads (see the processor_bind man page) to processors in the following orders and for N=[1,2,4,8,16,32,64]:

  • B-1: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
  • B-2: 0, 16, 32, 48, 1, 17, 33, 49, 2, 18, 34, 50, 3, 19, 35, 51, 4, 20, 36, 52, 5, 21, 37, 53, 6, 22, 38, 54, 7, 23, 39, 55, 8, 24, 40, 56, 9, 25, 41, 57, 10, 26, 42, 58, 11, 27, 43, 59, 12, 28, 44, 60, 13, 29, 45, 61, 14, 30, 46, 62, 15, 31, 47, 63
  • B-3: 0, 8, 16, 24, 32, 40, 48, 56, 1, 9, 17, 25, 33, 41, 49, 57, 2, 10, 18, 26, 34, 42, 50, 58, 3, 11, 19, 27, 35, 43, 51, 59, 4, 12, 20, 28, 36, 44, 52, 60, 5, 13, 21, 29, 37, 45, 53, 61, 6, 14, 22, 30, 38, 46, 54, 62, 7, 15, 23, 31, 39, 47, 55, 63

Obviously, you will not use all processors in all configurations. Plot the normalized speedup of B-1 through B-3 on a single graph. Comment on the shapes of the curves.

Note: Lots of users trying to bind threads at the same time leads to no forward progress for any of you.... Please plan to run your programs well ahead of the deadline.

Tips and Tricks

  • Start early.
  • Make use of the demo programs provided.
  • You can use /usr/platform/sun4v/sbin/prtdiag -v on malbec to learn many useful characteristics of your host machine.
  • Check out the pthreads (and OpenMP) examples here for help with syntax and environment setup.
  • Set up RSA authentication on malbec to save yourself some keystrokes. HowTo
  • When scrutinizing your program for correctness, remember that too large a time-step can dramatically affect your results.
  • The unique identifiers returned by pthread_self() do not necessarily range from 0 to the total number of threads.
  • If you parallelize the data generation phase and compare the results to your sequential program, even using the same random seed, why might the results differ?

What to Hand In

This assignment will be peer reviewed. Please bring two copies of your homework to lecture; you will give these to the two NEWLY ASSIGNED peer review members. Your answers to the discussion questions should be typed; handwritten notes are not acceptable. A tarball of your entire source code, including a Makefile and a README file, should be emailed to your peer group members before the beginning of lecture. The README should include 1) directions for compiling your code, 2) directions for running your code, and 3) any other comments. Use the subject line [CS/ECE 757] Homework 2 so that email filters work properly.

You must include:

  • A printout of the simulation phase of Program 1.
  • A printout of the parallel phase of Program 2.
  • Arguments for the correctness of Programs 1 and 2.
  • The plots as described in Problem 3a and 3b, including labels describing your data.

 