Homework 2 // Due at Lecture Friday, February 18, 2011
Perform this assignment on malbec.cs.wisc.edu,
a 64-thread Sun UltraSparc-T2, where we have activated a CS account for you.
(Unless you already had an account, you should use this machine
only for CS757 homework assignments or your course project.
Unless you have a longer-term affiliation with CS, this account and its storage
will be removed after the end of the semester.)
If you wish to learn more about malbec,
run /usr/platform/sun4v/sbin/prtdiag -v on malbec.
Malbec is a chip multiprocessor, with 8 8-way multi-threaded cores.
Each core has its own L1 cache and executes the threads in approximate round-robin order. The L2 cache is shared by all 8 cores (and 64 threads).
The architecture manual and this
tuning document describe additional details of this machine.
You should do this assignment alone. No late assignments.
Purpose
The purpose of this assignment is to give you experience writing a shared memory
program with pthreads to build your understanding of shared memory programming,
in general, and pthreads, in particular.
Programming Environment: Pthreads
The methods of communication and synchronization are two characteristics that
define a parallel programming model.
In a shared-memory model,
all threads access a single
address space (hence the name "shared memory"). Thus, communication
occurs implicitly, because whenever one thread writes to an address, the update
is immediately visible to all others. On the other hand, this introduces the
need for explicit synchronization to avoid races when multiple
threads access data at the same address.
Different systems support threads in a variety of ways. For this assignment,
you will use Posix Threads, or pthreads for short. One advantage of this
particular implementation is its portability to many different systems. Do
man pthreads from malbec to see a summary of pthreads functions. You
can find details for any listed function by doing man function_name.
Note that the man page also includes information about Solaris threads. For
this assignment, you should use Posix threads, not Solaris threads.
All threaded programs begin with a single thread. When you reach a section
you wish to do in parallel, you will "fork" one or more threads to
share the work, telling each thread in which function to begin execution. Keep
in mind that the original, "main" thread continues to execute the code
following the fork. Thus, the thread model also provides "join"
functionality, which lets you tell the main thread to first wait for a child to
complete, and then merge with it.
It is likely that you will have many threads execute the same routine, and
that you will want each thread to have its own private set of local variables.
With pthreads, this occurs automatically. Only global variables are truly shared.
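For example, the minimal sketch below (our own toy example, not part of the assignment) forks four threads, lets each one write to a shared global array using its own private local variable, and then joins them:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4                 /* illustrative thread count (our choice) */

    double shared_total[NTHREADS];     /* global: shared by every thread */

    void *worker(void *arg) {
        long id = (long)arg;           /* locals such as 'id' are private to each thread */
        shared_total[id] = id * 2.0;
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        long i;

        /* "fork": create NTHREADS threads, each beginning execution in worker() */
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);

        /* "join": the main thread waits for each child to complete */
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        for (i = 0; i < NTHREADS; i++)
            printf("thread %ld wrote %f\n", i, shared_total[i]);
        return 0;
    }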
When you have multiple threads working on the same data, you may require
synchronization for correctness. Pthreads directly supports two techniques:
mutual exclusion (locks) and condition variables. A third technique you may
find useful is barriers. A barrier makes all threads wait for each other to
arrive at a certain point in the code before any of them can continue. While
pthreads offers no direct barrier support, you should be able to build a barrier
of your own out of the mutual exclusion primitive.
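For instance, one possible barrier (a sketch of our own; it uses a mutex together with a condition variable, though a spin-based barrier built from a lock alone is also possible) might look like:

    #include <pthread.h>

    /* A minimal counting barrier (names and structure are ours). */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  all_here;
        int             total;   /* number of threads that must arrive      */
        int             count;   /* number that have arrived this round     */
        int             round;   /* generation counter, avoids early wakeup */
    } barrier_t;

    void barrier_init(barrier_t *b, int nthreads) {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->all_here, NULL);
        b->total = nthreads;
        b->count = 0;
        b->round = 0;
    }

    void barrier_wait(barrier_t *b) {
        int my_round;

        pthread_mutex_lock(&b->lock);
        my_round = b->round;
        if (++b->count == b->total) {        /* last thread to arrive this round */
            b->count = 0;
            b->round++;                      /* open the barrier, start a new round */
            pthread_cond_broadcast(&b->all_here);
        } else {
            while (b->round == my_round)     /* loop guards against spurious wakeups */
                pthread_cond_wait(&b->all_here, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }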
A final issue with threads is how they are mapped to processors. Obviously,
we can have more threads than processors, but someone must then time multiplex
the threads onto the processors. Since this is probably more complexity than
you want to deal with, we recommend that you directly bind each thread to a
particular processor. More precisely, on Solaris we actually bind threads to
light-weight processes (LWP's), but all you really need to know is that by
binding threads, you transfer responsibility for contention management of
threads on processors to the operating system.
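On Solaris, a thread can bind its own LWP with processor_bind(); a minimal sketch (the helper name and error handling are ours) might look like:

    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>

    /* Sketch: each thread calls this once after it starts, passing the CPU id
     * it should run on.  P_LWPID with P_MYID means "the LWP making this call";
     * when each pthread runs on its own LWP, this effectively pins the thread. */
    void bind_self_to_cpu(processorid_t cpu)
    {
        if (processor_bind(P_LWPID, P_MYID, cpu, NULL) != 0)
            perror("processor_bind");
    }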
Note that all source files that call the pthread API must include
the pthread.h header file. Malbec has both the Sun cc/c++ compilers
and the GNU gcc/g++ compilers. You may use whichever you choose.
When compiling with cc/c++, be sure to specify the -mt compile
flag and the -lpthread link flag (note that it's ok to pass
both flags during compilation). When using gcc or g++, you
will need to use the -lpthread link flag.
For more information:
- The pthreads example program demonstrated in class is available for review
here: tar.gz or zip.
- This
tutorial from Lawrence Livermore is also an excellent reference.
Programming Task: N-Body Simulation
An n-body simulation calculates the gravitational effects of the masses of n
bodies on each other's positions and velocities. The final values are generated
by incrementally updating the bodies over many small time-steps. This involves
calculating the pairwise force exerted on each particle by all other particles,
an O(n^2) operation. For simplicity, we will model the bodies in a
two-dimensional space.
The physics.
We review the equations governing the motion of the particles according to
Newton's laws of motion and gravitation. Don't worry if your physics is a bit
rusty; all of the necessary formulas are included below. We already know each
particle's position (r_x, r_y) and velocity
(v_x, v_y). To model the dynamics of the system,
we must determine the net force exerted on each particle.
- Pairwise force. Newton's law of universal gravitation asserts that
the strength of the gravitational force between two particles is given by
the product of their masses divided by the square of the distance
between them, scaled by the gravitational constant G, which is 6.67 × 10^-11
N m^2 / kg^2.
The pull of one particle towards another acts along the line between them.
Since we will be using Cartesian coordinates to represent the position of
a particle, it is convenient to break the force up into its x and y components
(F_x, F_y); the relevant formulas are collected after this list.
- Net force. The principle of superposition says that
the net force acting on a particle in the x or y direction is the sum
of the pairwise forces acting on the particle in that direction.
- Acceleration. Newton's second law of motion postulates that
the accelerations in the x and y directions are given by:
a_x = F_x / m, a_y = F_y / m.
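For reference, the quantities above can be written out in one place (our notation: m_1 and m_2 are the two masses, Δx and Δy the differences in their coordinates, and d the distance between them):

    F = \frac{G \, m_1 m_2}{d^2}, \qquad d = \sqrt{\Delta x^2 + \Delta y^2}

    F_x = F \, \frac{\Delta x}{d}, \qquad F_y = F \, \frac{\Delta y}{d}

    F_x^{net} = \sum F_x, \qquad F_y^{net} = \sum F_y \quad \text{(summed over all other particles)}

    a_x = F_x^{net} / m, \qquad a_y = F_y^{net} / m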
The numerics.
We use the leapfrog finite difference approximation scheme
to numerically integrate the above equations: this is the
basis for most astrophysical simulations of gravitational systems.
In the leapfrog scheme, we discretize time, and update the time
variable t in increments of the time quantum Δt.
We maintain the position and velocity of each particle, but they are half a
time step out of phase (which explains the name leapfrog). The steps below illustrate
how to evolve the positions and velocities of the particles.
For each particle:
- Calculate the net force acting on it at time t using Newton's
law of gravitation and the principle of superposition.
- Calculate its acceleration (a_x, a_y) at time t
using its force at time t and Newton's second law of motion.
- Calculate its velocity at time t + Δt / 2 by using
its acceleration at time t and its velocity (v_x, v_y)
at time t - Δt / 2.
Assume that the acceleration remains constant in this interval, so that the updated
velocity is: v_x = v_x + Δt a_x, v_y = v_y + Δt a_y.
- Calculate its position at time t + Δt by using
its velocity at time t + Δt / 2 and its position at time t.
Assume that the velocity remains constant in the interval from
t to t + Δt, so that the resulting position is given by:
r_x = r_x + Δt v_x, r_y = r_y + Δt v_y.
Note that because of the leapfrog scheme, the constant velocity we are using
is the one estimated at the middle of the interval rather than either of the
endpoints.
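Putting the steps together, one serial time step might look like the sketch below (a rough outline under our own variable names: rx, ry hold positions at time t, vx, vy hold velocities at time t - Δt/2, and ax, ay are scratch space for the accelerations):

    #include <math.h>

    void leapfrog_step(int n, double *rx, double *ry, double *vx, double *vy,
                       double *ax, double *ay, const double *m,
                       double G, double dt)
    {
        int i, j;

        /* Steps 1-2: net force and acceleration for every body at time t,
         * computed before any positions change. */
        for (i = 0; i < n; i++) {
            double fx = 0.0, fy = 0.0;
            for (j = 0; j < n; j++) {
                double dx, dy, d2, d, f;
                if (j == i)
                    continue;
                dx = rx[j] - rx[i];
                dy = ry[j] - ry[i];
                d2 = dx * dx + dy * dy;
                d  = sqrt(d2);
                f  = G * m[i] * m[j] / d2;   /* force magnitude              */
                fx += f * dx / d;            /* x and y components of force  */
                fy += f * dy / d;
            }
            ax[i] = fx / m[i];
            ay[i] = fy / m[i];
        }

        /* Step 3: velocities advance to t + dt/2; step 4: positions to t + dt. */
        for (i = 0; i < n; i++) {
            vx[i] += dt * ax[i];
            vy[i] += dt * ay[i];
            rx[i] += dt * vx[i];
            ry[i] += dt * vy[i];
        }
    }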
As you would expect, the simulation is more accurate when Δt is
very small, but this comes at the price of more computation.
Problem 1: Write Sequential N-Body
Write a single-threaded solution to the n-body problem described above. Your
program should take as input the number of particles, the length of the time-step
Δt, and the number of these time-steps to execute. Your final
version should randomly generate the mass, x and y Cartesian
coordinates, and x and y velocity components for each
particle, then simulate for the specified number of time-steps.
For the randomly generated values, you may use any range that you like.
However, if you have very large masses which happen to be initialized to
positions that are very close to each other, you may encounter overflow. Since
the focus of this assignment is parallel programming and scalability, you will
not be graded on elegantly handling or avoiding such situations.
You are required to make an argument that your n-body implementation is
correct. A good way to do this is to initialize the bodies to a special-case
starting condition, and then show that after a number of time steps the state of
the system exhibits symmetry or some other expected property. You need not prove
your implementation's correctness in the literal sense. However, please
annotate any simulation outputs clearly.
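As a concrete illustration (our own suggestion, not a requirement), two equal masses started in a mutual circular orbit make an easy-to-check test case: at every step the two bodies should remain mirror images of each other through the origin.

    #include <math.h>

    /* One possible special-case starting condition (our suggestion): two equal
     * masses on the x-axis with equal and opposite y-velocities chosen for a
     * circular orbit about the origin.  The speed comes from balancing gravity
     * against the centripetal force: v = sqrt(G * m / (4 * r)). */
    void init_two_body_test(double *rx, double *ry, double *vx, double *vy,
                            double *m, double G)
    {
        double mass = 1.0e10;                      /* illustrative values              */
        double r    = 1.0;                         /* each body sits r from the origin */
        double v    = sqrt(G * mass / (4.0 * r));  /* circular-orbit speed             */

        m[0] = mass;  rx[0] =  r;  ry[0] = 0.0;  vx[0] = 0.0;  vy[0] =  v;
        m[1] = mass;  rx[1] = -r;  ry[1] = 0.0;  vx[1] = 0.0;  vy[1] = -v;
    }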
Problem 2: Write Pthreads N-Body
For this problem, you will use pthreads to parallelize your program from
Problem 1. This program should take an additional command-line argument: the
number of threads to use in the parallel section.
It is only required that you parallelize the main portion of the simulation,
but parallelizing the data generation phase of your n-body program is also
worthwhile. You will not be penalized if you choose not to parallelize the
initialization phase.
To ensure that you have some practice with mutual exclusion, your parallel
n-body program must use at least one lock. One way to satisfy this requirement
is to implement your own barrier.
For simplicity, you may assume that the number of bodies is evenly divisible
by the number of threads.
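Under that assumption, a simple static partition suffices; for instance (a sketch with our own helper name), thread t can own a contiguous block of bodies:

    /* Sketch: with n bodies, nthreads threads, and n evenly divisible by
     * nthreads, thread t owns the contiguous block of bodies [*first, *last). */
    void my_range(int t, int nthreads, int n, int *first, int *last)
    {
        int chunk = n / nthreads;

        *first = t * chunk;
        *last  = *first + chunk;
    }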
Problem 3: Analysis of N-Body
In this section, you will analyze the performance of your two N-Body
implementations.
Modify your programs to measure the execution time of the parallel
phase of execution. If you chose to parallelize the data generation phase,
you may include that as well. Either way, make it clear exactly what you are
measuring. Use of Solaris's gethrtime() is recommended. Do not use the
shell's built-in time command.
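A sketch of the kind of measurement intended (run_simulation() is a hypothetical stand-in for whatever you are timing):

    #include <sys/time.h>   /* declares hrtime_t and gethrtime() on Solaris */
    #include <stdio.h>

    extern void run_simulation(void);   /* hypothetical stand-in for the timed phase */

    void timed_run(void)
    {
        hrtime_t start, end;

        start = gethrtime();            /* nanoseconds from a monotonic timer   */
        run_simulation();               /* fork threads, simulate, join threads */
        end   = gethrtime();

        printf("parallel phase: %.6f s\n", (double)(end - start) / 1e9);
    }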
Part A: Plot the normalized (versus the serial version of n-body) speedups of Program
2 on N=[1,2,4,8,16,32,64] threads for 512 bodies and 100 time steps. The value
of dt is irrelevant to studying scalability: with the number of time steps held
constant, it only affects the length of time simulated, not the duration of the
simulation itself. Thus you may choose any value you like. We recommend that
you use multiple trials in conjunction with some averaging strategy. Describe
the procedure you used.
Part B: Repeat Part A, but bind threads (see the processor_bind man page) to
processors in the following orders and for N=[1,2,4,8,16,32,64]:
- B-1: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
- B-2: 0, 16, 32, 48, 1, 17, 33, 49, 2, 18, 34, 50, 3, 19, 35, 51, 4, 20, 36, 52, 5, 21, 37, 53, 6, 22, 38, 54, 7, 23, 39, 55, 8, 24, 40, 56, 9, 25, 41, 57, 10, 26, 42, 58, 11, 27, 43, 59, 12, 28, 44, 60, 13, 29, 45, 61, 14, 30, 46, 62, 15, 31, 47, 63
- B-3: 0, 8, 16, 24, 32, 40, 48, 56, 1, 9, 17, 25, 33, 41, 49, 57, 2, 10, 18, 26, 34, 42, 50, 58, 3, 11, 19, 27, 35, 43, 51, 59, 4, 12, 20, 28, 36, 44, 52, 60, 5, 13, 21, 29, 37, 45, 53, 61, 6, 14, 22, 30, 38, 46, 54, 62, 7, 15, 23, 31, 39, 47, 55, 63
Obviously, you will not use all processors in all configurations. Plot the normalized speedup of B-1 through
B-3 on a single graph. Comment on the shapes of the curves.
Note: If many users try to bind threads at the same time, none of you will make forward progress. Please plan to run your programs well ahead of the deadline.
Tips and Tricks
- Start early.
- Make use of the demo programs provided.
- You can use /usr/platform/sun4v/sbin/prtdiag -v on malbec to learn
many useful characteristics of your host machine.
- Check out the pthreads (and OpenMP) examples here
for help with syntax and environment setup.
- Set up RSA authentication on malbec to save yourself some keystrokes. HowTo
- When scrutinizing your program for correctness, remember that too large a
time-step can dramatically affect your results.
- The unique identifiers returned by pthread_self() do not necessarily range
from 0 to the total number of threads.
- If you parallelize the data generation phase and compare the results to your
sequential program, even using the same random seed, why might the results differ?
What to Hand In
This assignment will be peer reviewed. Please bring two copies of your homework to lecture; you will give these to the two NEWLY ASSIGNED peer review members.
Your answers to the discussion questions should be typed up; handwritten
notes are not acceptable.
A tarball of your entire source code, including a Makefile and a README file,
should be emailed to your peer group members before the beginning of lecture.
The README should include 1) directions for compiling your code, 2) directions for
running your code, and 3) any other comments.
Use subject line [CS/ECE 757] Homework 2,
so that email filters work properly.
You must include:
- A printout of the simulation phase of Program 1.
- A printout of the parallel phase of Program 2.
- Arguments for the correctness of Programs 1 and 2.
- The plots as described in Problem 3a and 3b, including labels describing your data.