Homework 3 // Due at Lecture Monday, October 5, 2009
You will perform this assignment on two architectures: eight-core,
eight-way threaded (64 threads total) Sun UltraSPARC-T2
(gamay.cs.wisc.edu), and dual-socket, quad-core, two-way
threaded (16 threads total) Intel Nehalem processors
(ale-01.cs.wisc.edu and ale-02.cs.wisc.edu).
As always, if you wish to learn more about gamay, run
/usr/platform/sun4v/sbin/prtdiag -v on gamay. The architecture
manual and this tuning
document describe additional details of this machine.
You may find similar information on the x86-64 Linux machines by examining the /proc/cpuinfo file.
You should do this assignment alone. No late assignments.
Purpose
The purpose of this assignment is to give you experience writing a
shared-memory program with OpenMP and Intel's Threading Building Blocks
to build your understanding of shared-memory programming, in general,
and OpenMP/TBB, in particular.
Programming Environment 1: OpenMP
OpenMP is a shared-memory programming model in which the programmer
annotates (mostly) serial code with directives that tell the compiler
and runtime how to parallelize it. OpenMP makes extensive use of
compiler directives, in addition to its own runtime library.
If you have not already done so, it is suggested that you review
the OpenMP references provided in the Reading List.
OpenMP uses a Fork/Join model similar to that of P-Threads, but
Fork/Join events are more frequent in OpenMP than in most P-Thread
based programs. Most OpenMP programs consist of
interleaved parallel and sequential sections, with "Fork" events
occurring at the start of each parallel section, and "Join" events
at the end of each parallel section. In non-parallel sections,
only the "master thread" executes.
In order to use the OpenMP environment on gamay, students should use the cc compiler.
Any source files that employ OpenMP directives or
library calls must include the omp.h header file. Additionally,
the flag -xopenmp must be passed to cc for both compilation and
linking. It is recommended that students also pass the -xO3
flag to cc, to avoid optimization-related warnings.
In order to use the OpenMP environment on the ale nodes, you should compile your
program with gcc or g++ and provide the -fopenmp flag
for both the compile and link steps.
A set of OpenMP example programs is available for review here.
Programming Environment 2: TBB
Intel's Threading Building
Blocks (TBB) package provides a host of useful services to the
parallel programmer, including some of the same loop parallelization
options provided by OpenMP (with different syntax, of course). Intel
provides a handy Getting Started Guide that is available at the
link above under the Documentation tab, which will show you
everything you need to know about TBB for the purposes of this
assignment.
A tutorial on TBB's loop parallelization is available here. It will guide you through TBB setup
and a brief illustrative example.
Programming Task: Ocean Simulation
OCEAN is a simulation of large-scale sea conditions from the SPLASH
benchmark suite. It is a scientific workload used for performance
evaluation of parallel machines. For this assignment, you will write
three scaled-down versions of the Ocean benchmark.
Ocean is briefly described in Woo et al. on the Reading List. The
scaled-down version you will implement is described below.
Our version of Ocean will simulate water temperatures using a
large grid of floating-point values over a fixed number of time
steps (use type float, or equivalent, for
each cell in the ocean). At each time step, a given grid location
will be averaged with its immediate north, south, east, and west
neighbors to determine the value of that grid location in the next
time step. The edges of the grid do not participate in the averaging
process (they contribute a value, but their value does not change).
Thus, Ocean will converge (given sufficient runtime) to a gradient of
the water temperatures on the perimeter of the grid.
This averaging pattern is repeated for each grid location at each time
step until the simulation terminates.
Note that simple iteration over the grid, top-to-bottom, left-to-right,
will cause the simulation to skew toward the top-left corner of the grid,
if each location is updated on-the-fly. A simple way to counter this
effect is to maintain a shadow copy of the grid, and swap between the
two copies on every time step.
Problem 1: Write Sequential Ocean
Write a single-threaded (sequential) version of Ocean as described
above. This version of Ocean must take three arguments: the
x-dimension of the grid, the y-dimension of the grid, and the number
of time-steps. You may assume for simplicity that all grid sizes
will be powers of two plus two (the interior region of the grid that is
modified will therefore have power-of-two dimensions).
You are required to make an argument that your implementation of
Ocean is correct. A good way to do this is to initialize the grid
to a special-case starting condition, and then show that after a number
of time steps the state of the grid exhibits symmetry or some other
expected property. You need not prove your implementation's
correctness in the literal sense. However, please annotate any
simulation outputs clearly.
Your final sequential version of Ocean should randomly initialize a
grid of the requested size, then perform simulation for the specified
number of time steps.
It is in your best interests to write code that is friendly to
both gcc and g++, as the OpenMP environment is C-based, whereas
TBB is C++-based. Your code should compile cleanly on gamay
AND on the ale machines.
Problem 2: Write OpenMP Ocean
You will implement two parallelized versions of Ocean, described
below. The first uses OpenMP to parallelize your source from Problem 1.
As with Problem 1, you are required to make an argument for
the correctness of your implementation (it is acceptable to use the
same argument, provided it is still applicable).
For this problem, you will use OpenMP directives to parallelize your
program from Problem 1. This program should take an additional
command-line argument: the number of threads to use in parallel
sections.
For this implementation, you are encouraged to experiment with the
schedule clause on loops that you parallelize
with OpenMP. This clause influences how the OpenMP runtime assigns loop iterations to threads.
It is only required that you parallelize the main portion of the
simulation, but parallelizing the initialization phase of Ocean is
also worthwhile. You will not be penalized if you choose not to
parallelize the initialization phase.
We will compare the scalability and
raw performance of this program to the TBB implementation of
Problem 3.
Problem 3: Write TBB Ocean
Starting again with your code from Problem 1, re-parallelize your code
with Intel's Threading Building Blocks (TBB). We will compare the
scalability and raw performance of this program to the
OpenMP implementation of Problem 2. This program should take the same
parameters as the program from Problem 2, including the thread count.
As with Problem 2, you are encouraged to use this assignment
to explore the features of TBB, though you will only need TBB's
loop parallelization constructs.
Problem 4: Analysis of Ocean
In this section, we will analyze the performance of our three Ocean
implementations. We will use a fixed number of time steps (100).
Modify your programs to measure the execution time of the
parallel phase of execution. Use of gethrtime() (Solaris), gettimeofday(),
or clock_gettime() is recommended. Do not use the shell's built-in
time command.
Plot the speedups of Programs 2 and 3, normalized to the serial version
of Ocean, for N=[1,2,3,4,5,6,7,8] on a 514x514 ocean; measure the
OpenMP version on both the gamay and ale platforms, and the TBB version
on ale only. Repeat (on the same graph) for an ocean sized to 1026x1026.
Plot the total runtime of Programs 2 and 3 on N=[1,2,4,8,16,32,64] for a 514x514 ocean. Repeat (on the same graph) for an ocean sized to 1026x1026. The
N=1 case should be the serial version of Ocean, not the parallel version using only 1 thread.
Problem 5: Questions (Submission Credit)
- Which configuration had the best overall performance? Comment on the performance differences between the OpenMP implementation of Problem 2 and the TBB implementation of Problem 3.
- Which configuration had the best overall scalability?
Comment on the scalability differences between the
OpenMP implementation of Problem 2 and the TBB implementation
of Problem 3.
- Comment on the overall performance as well as the
scalability of the Sun Niagara 2 (gamay) platform
vs. the Intel Nehalem (ale) platform.
- Comment on which programming environment you prefer.
Tips and Tricks
Start early.
Set up RSA authentication on gamay to save yourself some keystrokes (see the HowTo).
Check out the pthreads and OpenMP examples here
for help with syntax and environment setup.
Check out the TBB example here.
You can specify the number of threads that TBB should use via the constructor of the task_scheduler_init object.
Make use of the demo programs provided.
What to Hand In
Please turn this homework in on paper at the beginning of lecture.
Your code listings for Program 2 and Program 3.
A description of how you parallelized Program 2 and Program 3.
A printout of the simulation phase of Program 1.
A printout of the parallel phase of Program 2.
A printout of the parallel phase of Program 3.
Arguments for correctness of Programs 1, 2, and 3.
The plots as described in Problem 4.
Answers to questions in Problem 5.
Important: Include your name on EVERY page.