Homework 2 // Due at Lecture Tuesday, September 25, 2007
You will perform this assignment on two architectures. The first you have already used in the previous
homeworks (SPARCv9/Niagara), typified by chianti.cs.wisc.edu,
a 32-thread Sun UltraSparc-T1, where we have activated a CS account for you.
(Unless you already had an account, you should use this machine
only for CS758 homework assignments or your course project.
Unless you have a longer-term affiliation with CS, this account and its storage
will be removed after the end of the semester.)
If you wish to learn more about chianti,
run /usr/platform/sun4v/sbin/prtdiag -v on chianti.
Chianti is a chip multiprocessor, with 8 4-way multi-threaded cores.
Each core has its own L1 cache and executes the threads in approximate round-robin order. The L2 cache is shared by all 8 cores (and 32 threads).
The architecture manual and this
tuning document describe additional details of this machine.
You will also use parallel x86-based hardware, namely Intel(R) Clovertown-based systems (essentially
dual Core 2 Quad machines). You will use clover-01.cs.wisc.edu and
clover-02.cs.wisc.edu for the x86-based components of this assignment. Like chianti, the
clover machines are CMPs, but use aggressive out-of-order cores, have only four cores per chip
(two chips total), and don't use multithreaded cores. Each core has a private L1/L2 cache hierarchy, and
coherence between cores is routed through an external Northbridge controller.
You should do this assignment alone. No late assignments.
Purpose
The purpose of this assignment is to give you experience writing
a shared-memory program with OpenMP and Intel's Thread Building Blocks to build your understanding
of shared-memory programming, in general, and OpenMP/TBB, in particular.
Programming Environment 1: OpenMP
OpenMP is a shared-memory programming model that attempts to
automatically parallelize code that was written in a (mostly) serial
fashion. OpenMP makes extensive use of compiler directives and
optimizations, in addition to its own runtime library.
If you have not already done so, it is suggested that you review
the OpenMP references provided in the Reading List.
OpenMP uses a Fork/Join model similar to that of P-Threads, but
Fork/Join events are more frequent in OpenMP than in most P-Thread
based programs. Most OpenMP programs consist of
interleaved parallel and sequential sections, with "Fork" events
occurring at the start of each parallel section, and "Join" events
at the end of each parallel section. In non-parallel sections,
only the "master thread" executes.
In order to use the OpenMP environment on Chianti, students are
required to use the cc compiler (not GCC or G++). Any source files
that employ OpenMP directives or
library calls must include the omp.h header file. Additionally,
the flag -xopenmp must be passed to cc for both compilation and
linking. It is recommended that students also pass the -xO3
flag to cc, to avoid optimization-related warnings.
The OpenMP example programs demonstrated in class are available
for review here.
Programming Environment 2: TBB
Intel's Thread Building Blocks (TBB) package
provides a host of useful services to the parallel programmer, including some of the same
loop parallelization options provided by OpenMP (with different syntax, of course). Intel provides
a handy Getting Started Guide that is available at the link above under the Documentation
tab, which will show you everything you need to know about TBB for the purposes of this assignment.
A tutorial on TBB's loop parallelization is available here.
It will guide you through TBB setup and a brief illustrative example.
Programming Task: Ocean Simulation
OCEAN is a simulation of large-scale sea conditions from the SPLASH
benchmark suite. It is a scientific workload used for performance
evaluation of parallel machines. For this assignment, you will write
three scaled-down versions of the Ocean benchmark.
Ocean is briefly described in Woo et al. on the Reading List. The
scaled-down version you will implement is described below.
Our version of Ocean will simulate water temperatures using a large
grid of FIXED-point values over a fixed number of time steps (use
type int, or equivalent, for each cell in the ocean). At
each time step, a given grid location will be averaged with its
immediate north, south, east, and west neighbors
to determine the value of that grid location in the next time step.
The edges of the grid do not participate in the averaging process (they
contribute a value, but their value does not change). Thus, Ocean will
converge (given sufficient runtime) to a gradient of the water temperatures
on the perimeter of the grid.
This averaging pattern is repeated for each grid location at each time
step until the simulation terminates.
Note that simple in-place iteration over the grid, top-to-bottom, left-to-right,
will cause the simulation to skew toward the top-left corner of the grid,
because each location is updated on-the-fly. A simple way to counter this
effect is to maintain a shadow copy of the grid, and swap between the
two copies on every time step.
Problem 1: Write Sequential Ocean
Write a single-threaded (sequential) version of Ocean as described
above. This version of Ocean must take three arguments: the
x-dimension of the grid, the y-dimension of the grid, and the number
of time-steps. You may assume for simplicity that all grid sizes
will be powers of two plus two (the area of the grid that will be
modified will therefore be sized to powers of two).
You are required to make an argument that your implementation of
Ocean is correct. A good way to do this is to initialize the grid
to a special-case starting condition, and then show that after a number
of time steps the state of the grid exhibits symmetry or some other
expected property. You need not prove your implementation's
correctness in the literal sense. However, please annotate any
simulation outputs clearly.
Your final sequential version of Ocean should randomly initialize a
grid of the requested size, then perform simulation for the specified
number of time steps.
It is in your best interests to write code that compiles as both C and C++,
as the OpenMP environment is C-based, while TBB is C++-based. Your code
should compile cleanly on chianti AND on the clover machines.
Problem 2: Write OpenMP Ocean
You will implement two parallelized versions of Ocean, described
below. The first uses OpenMP to parallelize your source from Problem 1.
As with Problem 1, you are required to make an argument for
the correctness of your implementation (it is acceptable to use the
same argument, provided it is still applicable).
For this problem, you will use OpenMP directives to parallelize your
program from Problem 1. This program should take an additional
command-line argument: the number of threads to use in parallel
sections.
For this implementation, you are encouraged to experiment with the
schedule clause on loops that you parallelize
with OpenMP. This clause influences how the OpenMP runtime assigns loop iterations to threads.
It is only required that you parallelize the main portion of the
simulation, but parallelizing the initialization phase of Ocean is
also worthwhile. You will not be penalized if you choose not to
parallelize the initialization phase.
Implement this program in C on chianti. We will compare the scalability and
raw performance of this program to the TBB implementation of Problem 3.
Problem 3: Write TBB Ocean
Starting again with your code from Problem 1, re-parallelize your code with Intel's
Thread Building Blocks. Implement this program in C++ on clover-01 or clover-02.
We will compare the scalability and raw performance of this program to the OpenMP
implementation of Problem 2. This program should take the same parameters as the program from
problem 1.
As with Problem 2, you are encouraged to use this assignment to explore the features of TBB,
though you will only need TBB's loop parallelization constructs.
Problem 4: Analysis of Ocean
In this section, we will analyze the performance of our three Ocean
implementations. We will use a fixed number of time steps (100).
Modify your programs to measure the execution time of the
parallel phase of execution. Solaris's gethrtime() (on chianti) and the x86 rdtsc instruction
(on the clover machines) are recommended. Do not use the shell's built-in time command.
Plot the normalized (versus the serial version of Ocean on the respective platforms) speedups of
Programs 2 and 3 on N=[1,2,3,4,5,6,7,8] for a 514x514 ocean. Repeat (on the same graph) for an ocean sized to 1026x1026.
Plot the total runtime of Programs 2 and 3 on N=[1,2,4,8,16,32] for a 514x514 ocean. Repeat (on the same graph) for an ocean sized to 1026x1026. The
N=1 case should be the serial version of Ocean, not the parallel version using only 1 thread.
Problem 5: Questions (Submission Credit)
- Which configuration had the best overall performance? Comment on the performance differences between the OpenMP/SPARC implementation of Problem 2 and the TBB/x86 implementation of Problem 3.
- Which configuration had the best overall scalability? Comment on the scalability differences between the OpenMP/SPARC implementation of Problem 2 and the TBB/x86 implementation of Problem 3.
- Comment on which programming environment you prefer.
Tips and Tricks
Start early.
Set up RSA authentication on chianti to save yourself some keystrokes. HowTo
Check out the pthreads and OpenMP examples here
for help with syntax and environment setup.
Check out the TBB example here.
You can specify the number of threads that TBB should use via the constructor of the task_scheduler_init object.
Make use of the demo programs provided.
You can use /usr/platform/sun4v/sbin/prtdiag -v on chianti to learn
many useful characteristics of your host machine.
What to Hand In
Please turn this homework in on paper at the beginning of lecture.
A description of how you parallelized Program 2 and Program 3.
A printout of the simulation phase of Program 1.
A printout of the parallel phase of Program 2.
A printout of the parallel phase of Program 3.
Arguments for correctness of Programs 1, 2, and 3.
The plots as described in Problem 4.
Answers to questions in Problem 5.
Important: Include your name on EVERY page.