Homework 3 // Due at Lecture Monday, October 5, 2009
You will perform this assignment on two architectures: eight-core,
eight-way threaded (64 threads total) Sun UltraSPARC-T2
(gamay.cs.wisc.edu), and dual-socket, quad-core, two-way
threaded (16 threads total) Intel Nehalem processors
(ale-01.cs.wisc.edu and ale-02.cs.wisc.edu).
As always, if you wish to learn more about gamay, run
/usr/platform/sun4v/sbin/prtdiag -v on gamay. The architecture
manual and this tuning
document describe additional details of this machine.
You may find similar information on the x86-64 Linux machines by examining the /proc/cpuinfo file.
You should do this assignment alone. No late assignments.
Purpose
The purpose of this assignment is to give you experience writing a
shared-memory program with OpenMP and Intel's Threading Building Blocks
to build your understanding of shared-memory programming, in general,
and OpenMP/TBB, in particular.
Programming Environment 1: OpenMP
OpenMP is a shared-memory programming model in which the programmer
annotates (mostly) serial code with directives that tell the compiler
and runtime how to parallelize it. OpenMP makes extensive use of
compiler directives, in addition to its own runtime library.
If you have not already done so, it is suggested that you review
the OpenMP references provided in the Reading List.
OpenMP uses a Fork/Join model similar to that of P-Threads, but
Fork/Join events are more frequent in OpenMP than in most P-Thread
based programs. Most OpenMP programs consist of
interleaved parallel and sequential sections, with "Fork" events
occurring at the start of each parallel section, and "Join" events
at the end of each parallel section. In non-parallel sections,
only the "master thread" executes.
In order to use the OpenMP environment on gamay, students should use the cc compiler.
Any source files that employ OpenMP directives or
library calls must include the omp.h header file. Additionally,
the flag -xopenmp must be passed to cc for both compilation and
linking. It is recommended that students also pass the -xO3
flag to cc, to avoid optimization-related warnings.
In order to use the OpenMP environment on the ale nodes, you should compile your
program with gcc or g++ and provide the -fopenmp flag
for both the compile and link steps.
A set of OpenMP example programs is available for review here.
Programming Environment 2: TBB
Intel's Threading Building
Blocks (TBB) package provides a host of useful services to the
parallel programmer, including some of the same loop parallelization
options provided by OpenMP (with different syntax, of course). Intel
provides a handy Getting Started Guide that is available at the
link above under the Documentation tab, which will show you
everything you need to know about TBB for the purposes of this
assignment.
A tutorial on TBB's loop parallelization is available here. It will guide you through TBB setup
and a brief illustrative example.
Programming Task: Ocean Simulation
OCEAN is a simulation of large-scale sea conditions from the SPLASH
benchmark suite. It is a scientific workload used for performance
evaluation of parallel machines. For this assignment, you will write
three scaled-down versions of the Ocean benchmark.
Ocean is briefly described in Woo et al. on the Reading List. The
scaled-down version you will implement is described below.
Our version of Ocean will simulate water temperatures using a
large grid of floating-point values over a fixed number of time
steps (use type float, or equivalent, for
each cell in the ocean). At each time step, a given grid location
will be averaged with its immediate north, south, east, and west
neighbors to determine the value of that grid location in the next
time step. The edges of the grid do not participate in the averaging
process (they contribute a value, but their value does not change).
Thus, Ocean will converge (given sufficient runtime) to a gradient of
the water temperatures on the perimeter of the grid.
This averaging pattern is repeated for each grid location at each time
step until the simulation terminates.
Note that simple iteration over the grid, top-to-bottom, left-to-right,
will cause the simulation to skew toward the top-left corner of the grid,
if each location is updated on-the-fly. A simple way to counter this
effect is to maintain a shadow copy of the grid, and swap between the
two copies on every time step.
Problem 1: Write Sequential Ocean
Write a single-threaded (sequential) version of Ocean as described
above. This version of Ocean must take three arguments: the
x-dimension of the grid, the y-dimension of the grid, and the number
of time-steps. You may assume for simplicity that all grid sizes
will be powers of two plus two (the interior region of the grid that is
modified will therefore have power-of-two dimensions).
You are required to make an argument that your implementation of
Ocean is correct. A good way to do this is to initialize the grid
to a special-case starting condition, and then show that after a number
of time steps the state of the grid exhibits symmetry or some other
expected property. You need not prove your implementation's
correctness in the literal sense. However, please annotate any
simulation outputs clearly.
Your final sequential version of Ocean should randomly initialize a
grid of the requested size, then perform simulation for the specified
number of time steps.
It is in your best interests to write code that is friendly to
both gcc and g++, as the OpenMP environment is C-based, whereas
TBB is C++-based. Your code should compile cleanly on gamay
AND on the ale machines.
Problem 2: Write OpenMP Ocean
You will implement two parallelized versions of Ocean, described
below. The first uses OpenMP to parallelize your source from Problem 1.
As with Problem 1, you are required to make an argument for
the correctness of your implementation (it is acceptable to use the
same argument, provided it is still applicable).
For this problem, you will use OpenMP directives to parallelize your
program from Problem 1. This program should take an additional
command-line argument: the number of threads to use in parallel
sections.
For this implementation, you are encouraged to experiment with the
schedule clause on loops that you parallelize
with OpenMP. This clause influences how the OpenMP runtime assigns loop iterations to threads.
It is only required that you parallelize the main portion of the
simulation, but parallelizing the initialization phase of Ocean is
also worthwhile. You will not be penalized if you choose not to
parallelize the initialization phase.
We will compare the scalability and
raw performance of this program to the TBB implementation of
Problem 3.
Problem 3: Write TBB Ocean
Starting again with your code from Problem 1, re-parallelize your code
with Intel's Threading Building Blocks (TBB). We will compare the
scalability and raw performance of this program to the
OpenMP implementation of Problem 2. This program should take the same
parameters as the program from Problem 2, including the thread count.
As with Problem 2, you are encouraged to use this assignment
to explore the features of TBB, though you will only need TBB's
loop parallelization constructs.
Problem 4: Analysis of Ocean
In this section, we will analyze the performance of our three Ocean
implementations. We will use a fixed number of time steps (100).
Modify your programs to measure the execution time of the
parallel phase of execution. Use of gethrtime() (Solaris), gettimeofday(),
or clock_gettime() is recommended. Do not use the shell's built-in
time command.
Plot the speedups of Programs 2 and 3, normalized to the serial version
of Ocean, for N=[1,2,3,4,5,6,7,8] on a 514x514 ocean; measure the
OpenMP version on both the gamay and ale platforms, and the TBB version
on ale only. Repeat (on the same graph) for an ocean sized to 1026x1026.
Plot the total runtime of Programs 2 and 3 on N=[1,2,4,8,16,32,64] for a 514x514 ocean. Repeat (on the same graph) for an ocean sized to 1026x1026. The
N=1 case should be the serial version of Ocean, not the parallel version using only 1 thread.
Problem 5: Questions (Submission Credit)
- Which configuration had the best overall performance? Comment on the performance differences between the OpenMP implementation of Problem 2 and the TBB implementation of Problem 3.
- Which configuration had the best overall scalability?
Comment on the scalability differences between the
OpenMP implementation of Problem 2 and the TBB implementation
of Problem 3.
- Comment on the overall performance as well as the
scalability of the Sun Niagara 2 (gamay) platform
vs. the Intel Nehalem (ale) platform.
- Comment on which programming environment you prefer.
Tips and Tricks
Start early.
Set up RSA authentication on gamay to save yourself some keystrokes (see the HowTo).
Check out the pthreads and OpenMP examples here
for help with syntax and environment setup.
Check out the TBB example here.
You can specify the number of threads that TBB should use via the constructor of the task_scheduler_init object.
Make use of the demo programs provided.
What to Hand In
Please turn this homework in on paper at the beginning of lecture.
Your code listings for Program 2 and Program 3.
A description of how you parallelized Program 2 and Program 3.
A printout of the simulation phase of Program 1.
A printout of the parallel phase of Program 2.
A printout of the parallel phase of Program 3.
Arguments for correctness of Programs 1, 2, and 3.
The plots as described in Problem 4.
Answers to questions in Problem 5.
Important: Include your name on EVERY page.