Homework 2 // Due at Lecture Tuesday, September 25, 2007
You will perform this assignment on two architectures. The first you have already used in the previous
homeworks (SPARCv9/Niagara), typified by chianti.cs.wisc.edu,
a 32-thread Sun UltraSparc-T1, where we have activated a CS account for you.
(Unless you already had an account, you should use this machine
only for CS758 homework assignments or your course project.
Unless you have a longer-term affiliation with CS, this account and its storage
will be removed after the end of the semester.)
If you wish to learn more about chianti,
run /usr/platform/sun4v/sbin/prtdiag -v on chianti.
Chianti is a chip multiprocessor, with 8 4-way multi-threaded cores.
Each core has its own L1 cache and executes the threads in approximate round-robin order. The L2 cache is shared by all 8 cores (and 32 threads).
The architecture manual and this
tuning document describe additional details of this machine.
You will also use parallel x86-based hardware, namely Intel(R) Clovertown-based systems (essentially
dual Core 2 Quad machines). You will use clover-01.cs.wisc.edu and
clover-02.cs.wisc.edu for the x86-based components of this assignment. Like chianti, the
clover machines are CMPs, but use aggressive out-of-order cores, have only four cores per chip
(two chips total), and don't use multithreaded cores. Each core has a private L1/L2 cache hierarchy, and
coherence between cores is routed through an external Northbridge controller.
You should do this assignment alone. No late assignments.
Purpose
The purpose of this assignment is to give you experience writing
a shared-memory program with OpenMP and Intel's Thread Building Blocks to build your understanding
of shared-memory programming, in general, and OpenMP/TBB, in particular.
Programming Environment 1: OpenMP
OpenMP is a shared-memory programming model that attempts to
automatically parallelize code that was written in a (mostly) serial
fashion. OpenMP makes extensive use of compiler directives and
optimizations, in addition to its own runtime library.
If you have not already done so, it is suggested that you review
the OpenMP references provided in the Reading List.
OpenMP uses a Fork/Join model similar to that of P-Threads, but
Fork/Join events are more frequent in OpenMP than in most P-Thread
based programs. Most OpenMP programs consist of
interleaved parallel and sequential sections, with "Fork" events
occurring at the start of each parallel section, and "Join" events
at the end of each parallel section. In non-parallel sections,
only the "master thread" executes.
In order to use the OpenMP environment on Chianti, students are
required to use the cc compiler (not GCC or G++). Any source files
that employ OpenMP directives or
library calls must include the omp.h header file. Additionally,
the flag -xopenmp must be passed to cc for both compilation and
linking. It is recommended that students also pass the -xO3
flag to cc, to avoid optimization-related warnings.
The OpenMP example programs demonstrated in class are available
for review here.
Programming Environment 2: TBB
Intel's Thread Building Blocks (TBB) package
provides a host of useful services to the parallel programmer, including some of the same
loop parallelization options provided by OpenMP (with different syntax, of course). Intel provides
a handy Getting Started Guide that is available at the link above under the Documentation
tab, which will show you everything you need to know about TBB for the purposes of this assignment.
A tutorial on TBB's loop parallelization is available here.
It will guide you through TBB setup and a brief illustrative example.
Programming Task: Ocean Simulation
OCEAN is a simulation of large-scale sea conditions from the SPLASH
benchmark suite. It is a scientific workload used for performance
evaluation of parallel machines. For this assignment, you will write
three scaled-down versions of the Ocean benchmark.
Ocean is briefly described in Woo et al. on the Reading List. The
scaled-down version you will implement is described below.
Our version of Ocean will simulate water temperatures using a large
grid of FIXED-point values over a fixed number of time steps (use
type int, or equivalent, for each cell in the ocean). At
each time step, a given grid location will be averaged with its
immediate north, south, east, and west neighbors
to determine the value of that grid location in the next time step.
The edges of the grid do not participate in the averaging process (they
contribute a value, but their value does not change). Thus, Ocean will
converge (given sufficient runtime) to a gradient of the water temperatures
on the perimeter of the grid.
This averaging pattern is repeated for each grid location at each time
step until the simulation terminates.
Note that simple in-place iteration over the grid, top-to-bottom, left-to-right,
will cause the simulation to skew toward the top-left corner of the grid,
because each location is updated on-the-fly. A simple way to counter this
effect is to maintain a shadow copy of the grid, and swap between the
two copies on every time step.
Problem 1: Write Sequential Ocean
Write a single-threaded (sequential) version of Ocean as described
above. This version of Ocean must take three arguments: the
x-dimension of the grid, the y-dimension of the grid, and the number
of time-steps. You may assume for simplicity that all grid sizes
will be powers of two plus two (the area of the grid that will be
modified will therefore be sized to powers of two).
You are required to make an argument that your implementation of
Ocean is correct. A good way to do this is to initialize the grid
to a special-case starting condition, and then show that after a number
of time steps the state of the grid exhibits symmetry or some other
expected property. You need not prove your implementation's
correctness in the literal sense. However, please annotate any
simulation outputs clearly.
Your final sequential version of Ocean should randomly initialize a
grid of the requested size, then perform simulation for the specified
number of time steps.
It is in your best interests to write code that compiles as both C and C++,
as the OpenMP environment is C-based, while TBB is C++-based. Your code
should compile cleanly on chianti AND on the clover machines.
Problem 2: Write OpenMP Ocean
You will implement two parallelized versions of Ocean, described
below. The first uses OpenMP to parallelize your source from Problem 1.
As with Problem 1, you are required to make an argument for
the correctness of your implementation (it is acceptable to use the
same argument, provided it is still applicable).
For this problem, you will use OpenMP directives to parallelize your
program from Problem 1. This program should take an additional
command-line argument: the number of threads to use in parallel
sections.
For this implementation, you are encouraged to experiment with the
schedule clause on loops that you parallelize
with OpenMP. This clause influences how the OpenMP runtime assigns loop iterations to threads.
It is only required that you parallelize the main portion of the
simulation, but parallelizing the initialization phase of Ocean is
also worthwhile. You will not be penalized if you choose not to
parallelize the initialization phase.
Implement this program in C on chianti. We will compare the scalability and
raw performance of this program to the TBB implementation of Problem 3.
Problem 3: Write TBB Ocean
Starting again with your code from Problem 1, re-parallelize your code with Intel's
Thread Building Blocks. Implement this program in C++ on clover-01 or clover-02.
We will compare the scalability and raw performance of this program to the OpenMP
implementation of Problem 2. This program should take the same parameters as the program from
problem 1.
As with Problem 2, you are encouraged to use this assignment to explore the features of TBB,
though you will only need TBB's loop parallelization constructs.
Problem 4: Analysis of Ocean
In this section, we will analyze the performance of our three Ocean
implementations. We will use a fixed number of time steps (100).
Modify your programs to measure the execution time of the
parallel phase of execution. Solaris's gethrtime() (on chianti) and the x86 rdtsc instruction
(on the clover machines) are recommended. Do not use the shell's built-in time command.
Plot the normalized (versus the serial version of Ocean on the respective platforms) speedups of
Programs 2 and 3 on N=[1,2,3,4,5,6,7,8] for a 514x514 ocean. Repeat (on the same graph) for an ocean sized to 1026x1026.
Plot the total runtime of Programs 2 and 3 on N=[1,2,4,8,16,32] for a 514x514 ocean. Repeat (on the same graph) for an ocean sized to 1026x1026. The
N=1 case should be the serial version of Ocean, not the parallel version using only 1 thread.
Problem 5: Questions (Submission Credit)
- Which configuration had the best overall performance? Comment on the performance differences between the OpenMP/SPARC implementation of Problem 2 and the TBB/x86 implementation of Problem 3.
- Which configuration had the best overall scalability? Comment on the scalability differences between the OpenMP/SPARC implementation of Problem 2 and the TBB/x86 implementation of Problem 3.
- Comment on which programming environment you prefer.
Tips and Tricks
Start early.
Set up RSA authentication on chianti to save yourself some keystrokes. HowTo
Check out the pthreads and OpenMP examples here
for help with syntax and environment setup.
Check out the TBB example here.
You can specify the number of threads that TBB should use via the constructor of the task_scheduler_init object.
Make use of the demo programs provided.
You can use /usr/platform/sun4v/sbin/prtdiag -v on chianti to learn
many useful characteristics of your host machine.
What to Hand In
Please turn this homework in on paper at the beginning of lecture.
A description of how you parallelized Program 2 and Program 3.
A printout of the simulation phase of Program 1.
A printout of the parallel phase of Program 2.
A printout of the parallel phase of Program 3.
Arguments for correctness of Programs 1, 2, and 3.
The plots as described in Problem 4.
Answers to questions in Problem 5.
Important: Include your name on EVERY page.