UW-Madison
Computer Sciences Dept.

CS 757 Computer Architecture II Spring 2011 Section 1
Instructor David A. Wood
URL: http://www.cs.wisc.edu/~david/courses/cs757/Spring2011/

Homework 1 // Due at Lecture Monday Feb 7

You should do this assignment alone. No late assignments.

The purpose of this assignment is to give you experience writing simple parallel programs using OpenMP (shared memory) and MPI (message passing). This exercise is intended to provide a gentle introduction to parallel programming and a foundation for writing and/or running much more complex programs on various parallel programming environments.

You will do this assignment on malbec.cs.wisc.edu - a Sun Fire T2000 Server containing a 64-thread Sun UltraSparc-T2 processor. We have given you individual accounts on this machine. Use them only for 757 homework assignments and, with the instructor's permission, for your course project. The accounts and storage will be deleted at the end of the semester unless you obtain the instructor's permission to extend them. Run /usr/platform/sun4v/sbin/prtdiag -v on malbec if you wish to learn more about the machine.

The original version of this assignment and the reference OpenMP programs were developed by Dan Gibson for CS838 offered in Fall 2005.

OpenMP

OpenMP is an API for shared-memory parallel programming for C/C++ and Fortran. It consists of preprocessor (compiler) directives, library routines and environment variables that determine the parallel execution of a program.

For this assignment, you will use the Sun Studio implementation of OpenMP that is already installed on malbec. A set of sample OpenMP programs is available here. A presentation on using OpenMP by Dan Gibson is available here. It is strongly recommended that you download and run the samples to get hands-on experience compiling, linking, and running OpenMP programs. Remember to

  • Use Sun Studio cc compiler to compile your code.
  • Include omp.h in all the source files that use OpenMP directives or library calls.
  • Use the flag -xopenmp for compilation and linking of your source files.
You can also use this sample Makefile provided by Somayeh Sardashti.

MPI

MPI is a library specification for message passing, proposed as a standard for use on parallel computers, clusters, and heterogeneous networks. The MPI standard and related information are available at http://www.mcs.anl.gov/mpi. An online reference to MPI can be found at http://www-unix.mcs.anl.gov/mpi/www/. An MPI example program is available for review here. For MPI programs:

  • Include mpi.h in all the source files that use MPI library calls.
  • Compile with the mpicc compiler located at /scratch/mpi/mpich2-1.0.8/bin/.
  • Set your LD_LIBRARY_PATH to point to /scratch/mpi/mpich2-1.0.8/lib/.
  • To run your MPI program, use mpiexec located at /scratch/mpi/mpich2-1.0.8/bin/.
  • You may have to start MPI by running mpd & (located at /scratch/mpi/mpich2-1.0.8/bin/). If mpd asks you to create .mpd.conf, do so; make sure the file's permissions are what mpd requests, then rerun mpd &.

Programming Task: Ocean Simulation

OCEAN is a simulation of large-scale sea conditions from the SPLASH benchmark suite. It is a scientific workload used for performance evaluation of parallel machines. For this assignment, you will write scaled-down versions of a variant of the Ocean benchmark: a sequential baseline and two parallel implementations.

Ocean is briefly described in Woo et al. on the Reading List. The scaled-down version you will implement is described below.

Our version of Ocean will simulate water temperatures using a large grid of integer values over a fixed number of time steps. At each time step, the value of a given grid location will be averaged with the values of its immediate north, south, east, and west neighbors to determine the value of that grid location in the next time step (total of five grid locations averaged to produce the next value for a given location).

As illustrated in Figure 1, value calculations for two adjacent grid locations are not independent. Specifically, the value of grid location 6 is calculated using the values of grid locations 6, 2, 7, 10, and 5, and the value of grid location 10 is calculated using the values of grid locations 10, 6, 11, 14, and 9. Because 6 depends on 10 and 10 depends on 6, the resulting values depend on the order in which locations 6 and 10 are updated.

Figure 1: Calculation Dependence
Figure 2: Red and Black Independent Sets

The grid points in the ocean grid can be separated into two independent subsets shown in red and black in Figure 2. Instead of calculating a new value for each grid location at every time step, the values for the grid points in the red subset are updated on even time steps and the values for the grid points in the black subset are updated on odd time steps.

The edges of the grid shown in green in Figure 2 do not participate in the averaging process (they contribute a value, but their value does not change). Thus, Ocean will converge (given sufficient runtime) to a gradient of the water temperatures on the perimeter of the grid.

Problem 1: Write Sequential Ocean (5 points)

Write a single-threaded (sequential) version of Ocean as described above. This version of Ocean must take three arguments: the x-dimension of the grid, the y-dimension of the grid, and the number of time steps. You may assume for simplicity that all grid sizes will be powers of two plus two (i.e. (2^n)+2); therefore the area of the grid that will be modified will be sized to powers of two (+2 takes care of the edges that are not modified).

You are required to make an argument that your implementation of Ocean is correct. A good way to do this is to initialize the grid to a special-case starting condition, and then show that after a number of time steps the state of the grid exhibits symmetry or some other expected property. You need not prove your implementation's correctness in the literal sense. However, please annotate any simulation outputs clearly.

Your final sequential version of Ocean should randomly initialize a grid of the requested size, then perform simulation for the specified number of time steps.

Problem 2: Write Parallel Ocean using OpenMP (5 points)

For this problem, you will use OpenMP directives to parallelize your sequential program. You are required to use the schedule(dynamic) clause on loops that you will parallelize with OpenMP. This will cause loop iterations to be dynamically allocated to threads. Please be sure to explicitly label all appropriate variables as either shared or private. Make an argument for the correctness of your implementation (it is acceptable to use the same argument as problem 1, provided it is still applicable).

The program should take an additional command-line argument: the number of threads to use in parallel sections. It is only required that you parallelize the main portion of the simulation, but parallelizing the initialization phase of Ocean is also worthwhile. You will not be penalized if you choose not to parallelize the initialization phase.

For simplicity, you may assume that the dimensions of the grid are powers of two plus two as before, and that only N=[1,2,4,8,16,32] will be passed as the number of threads.

Problem 3: Write Parallel Ocean using MPI (10 points)

For this problem, you will use MPI primitives to parallelize your sequential program. Note that the processes in a message-passing program do not share a single address space, so you will have to explicitly split your data structures among the various processes. Again, make an argument for the correctness of your implementation.

For simplicity, you may assume that the dimensions of the grid are powers of two plus two, and that only N=[1,2,4,8,16,32] will be passed as the number of processes.

Problem 4: Analysis of Ocean (10 points)

Modify your programs to measure the execution time of the parallel phase of execution. Solaris's gethrtime() is recommended for the sequential and OpenMP versions; use MPI_Wtime() for timing your MPI program.

Compare the performance of your three Ocean implementations for a fixed number of time steps (100). Plot the normalized (versus the Sequential version of Ocean) speedups of your implementations on N=[1,2,4,8,16,32] threads for a 514x514 ocean. Note that the N=1 case should be the Sequential version of Ocean, not the parallel version using only 1 thread. Repeat for an ocean sized to 1026x1026.

What to Hand In

This assignment will be peer reviewed. Please bring two copies of your homework to lecture; you will give these to your two NEWLY ASSIGNED peer review members. Your answers to the discussion questions should be typed; handwritten notes are not acceptable. A tarball of your entire source code, including a Makefile and a README file, should be emailed to your peer group members before the beginning of lecture. The README should include 1) directions for compiling your code, 2) directions for running your code, and 3) any other comments. Use the subject line [CS/ECE 757] Homework 1 so that email filters work properly.

  • A printout of the source code for the simulation phase of Sequential Ocean (this is probably just a for loop).
  • A printout of the source code for the parallel phase of OpenMP Ocean (this is the code that is parallelized using schedule(dynamic)) and a concise description of how you parallelized the program.
  • Arguments for correctness of the two programs.
  • The plots as described in Problem 4, including a detailed explanation of the observed trends. Specifically, explain 1) differences between the slopes for the two grid sizes, 2) any changes in slope for each grid size separately, 3) sources of superlinear speedup, and 4) sources of sublinear speedup.

Tips and Tricks

  • Start early. The machine will get very heavily loaded the two days before the assignment is due.
  • Set up RSA authentication on malbec to save yourself some keystrokes. HowTo.
  • Make use of the demo programs provided.
  • You can use /usr/platform/sun4v/sbin/prtdiag -v on malbec to learn many useful characteristics of your host machine.
  • Run your programs multiple times to get accurate time measurements. This will help you avoid misleading results due to interference from other users' programs.

 