CS/ECE 758 Fall 2009

CS 758 Advanced Topics in Computer Architecture
Programming Current and Future Multicore Processors
Fall 2009 Section 1
Instructor: David A. Wood; TA: Matthew D. Allen
URL: http://www.cs.wisc.edu/~david/courses/cs758/Fall2009/

Homework 4 // Due at Lecture Monday, October 12, 2009

You will perform this assignment on the x86-64 Nehalem-based systems you used in Homework 3: ale-01.cs.wisc.edu and ale-02.cs.wisc.edu.

You should do this assignment alone. No late assignments.

Purpose

The purpose of this assignment is to further explore the features of Intel(R) Threading Building Blocks (TBB), a multithreading package, including its task management, synchronization, and (optionally) concurrent data structures.

Programming Environment: TBB

Intel's Threading Building Blocks (TBB) package provides a host of useful services to the parallel programmer, including some of the same loop parallelization options provided by OpenMP (with different syntax, of course). Intel provides a handy Getting Started Guide, available at the link above under the Documentation tab, which covers everything you need to know about TBB for the purposes of this assignment. You will also find the Tutorial document very useful.

Programming Task: Othello AI

Othello (also known as Reversi) is a two-player strategy board game played on an 8x8 board. A move consists of placing a token on the board to flip your opponent's tokens, subject to placement rules. You are not required to learn the rules of Othello, but it is probably a good idea nonetheless, since you will be parallelizing an Othello AI.

In this assignment, you are given a serial implementation of a recursive lookahead-based Othello AI (you can download the code here). Your task is to parallelize and improve the code in a variety of ways, using the features of Intel's Threading Building Blocks package.

The serial version you are given starts with the current Othello board position, enumerates all possible moves, and then recursively evaluates all potential move combinations up to the desired lookahead (that is, the depth of the search). The serial version is not very intelligent in its approach to Othello AI: a complete evaluation of the search space is not necessary, and the fitness criterion used by the implementation is very basic. You can improve on these shortcomings if it suits your interests. In general, you may modify any part of the provided code.
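The recursive lookahead described above can be sketched roughly as follows. This is a minimal illustration, not the provided code: the `Position` struct, the negamax formulation, and the names `lookahead` and `best_move` are assumptions made for the example, and a real Othello board would generate its legal moves instead of storing them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for an Othello position: a static fitness value
// (from the point of view of the player to move) plus the positions
// reachable in one move. The provided code has its own board type.
struct Position {
    int fitness;
    std::vector<Position> next;
};

// Depth-limited exhaustive search in negamax form: the value of a position
// is the best of the negated values of its successors, because a good
// outcome for the opponent is a bad outcome for the player to move.
int lookahead(const Position& p, int depth) {
    if (depth == 0 || p.next.empty()) {
        return p.fitness;  // leaf: fall back to the static fitness
    }
    int best = std::numeric_limits<int>::min();
    for (const Position& child : p.next) {
        best = std::max(best, -lookahead(child, depth - 1));
    }
    return best;
}

// Index of the move with the best lookahead value, or -1 if no move exists.
// Assumes depth >= 1.
int best_move(const Position& p, int depth) {
    int best_index = -1;
    int best_value = std::numeric_limits<int>::min();
    for (std::size_t i = 0; i < p.next.size(); ++i) {
        int value = -lookahead(p.next[i], depth - 1);
        if (value > best_value) {
            best_value = value;
            best_index = static_cast<int>(i);
        }
    }
    return best_index;
}
```

Every call visits all successors down to the requested depth, which is why runtime grows so quickly with the lookahead and why pruning the search space is one possible improvement.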

The AI plays the game against itself. It uses a greedy algorithm to select the white player's moves and uses its depth-search capability only to discover the black player's moves. This is done to limit overall runtime, but you may easily change this if you prefer to use the depth-search algorithm for both players.

Problem 1: Parallel Othello AI using parallel_do

The provided code has plenty of candidate loops that could be parallelized. To fulfill the requirements of this problem, you must parallelize one or more of these loops using TBB's parallel_do construct, and any other elements of the TBB package that you may desire (parallel_do is the only requirement).

Your choice of which loop(s) to parallelize will determine the difficulty of this problem, as well as your potential for speedup. You are encouraged to experiment with many different options -- once you have learned the parallel_do syntax, it is relatively easy to try several different loops. You will be expected to explain your choice of loop(s) in Problem 4.

Problem 2: Parallel Othello AI using Task Recursion

Everything in TBB is a task -- up to now, we have used TBB's loop parallelization capability to implicitly form tasks to run in parallel. Now, instead of parallelizing a loop, we will parallelize the recursive operation of the original algorithm.

Starting again with the serial code, parallelize the recursive search algorithm using TBB's spawnable tasks (similar to the Fibonacci example of section 10.2 of Intel's Tutorial document). To fulfill the requirements of this problem, your revised code must include at least one C++ class that inherits from class task in the TBB package (and, naturally, this task must be called recursively, in parallel).
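The general shape of a parallel recursive search can be sketched with the C++ standard library alone. Note that this uses std::async as a stand-in and is not TBB: it does not define a class derived from TBB's task, so it does not by itself fulfill the requirements of this problem. The `Position` struct, the function names, and the `parallel_levels` cutoff are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <limits>
#include <vector>

// Hypothetical position type: static fitness for the player to move,
// plus the positions reachable in one move.
struct Position {
    int fitness;
    std::vector<Position> next;
};

// Serial depth-limited search (negamax form), used below the cutoff.
int search_serial(const Position& p, int depth) {
    if (depth == 0 || p.next.empty()) return p.fitness;
    int best = std::numeric_limits<int>::min();
    for (const Position& child : p.next) {
        best = std::max(best, -search_serial(child, depth - 1));
    }
    return best;
}

// Parallel variant: for the top `parallel_levels` levels of the tree, each
// child subtree is searched by its own asynchronous call; deeper levels run
// serially so the number of threads stays bounded.
int search_parallel(const Position& p, int depth, int parallel_levels) {
    if (depth == 0 || p.next.empty()) return p.fitness;
    if (parallel_levels <= 0) return search_serial(p, depth);

    std::vector<std::future<int>> results;
    results.reserve(p.next.size());
    for (const Position& child : p.next) {
        // `child` refers into p.next, which outlives the futures because
        // every future is waited on before this function returns.
        results.push_back(std::async(std::launch::async,
            [&child, depth, parallel_levels] {
                return search_parallel(child, depth - 1, parallel_levels - 1);
            }));
    }

    int best = std::numeric_limits<int>::min();
    for (std::future<int>& r : results) {
        best = std::max(best, -r.get());  // join the children, then combine
    }
    return best;
}
```

The spawn-children-then-wait structure is the part that carries over to a task-based design; a TBB version would replace the std::async calls with spawned child tasks and let the TBB scheduler balance the work instead of creating one thread per call.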

Optional: Explore the multitude of tweaking options available in TBB -- continuations, scheduler bypass, and task recycling, for instance. You might also want to try some other TBB features along the way (concurrent data structures, locks of various flavors, atomic operations, scalable allocators, etc.) -- this will be the last TBB assignment.

Problem 3: Evaluation

Evaluate your code on the Nehalem (ale) platform. Provide a table showing total execution time, in seconds, with one row each for the original algorithm, your parallel_do parallelization from Problem 1, and your recursive task parallelization from Problem 2, and one column for each lookahead in [3,4,5,6,7]. Note alongside the table the number of threads used to attain the best performance (or state that you used the TBB default value). This value need not be constant across all implementations (e.g., N=6 for Problem 1 and N=8 for Problem 2 is acceptable). An example follows (the numbers below are 100% fictitious).

Implementation	Lookahead=3	Lookahead=4	Lookahead=5	Lookahead=6	Lookahead=7
Serial	1s	2s	5s	60s	10000s
PDo	1s	1s	2s	10s	100s
PTask	1s	1s	2s	10s	100s
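One possible way to collect the numbers for such a table is to time each run with a wall-clock timer and compare against the serial baseline. The sketch below uses only the C++ standard library; the helper names `seconds_to_run` and `speedup` are made up for this example and are not part of the provided code or of TBB.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock seconds taken by one call of `work`.
// steady_clock is used because it never jumps backwards.
template <typename Work>
double seconds_to_run(Work work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup of a parallel run relative to the serial baseline:
// values above 1.0 mean the parallel version was faster.
double speedup(double serial_seconds, double parallel_seconds) {
    return serial_seconds / parallel_seconds;
}
```

With the fictitious numbers above, speedup(10000.0, 100.0) evaluates to 100.0 for lookahead 7. Timing the whole game rather than a single move matches the "total execution time" the table asks for.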

Problem 4: Questions (Submission Credit)

Which loop(s) did you select for parallelization in Problem 1? Why did you choose that/those loop(s)? Why were other loops unsuitable? Use specific examples.
Which implementation (parallel_do or Task-based Recursion) performs better for lookaheads 6 and 7? Which implementation did you prefer?
How many total board positions are examined in your implementations for Lookahead = 6? Lookahead = 7? What is the rate at which boards are examined in each of the above implementations? You may calculate or measure the value.
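For the board-count question, both routes (calculate or measure) can be sketched briefly. The code below is an illustration under assumptions: the counter, the `note_board_examined` hook, and the constant branching factor are all hypothetical. In a real Othello game the number of legal moves varies from position to position, so the calculated figure is only a rough estimate.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Measured route: a shared counter that any thread can increment safely.
// Relaxed ordering is enough because only the final total is read.
std::atomic<std::uint64_t> boards_examined{0};

// Hypothetical hook: call once for each board the search evaluates.
inline void note_board_examined() {
    boards_examined.fetch_add(1, std::memory_order_relaxed);
}

// Calculated route: number of positions in a full tree with a constant
// branching factor, counting every level from the root (depth 0) down to
// depth `lookahead`.
std::uint64_t boards_in_uniform_tree(std::uint64_t branching, int lookahead) {
    std::uint64_t total = 0;
    std::uint64_t level = 1;  // positions at the current depth
    for (int d = 0; d <= lookahead; ++d) {
        total += level;
        level *= branching;
    }
    return total;
}

// Rate at which boards are examined, in boards per second.
double boards_per_second(std::uint64_t boards, double seconds) {
    return static_cast<double>(boards) / seconds;
}
```

A single shared counter can become a point of contention when many threads update it; keeping per-thread counts and summing them at the end is one alternative if the measurement itself starts to affect the timing.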

Tips and Tricks

Start early.

Read Intel's TBB Tutorial (Select Documentation tab).

Don't forget to add -ltbb and other useful switches to the provided Makefile.

Don't forget to source TBB's environment variables!

What to Hand In

Please turn this homework in on paper at the beginning of lecture.

A printout of your parallel_do code from Problem 1, annotated to indicate which loop from the original program you parallelized.

A printout of your task class and stream class from Problem 2, annotated to indicate your strategy for task recursion.

If you improved the AI in any way, a brief description of the improvements (keep your code!).

The table from Problem 3.

Answers to questions in Problem 4.

Important: Include your name on EVERY page.


Feedback or content questions: send email to "david" at the cs.wisc.edu server
Technical or accessibility issues: lab@cs.wisc.edu
Copyright © 2002, 2003 The Board of Regents of the University of Wisconsin System.
