Homework 2 // Due at Lecture Monday, October 4, 2010
You will perform this assignment on three architectures: an eight-core,
eight-way-threaded (64 threads total) Sun UltraSPARC T2
(gamay.cs.wisc.edu); dual-socket, quad-core, two-way-threaded (16
threads total) Intel Nehalem machines (ale-01.cs.wisc.edu
and ale-02.cs.wisc.edu); and a quad-socket, quad-core AMD
Barcelona machine (sirius.cs.wisc.edu).
As always, if you wish to learn more about gamay, run
/usr/platform/sun4v/sbin/prtdiag -v on gamay. The architecture
manual and this tuning
document describe additional details of this machine.
You may find similar information on the x86-64 Linux machines by
examining the /proc/cpuinfo file.
The key difference in the x86-based hardware for this assignment is
that the Nehalem machines are two-way multithreaded -- each
thread shares a core, an L1, and an L2 with the other thread, whereas
the
Barcelona-based machine uses multiple single-threaded cores.
Note that for multithreaded x86 cores it is important to insert the
PAUSE instruction in spin-wait loops.
This can significantly improve performance of spin-wait loops because
it avoids a memory-order violation on loop exit, and allows the
processor to throttle the execution of the spinning thread, leaving
more resources for the other thread executing on the core. (The PAUSE instruction is a no-op on
non-multithreaded implementations.) There is a macro provided with your assignment code
that will compile correctly on all architectures you will be using.
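For illustration, a spin-wait loop using such a macro might look like the sketch below. The SPIN_PAUSE name and the inline-assembly fallback are assumptions for this example only; use the macro shipped with the assignment code rather than rolling your own.

// Minimal sketch of a spin-wait loop with a pause hint. SPIN_PAUSE is a
// hypothetical stand-in for the macro provided with the assignment code.
#if defined(__x86_64__) || defined(__i386__)
#  define SPIN_PAUSE() __asm__ __volatile__("pause")
#else
#  define SPIN_PAUSE() do { } while (0)   /* no-op elsewhere */
#endif

// Spin until another thread sets *ready; the pause hint keeps the spinning
// hardware thread from starving its sibling on a multithreaded core.
void spin_until_ready(volatile int *ready)
{
    while (*ready == 0)
        SPIN_PAUSE();
}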
You should do this assignment alone. No late assignments.
Purpose
The purpose of this assignment is to give you some experience writing
a non-trivial shared-memory concurrent data structure. Additionally,
you will implement a rudimentary version of high-level concurrency
control, and will explore high-level and low-level synchronization of
concurrent data structures.
Programming Environment: POSIX Threads
This assignment marks a return to the environment from Homework 1,
namely POSIX threads (pthreads). We will use pthreads in this
assignment primarily because of its easy portability (Linux/x86 and
Solaris/SPARCv9 are both POSIX-compliant platforms).
Unlike the previous pthread assignment, the mechanics of thread
creation and the orchestration of thread movement have been done for
you. As you write your code, take care to avoid any platform-specific
function calls, as you will be required to evaluate your
implementation on all three of the platforms listed above.
Programming Task: Concurrent Binary Tree
Trees of various flavors are common -- perhaps the simplest is the
unbalanced binary tree, which will be the subject of this
assignment. In general, you will discover that binary trees are not
the most scalable of data structures, but they are conceptually
simple.
You are given an implementation of a concurrent binary tree, which has
been (somewhat) optimized for a small number of threads. Your task is
to improve the tree's scalability and throughput on highly-concurrent
machines, such as Sun's T2000 servers.
The tree in question is used to implement a map abstraction. Maps are used to associate a unique key with a value, which
for simplicity is also unique. Maps can be implemented with a variety of data structures (hash tables are common), though your implementation
must use a binary tree. The tree supports the following
primitive operations in parallel:
int Lookup( int key );
void Remove( int key );
void Set( int key, int value );
The above functions may be called in parallel by any number of
threads, up to the number of threads specified in the constructor of
the tree. The tree's constructor and destructor, as well as the print() function, need not be thread-safe.
Serializability Rules
Providing primitive atomic operations is often necessary but is not
always a sufficient interface for writing parallel codes. In
particular, it is desirable for semantically related sequences of
operations to appear atomic, even in the presence of concurrency.
We would like these sequences (hereafter transactions on the
data structure) to be serializable, that is, to appear as
though the transactions executed in a single global order. (Consult
your reading list for more on serializability.)
Hence, in addition to providing the primitives above, the tree sports
a transactional interface:
void InitiateTransaction();
void CommitTransaction();
void TransactionAborted();
bool TransactionalLookup( int &data, int key );
bool TransactionalRemove( int key );
bool TransactionalSet( int key, int value );
In particular, a thread begins a transaction by calling InitiateTransaction(). Subsequent calls to the
transactional access functions (TransactionalLookup, TransactionalRemove, and TransactionalSet) return a boolean value,
indicating whether the current transaction must abort -- undo
its changes and restart -- because of a violation of
serializability. The transaction is ended by a call to CommitTransaction(). If the transaction is
aborted, TransactionAborted() is called by
the thread once the thread has undone all of its changes to the
tree. The precise rules for implementing these functions are discussed
below.
The transactional and non-transactional interfaces are not
intended to be used at the same time -- the tree need not respect
transactional semantics if transactional and non-transactional
interfaces are used concurrently. However, all non-transactional
access
must remain atomic in all cases.
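Putting these pieces together, a transaction typically follows the pattern sketched below. The retry-on-abort loop and the CTree.h header name are assumptions for this illustration; the transactions that are actually graded live in Transactions.*.

// Sketch of the transactional calling convention with a simple retry loop.
#include "CTree.h"            // assumed header for class ConcurrentTree

void conditional_update(ConcurrentTree &tree, int key)
{
    bool aborted;
    do {
        aborted = false;
        tree.InitiateTransaction();

        int v;
        aborted = tree.TransactionalLookup(v, key);
        if (!aborted && v > 5)
            aborted = tree.TransactionalSet(key, v - 5);

        if (aborted) {
            // The single write cannot have taken effect if it signalled the
            // abort, so nothing needs to be undone here; a transaction with
            // several writes would roll back its earlier writes first.
            tree.TransactionAborted();
        } else {
            tree.CommitTransaction();
        }
    } while (aborted);
}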
Serializability v. Atomic Access Examples
Consider these two transactions that wish to operate on our example binary tree (abort code not shown for simplicity):

Thread 1:
  tree.InitiateTransaction();
  TransactionalLookup( x, 7 );
  if( x > 5 ) {
    TransactionalRemove(7);
  }
  tree.CommitTransaction();

Thread 2:
  tree.InitiateTransaction();
  TransactionalLookup( y, 8 );
  if( y > 5 ) {
    TransactionalSet(8,5);
    TransactionalSet(9,y-5);
  }
  tree.CommitTransaction();
Notice that these two transactions seem to be composable -- their read
and write sets don't overlap at all (Thread 1 looks at 7 and might
remove it, Thread 2 looks at 8 and might update 8 and create 9). But
when the transactions are applied to the tree below, we see that these
accesses do in fact need to be synchronized (assume the implementation
of Remove() avoids rotations by promoting a leaf child over a non-leaf
child, promoting the right child in the event of a tie, and never
rebalancing the tree).
        <5,5>
       /     \
   <3,3>     <7,7>
   /         /    \
<2,2>    <6,6>    <8,8>
Depending on the ordering of the transactions, both of the following trees are valid, and either could result from a serializable
execution.

Thread 1's transaction first, then Thread 2's:

        <5,5>
       /     \
   <3,3>     <8,5>
   /         /    \
<2,2>    <6,6>    <9,3>

Thread 2's transaction first, then Thread 1's:

        <5,5>
       /     \
   <3,3>     <6,6>
   /              \
<2,2>            <8,5>
                      \
                     <9,3>
This example illustrates the high-level notion of transaction
serializability (both trees represent the same map, but are
structurally different). The next example illustrates that
individual atomic operations alone are not sufficient to guarantee
transactional serializability. Consider:
Thread 1:
  int sum = 0;
  tree.InitiateTransaction();
  for(int i=0;i<N;i++) {
    TransactionalLookup( x, i );
    sum += x;
  }
  tree.CommitTransaction();

Thread 2:
  tree.InitiateTransaction();
  TransactionalLookup( y, 8 );
  if( y > 5 ) {
    TransactionalSet(8,5);
    TransactionalSet(9,y-5);
  }
  tree.CommitTransaction();
Thread 1 is trying to find the sum of all the data values in the tree,
while Thread 2 is attempting the same transaction as the earlier
example. Suppose initially the tree looks as it did for the previous
example:
        <5,5>
       /     \
   <3,3>     <7,7>
   /         /    \
<2,2>    <6,6>    <8,8>
We assume that each TransactionalX() call
is atomic. Hence, the dynamic sequences of TransactionalX() calls from each thread are
(assume N is 10):
Thread 1:
  Lookup( x, 0 );
  Lookup( x, 1 );
  Lookup( x, 2 );
  Lookup( x, 3 );
  Lookup( x, 4 );
  Lookup( x, 5 );
  Lookup( x, 6 );
  Lookup( x, 7 );
  Lookup( x, 8 );
  Lookup( x, 9 );

Thread 2:
  Lookup( y, 8 );
  Set( 8, 5 );
  Set( 9, y-5 );
The dynamic execution could interleave these accesses in any way that
preserves each thread's own program order. Suppose we interleave the
operations such that Thread 2's Set operations occur near the end of
Thread 1's scan:
Order of operations at the tree (each call annotated with the issuing thread):
  Lookup( x, 0 );     // Thread 1
  Lookup( x, 1 );     // Thread 1
  Lookup( x, 2 );     // Thread 1
  Lookup( x, 3 );     // Thread 1
  Lookup( x, 4 );     // Thread 1
  Lookup( x, 5 );     // Thread 1
  Lookup( x, 6 );     // Thread 1
  Lookup( x, 7 );     // Thread 1
  Lookup( y, 8 );     // Thread 2
  Set( 8, 5 );        // Thread 2
  Lookup( x, 8 );     // Thread 1
  Lookup( x, 9 );     // Thread 1
  Set( 9, y-5 );      // Thread 2
A problem arises because Thread 1 observes Thread 2's change to key 8
but not Thread 2's addition of key 9. Hence, the sum Thread 1 computes
is not the sum of any tree state that a serializable execution could
produce.
Provided Code
The code provided (here)
is an implementation of the concurrent tree in C++, along with a host
of test programs; it compiles cleanly on all three architectures on
which you will perform this assignment. The purpose of each file is
briefly summarized below:
File           | Contents
Barrier.*      | Implements an object-oriented barrier using POSIX interfaces.
CTree.*        | Implements the concurrent binary tree -- you will heavily modify these files in this assignment.
fatals.*       | Bails out of the program, displaying an error message.
main.C         | Spawns threads according to the number of available processors and runs the tests in Tests.C according to preprocessor options.
ProcMap.*      | Determines the physical processor identifiers for the system and gives a generic interface to processor affinity.
Stats.*        | Tracks some statistics about the execution.
Tests.*        | Provides a set of tests for the concurrent tree, including a single-threaded test, a parallel non-transactional torture test, a transactional torture test, and the throughput test. You may modify the single-threaded test and the torture tests as you wish for your own testing purposes. You may not modify the throughput test in any way.
Transactions.* | Implements some transactions for testing and throughput analysis. The transactions may not be modified (except for the purpose of debugging your code).
Your final solution must pass all provided torture tests; for expediency, you need only report your
statistics for the throughput test. Preprocessor flags in main.C can be toggled to run only a subset of the
available tests at a time. These tests include:
- Serial tests: exercise the tree using only a single thread.
- Parallel tests: stress the tree using multiple threads, but using only atomic operations (no transactions).
- Transactional tests: run many intensive concurrent transactions on the tree and look for serializability violations.
- Throughput test: runs realistic transactions on the tree; you will report your transaction throughput from this test.
Transaction Types
This section is informative -- it details the types of transactions that are implemented in Transactions.*.
Name               | Behavior
Update             | Reads then writes a small number of random elements.
Conditional Add    | If a given key does not exist in the map, the key is inserted with value 0.
Conditional Remove | If a given key exists in the map and has value 0, the key is removed.
Lookup             | Small read-only query of several key/value pairs.
Scan               | Large read-only query of many key/value pairs.
TortureX           | "Torture" version of transaction X. Regular transactions include delays to emulate work within the transaction; the torture versions have this delay removed to place additional stress on the tree.
(Additional) Rules for Implementation
- You may add any public or private data members, or even whole classes, to CTree.*, but you may not
modify existing interfaces.
- You may perform any additional tests on your code if you wish. Feel free to modify any files you require to add additional
testing, but you must be able to run the original throughput test on your concurrent tree implementation.
- If you elect to abort a transaction that calls TransactionalLookup(),
TransactionalSet(), or TransactionalRemove() (by returning
true from any of those functions), the tree should be left unmodified. The code in
Transaction.C will not undo the action that caused the abort. If you wish to change this
semantic of the interface, you must re-code Transaction.C appropriately.
- The Remove family of operations must actually remove a node from the tree.
- class ConcurrentTree must be implemented as a binary tree.
- Degenerate cases of Lookup(), Set(), Remove(),
and their transactional counterparts are not error cases. Lookup()s of non-existent keys should
return NOT_IN_TREE. Set()s on non-existent keys should create a new node and
insert it into the tree. Remove()s of non-existent keys should not change the tree, nor should they raise
an error of any kind.
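To make the last rule concrete, the sketch below shows the expected behavior of the degenerate cases (key 12345 is assumed to be absent from the tree initially; the ConcurrentTree name and header follow the provided code):

// Degenerate cases are not errors: the expected behavior, per the rule above.
#include "CTree.h"            // assumed header for class ConcurrentTree

void degenerate_cases(ConcurrentTree &tree)
{
    int v = tree.Lookup(12345);   // key absent: returns NOT_IN_TREE
    tree.Remove(12345);           // key absent: tree unchanged, no error raised
    tree.Set(12345, 1);           // key absent: a new node is created and inserted
    v = tree.Lookup(12345);       // now returns 1
}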
Problem 1: Improve Concurrency of Atomic Tree Operations
The code provided (here) uses a single lock to protect the entire
binary tree while performing Lookup(), Set(), and Remove()
operations. Improve this locking scheme in any way you prefer, subject to the rules above. Your code should compile cleanly on all target
architectures.
Remember: This level of synchronization is intended to protect the physical validity of the tree.
Problem 1 is purposely open-ended. In general, work here will pay off when you work on Problem 2. The availability of scalable atomic
operations will be of great value.
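As a hedged illustration only (not the required design), one classic direction is hand-over-hand (lock-coupling) traversal, in which a thread never holds more than two per-node locks at once. The Node layout, the separate root lock, and the NOT_IN_TREE stand-in below are all assumptions for this sketch:

#include <pthread.h>
#include <cstddef>

#ifndef NOT_IN_TREE
#define NOT_IN_TREE (-1)   /* stand-in for the sentinel defined in CTree.* */
#endif

struct Node {
    int              key;
    int              value;
    Node            *left;
    Node            *right;
    pthread_mutex_t  lock;   // one mutex per node (hypothetical layout)
};

// Lock-coupled lookup: acquire the child's lock before releasing the parent's.
int lookup_hand_over_hand(Node *root, pthread_mutex_t *root_lock, int key)
{
    pthread_mutex_lock(root_lock);            // protects the root pointer
    Node *cur = root;
    pthread_mutex_t *held = root_lock;

    while (cur != NULL) {
        pthread_mutex_lock(&cur->lock);       // take the child...
        pthread_mutex_unlock(held);           // ...then drop the parent
        held = &cur->lock;

        if (key == cur->key) {
            int v = cur->value;
            pthread_mutex_unlock(held);
            return v;
        }
        cur = (key < cur->key) ? cur->left : cur->right;
    }
    pthread_mutex_unlock(held);
    return NOT_IN_TREE;
}

Set() and Remove() need more care than this (a writer must keep the parent locked while it splices nodes in or out), and lock-coupling is only one of many possible schemes.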
Problem 2: Improve Transactional Throughput on Gamay, Sirius,
and Ale Nodes
Next, you will leverage your concurrent operations from Problem 1 to create serializable transactions. The provided code uses another
global lock to provide transactional access -- a sufficient solution for two-way hyperthreaded machines, but grossly inadequate for the
aggressively threaded CMPs.
As with Problem 1, you are free to improve the design in any way you
prefer, subject to the above set of rules. Remember: This level of
synchronization is intended to protect the semantic value of the map,
not necessarily the physical structure of the tree.
Bear in mind that, while you must provide a correct solution to
the transactional torture tests, you seek to optimize the
throughput test. Again, this problem is open-ended -- challenge yourself to
improve your throughput as much as you can.
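Purely as an illustrative sketch (again, not the required approach), one way to detect conflicts is strict two-phase locking over keys: each transactional operation first claims a per-key lock, returns true (abort) if another transaction already owns it, and all claims are released at commit or abort. Every name below is hypothetical and not part of the provided code:

#include <pthread.h>
#include <map>
#include <set>

// Hypothetical per-key lock table for strict two-phase locking. A single
// mutex guards the table itself, which is a simplification; a real
// implementation would want something finer-grained.
class TxnLockTable {
public:
    TxnLockTable() { pthread_mutex_init(&mutex_, NULL); }

    // Returns false if another transaction owns the key; the caller aborts
    // instead of blocking, which also sidesteps deadlock.
    bool Acquire(int key, int tid) {
        pthread_mutex_lock(&mutex_);
        std::map<int, int>::iterator it = owner_.find(key);
        bool ok = (it == owner_.end()) || (it->second == tid);
        if (ok) { owner_[key] = tid; held_[tid].insert(key); }
        pthread_mutex_unlock(&mutex_);
        return ok;
    }

    // Called from CommitTransaction() and TransactionAborted().
    void ReleaseAll(int tid) {
        pthread_mutex_lock(&mutex_);
        std::set<int> &keys = held_[tid];
        for (std::set<int>::iterator k = keys.begin(); k != keys.end(); ++k)
            owner_.erase(*k);
        keys.clear();
        pthread_mutex_unlock(&mutex_);
    }

private:
    pthread_mutex_t               mutex_;
    std::map<int, int>            owner_;   // key -> owning thread id
    std::map<int, std::set<int> > held_;    // thread id -> keys it holds
};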
Problem 3: Description of Synchronization Strategies
Describe the approaches you used on Problems 1 and 2. Provide neatly-drawn pictures where appropriate to illustrate tree locking.
Problem 4: Evaluation
Once your implementation passes all provided tests, report the
transaction throughput on an ale node, on sirius,
and on gamay in a table. Also include the throughput of the
provided implementation, and the speedup on each platform, in tabular
form.
Problem 5: Questions (Submission Credit)
- On which platform does the provided code perform best? Why do you suppose this is the case?
- On which platform does the provided code scale best? Why do you suppose this is the case?
- On which platform does your improved implementation perform best? Why do you suppose this is the case?
- On which platform does your improved implementation scale best? Why do you suppose this is the case?
- What, if any, additional testing did you use to debug your implementation?
- Which was more difficult: Problem 1 (ensuring atomicity of primitive operations) or Problem 2 (ensuring
serializability of transactions)?
- Of the transactions implemented in Transaction.C, which was the most troublesome
in Problem 2?
Tips and Tricks
Start early. You never know when the TA will disappear the night before the assignment is due.
Note: the Makefile for this homework uses GNU extensions. GNU Make is the default on Linux machines, but on gamay (which runs Solaris), you need to use the gmake command to build the project.
Set up RSA authentication on gamay to save yourself some keystrokes. HowTo
Make a plan before doing Problems 1 or 2. An ounce of planning is worth a pound of coding. Draw trees.
There's a handy GetThreadID() function packaged with the default implementation.
Start early.
Make use of the tests provided, and make your own if they're not sufficient for your needs.
Did we mention start early?
What to Hand In
Please turn this homework in on paper at the beginning of lecture.
- The source code for all files that differ from the provided source code.
- Your description from Problem 3.
- The table from Problem 4.
- Answers to the questions in Problem 5.
Important: Include your name on EVERY page.