CS/ECE 758 Fall 2007

CS/ECE 758 Advanced Topics in Computer Architecture
Programming Current and Future Multicore Processors
Fall 2007 Section 1
Instructor David A. Wood and T. A. Dan Gibson
URL: http://www.cs.wisc.edu/~david/courses/cs758/Fall2007/

Homework 6 // Due at Lecture Thursday, October 25, 2007

You will perform this assignment on a simulated SPARCv9-based architecture that supports Hardware Transactional Memory. You will set up and run the simulator, Multifacet GEMS 2.0, on the x86-64 clover nodes, clover-01.cs.wisc.edu and clover-02.cs.wisc.edu. You will build your workloads on a real SPARCv9 machine, chianti.cs.wisc.edu.

You may find the GEMS 2.0 TM documentation marginally useful.

You should do this assignment alone. No late assignments.

Purpose

The purpose of this assignment is to give you some experience converting lock-based synchronization into transactions, and to introduce the simulation capabilities of Multifacet GEMS 2.0, which you may want to use for your class projects.

Programming Environment: POSIX Threads + LogTM

Once again, threads in this homework are of the POSIX flavor.

As in HW5, the orchestration and creation/destruction of threads have been done for you, as you will be re-using most of your code from HW5. The burden of multiple platforms, however, is removed -- your code need only run correctly under simulation.

The target machine is functionally simulated by a tool called Simics. Simics performs full-system simulation -- right down to the boot PROM -- but doesn't model any realistic timing. Timing the run is the role of Ruby, the memory system simulator from GEMS. Ruby also implements transactional memory -- so your transactional code won't be atomic unless Ruby is loaded.

Programming Task: Concurrent Binary Tree, Reloaded

This homework re-uses your lock-based implementation of a concurrent binary tree from HW5. You will modify this code to use LogTM instead of locks for synchronization in many cases.

Problem 1: Simulator Setup

This section is a brief guide on how to download, set up, and install GEMS. By the end of this list, you will be sufficiently familiar with GEMS to run your transactional binary tree as a GEMS microbenchmark. Be sure to perform these steps on a clover machine!

1. Register to download GEMS 2.0, and download gems-release2.0.tar.gz via this link. Save the tarball wherever you wish.
2. Download a tarball of Simics 2.2.19 here (UW IPs only). Save the tarball wherever you wish.
3. Follow the GEMS installation wiki documentation here. Do not proceed to the QuickStart page.

   Be sure to use amd64-linux as your host type, not x86-linux as is shown on the setup page.
   You may optionally decide not to set up opal, the out-of-order processor simulator in GEMS -- it will not be used in this homework.

4. Build ruby. Change directory to $GEMS/ruby, then invoke
       make -j 8 PROTOCOL=MESI_CMP_filter_directory DESTINATION=MESI_CMP_filter_directory
5. Set some environment variables. You should set these each time you run the simulator. The following uses csh-style syntax.
       setenv SIMICS_EXTRA_LIB ./modules
       setenv VTECH_LICENSE_FILE /p/multifacet/projects/simics/licenses/license.dat
6. Copy the $GEMS/microbenchmarks directory to chianti:
       cd $GEMS/microbenchmarks
       scp -r * joeuser@chianti:cs758/microbenchmarks/
7. SSH to chianti, CD to the microbenchmarks/transactional/common directory, and change the definition of CC in Common.Makedefs to /usr/sfw/bin/gcc.
       cd cs758/microbenchmarks/transactional/common
       vi Common.Makedefs
   While you're here, notice the files transaction.[ch]. You'll be using these for Problem 2.
8. CD to the microbenchmarks/transactional/deque directory, and build the double-ended queue microbenchmark.
       cd ../deque
       make
   You should see no errors or warnings of any kind.
9. SCP the binaries (deque_Lock and deque_TM) back to the AFS world. Do not change the names of the binaries.
       scp deque_Lock joeuser@clover-02:
       scp deque_TM joeuser@clover-02:
10. Back on the clover nodes, copy or move the binaries to $GEMS/microbenchmarks/transactional/deque.
       cp ~/deque_Lock $GEMS/microbenchmarks/transactional/deque
       cp ~/deque_TM $GEMS/microbenchmarks/transactional/deque
11. Now, we can configure the LogTM features of the simulated target machine. Start off by CDing to $GEMS/gen-scripts and opening config.py in your favorite editor.
12. First, we're going to use only perfect signatures for this assignment. Uncomment filter_config_list.append("Perfect_") on line 51, and comment out all other appends to filter_config_list, on lines 55-123. Comments begin with a # in Python.
13. For now, we're only interested in running the transactional memory binary, so we'll leave lock_type_list.append("Lock") on line 139 commented out -- you'll want to remove this comment for Problem 2.
14. We can select our coherence protocol by modifying protocol_list. Comment out all protocols except ["MESI_CMP_filter_directory", ee_base_pred] on line 169.
15. We've provided a two-processor and a 32-processor target machine, but we'll run the two-processor case only, for now. Make sure the only appended processor count for procs_per_chip_list is 2, on line 200.
16. Next, we want to tell the simulator which of its many microbenchmarks we want to run. Comment out lines 252-257 and 259-273, leaving only microbenchmark_list.append(("deque", 1024, 1, "1024ops-32bkoff", "1024 32")).
17. One last thing to notice in microbench.py: Ruby has a 'fast' mode that allows transactional semantics without doing any timing simulation. In essence, this 'fast' mode allows you to debug your transactional code. You might want to learn how to turn it on (and off) now... search for a commented line in microbench.py that sets enable_tourmaline to 1, near or at line 209.
   For a bitter diatribe on Tourmaline, email Dan.
18. Save and close config.py.
19. Open microbench.py.
20. Comment out line 157, and remove the comment from line 158. This selects a rather unrealistic point-to-point interconnect for our hypothetical CMP.
21. Run gen-scripts.py. A script to run Simics/GEMS will be produced in $GEMS/results/scripts.
       ./gen-scripts.py
22. CD back to $GEMS/microbenchmarks/transactional/deque and open deque.simics in your favorite editor.
   Comment out line 29, and add the following line in its place:
       @mfacet.run_sim_command('read-configuration "/panfs/panasas-01.cs.wisc.edu/scratch/cs758/silver/simics-2.X/2p/silver-2p.check"' )
   You will eventually use
       @mfacet.run_sim_command('read-configuration "/panfs/panasas-01.cs.wisc.edu/scratch/cs758/silver/simics-2.X/32p/silver-32p.check"' )
   for 32-processor simulations.
   While you're here, notice lines 36-43, where command_lines is defined. These are the commands that will be issued to the target machine. You will copy this file and modify these lines as part of Problem 2. Notably, line 39 copies the binaries from the host machine to the target (you will change "deque" to "cTree" here), and line 41 executes the binary (here you will change "./deque %d %s" to "./cTree" and remove everything after the non-quoted %). Don't make these changes now, however.
23. CD to $GEMS/results/scripts/ and run the deque-TM script there.
       cd $GEMS/results/scripts/
       ./deque-TM-1024ops-32bkoff-1c-2p-1t-default-EagerCD_EagerVM_Base_Pred-Perfect_-Perfect_-Perfect_-10000.sh
   If you see errors or assertions, type 'q' to exit Simics and re-run the script.
24. When the simulation is done, check out the stats in $GEMS/results/deque-TM-1024ops-32bkoff. Report the total Ruby_cycles, the number of L2 misses (Total_misses), and the number of instructions executed (instruction_executed) in a table.
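
The three counters requested for the table can be pulled out of a stats file with a few lines of Python. This is only a sketch: the counter names come from the text above, but the "name: value" line layout, the helper names extract_counters and format_table, and the SAMPLE text are assumptions made for illustration, not the documented GEMS stats format. Adjust the pattern to whatever your stats file actually contains.

```python
import re

# Counter names are taken from the assignment text. The "name: <integer>"
# line layout is an ASSUMPTION about the stats file, not a documented format.
COUNTERS = ("Ruby_cycles", "Total_misses", "instruction_executed")


def extract_counters(stats_text, names=COUNTERS):
    """Return {name: int} using the first 'name: <integer>' line per counter."""
    found = {}
    for name in names:
        pattern = r"^\s*" + re.escape(name) + r":\s*(\d+)"
        match = re.search(pattern, stats_text, re.MULTILINE)
        if match:
            found[name] = int(match.group(1))
    return found


def format_table(counters):
    """Render the counters as a two-column plain-text table."""
    width = max((len(name) for name in counters), default=0)
    rows = [f"{name:<{width}}  {value:>12}" for name, value in counters.items()]
    return "\n".join(rows)


# Made-up sample input for illustration only; NOT real simulator output.
SAMPLE = """\
Ruby_cycles: 1000
Total_misses: 20
instruction_executed: 3000
"""

if __name__ == "__main__":
    print(format_table(extract_counters(SAMPLE)))
```

To use it on a real run, read the stats file from $GEMS/results/deque-TM-1024ops-32bkoff into a string and pass that string to extract_counters() in place of SAMPLE.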

Problem 2: Transactionalize your Concurrent Tree

The goal of problem 2 is to make a version of the concurrent tree microbenchmark that uses transactions instead of locks to provide atomic tree operations.

Remember transaction.[ch] from Problem 1's Step 7? Add them to your fine-grained lock implementation on chianti, and be sure to change your code to compile with CC=/usr/sfw/bin/gcc and with -Wa,-xarch=v8plusa -DSIMICS passed to GCC.

Update
You'll need to use the file transaction.C in place of the provided GEMS transaction file, because of the C/C++ language barrier. Compile this file normally with /usr/sfw/bin/g++ rather than gcc.

Refer to the source for the deque microbenchmark to see the transactional memory syntax.

Use transactional memory to replace your locks for the Lookup(), Set(), and Remove() interfaces. You may optionally use TM to synchronize the 'transactional' accesses to the tree, though the caveats below may prevent you from doing so.

Update
If you have issues with libraries on the target, have a look at this addendum.

Transactional dos and don'ts

This list arises partly due to some interesting semantics of transactional memory, and also due to some interesting "features" of the simulator.

Don't start Ruby too early. Insert calls to transaction.c's Barrier_breaking() function in testTreeThroughput() (Tests.C) in lieu of HW5's barrier. Be sure to use this barrier AFTER the tree has been loaded. When Ruby loads, transactional memory will work but simulation will be very slow. Insert another Barrier_breaking() call at the end of the while loop.
Do expect simulation to take a long, long, long time. One second of simulated time = 10,000 seconds of simulation time (which is why we load Ruby at the very last possible instant).
Don't mix locks or condition variables with transactions. Unexpected and strange behavior can and will result. Rabid hobbits will be the least of your problems.
Do reduce INTER_ATOMIC_SLEEP_TIME and INTER_TRANSACTION_SLEEP_TIME to 1 and 50, respectively, in Transactions.h. Play around with these values further, if you wish -- the goal of this assignment is to think about concurrency, after all.
Do report numbers only for the throughput tests.
Don't do I/O of any kind in a transaction, and avoid calls to new/delete/malloc/free like the plague.
Do reduce the number of transactions (in Tests.h) to 10, to keep simulation time down.
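
To get a feel for the slowdown quoted in the list above, here is a small back-of-the-envelope sketch. The 10,000x factor is the one stated in the text; the function names are hypothetical, and the clock rate passed to simulated_seconds is an assumption you must supply, since the target's clock frequency is not given here.

```python
SLOWDOWN = 10_000  # from the text: 1 s of simulated time = 10,000 s of simulation


def host_seconds(simulated_seconds_value, slowdown=SLOWDOWN):
    """Wall-clock seconds needed to simulate the given amount of target time."""
    return simulated_seconds_value * slowdown


def simulated_seconds(cycles, clock_hz):
    """Convert a cycle count to target seconds.

    clock_hz is an ASSUMED target clock rate; the assignment does not state it.
    """
    return cycles / clock_hz


if __name__ == "__main__":
    # One full second of target time, expressed in hours of wall-clock time.
    print(host_seconds(1.0) / 3600.0, "hours")
```

Even one simulated second costs a few hours of wall-clock time at this rate, which is why shrinking the number of transactions and loading Ruby late both matter.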

Running your Transactional Code

In order to run your transactional code:

Set up a directory in $GEMS/microbenchmarks/transactional called ctree.
Move your lock-based binary to that directory (name it cTree_Lock), as well as your TM-based binary (call it cTree_TM).
Copy deque.simics from $GEMS/microbenchmarks/transactional/deque to $GEMS/microbenchmarks/transactional/cTree/cTree.simics.
Edit $GEMS/microbenchmarks/transactional/cTree/cTree.simics. Make the changes described above in Problem 1's Step 22, and also change the other occurrences of "deque" to "cTree" in the file.
Edit config.py again, and modify the microbenchmark_list to include the following line:
microbenchmark_list.append(("ctree", 1, 1, "ctree", ""))
Comment out the deque line.
Allow scripts for both lock-based and transactional code to be produced by un-commenting line 139, which reads:
lock_type_list.append("Lock")
Re-run gen-scripts.py
You will now find scripts to run your transactional tree in $GEMS/results/scripts.
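
For orientation, the relevant settings in config.py might end up looking roughly like the sketch below once the steps above are done. This is an assumption-laden illustration, not a copy of the real file: the empty list definitions and the ee_base_pred placeholder are stand-ins added so the fragment runs on its own (config.py already defines them), the exact form of the protocol_list entry is a guess, and any other entries the real lists contain are omitted.

```python
# Stand-ins so this fragment is self-contained. In the real config.py these
# names already exist; do NOT paste these definitions into it.
filter_config_list = []
lock_type_list = []
protocol_list = []
procs_per_chip_list = []
microbenchmark_list = []
ee_base_pred = "ee_base_pred"  # placeholder for the object config.py defines

# Perfect signatures only (Problem 1).
filter_config_list.append("Perfect_")

# Un-commented for Problem 2 so lock-based scripts are generated as well.
lock_type_list.append("Lock")

# The only protocol left enabled. Whether the real file appends this entry or
# lists it in a literal is an ASSUMPTION.
protocol_list.append(["MESI_CMP_filter_directory", ee_base_pred])

# Two-processor target machine.
procs_per_chip_list.append(2)

# The deque line is commented out; the tree microbenchmark replaces it.
# microbenchmark_list.append(("deque", 1024, 1, "1024ops-32bkoff", "1024 32"))
microbenchmark_list.append(("ctree", 1, 1, "ctree", ""))

if __name__ == "__main__":
    print(microbenchmark_list)
```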

Problem 3: Description of Synchronization Strategies

Describe where you used transactions to synchronize your code. Describe where you felt transactions weren't appropriate.

Problem 4: Evaluation

Report the simulated target's runtime (in Ruby_cycles), as well as the number of L2 cache misses (total_misses) for your lock-based and transaction-based code, in a table (and optionally a graph).

Problem 5: Questions (Submission Credit)

How many different transactions did you use? What is the purpose of each?
Which implementation -- lock-based or transaction-based -- performed best?
Which was easier overall -- writing the transactional version, or writing the fine-grained locking version? Include estimates of coding and debugging time.
How long did it take to restructure the code to avoid I/O, and calls to the memory allocator?
How many aborted transactions did you observe? What was the transaction success rate ( a number between 0 and 1 )?
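
One way to compute the success rate asked for in the last question is sketched below. It assumes the rate is defined as committed transactions divided by all attempts (commits plus aborts), so that every abort counts as one failed attempt; the function name and this definition are assumptions, so state the definition you use in your write-up.

```python
def success_rate(commits, aborts):
    """Fraction of transaction attempts that committed, in the range [0, 1].

    ASSUMED definition: commits / (commits + aborts).
    """
    if commits < 0 or aborts < 0:
        raise ValueError("counts must be non-negative")
    attempts = commits + aborts
    if attempts == 0:
        raise ValueError("no transaction attempts recorded")
    return commits / attempts


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(success_rate(3, 1))
```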

Tips and Tricks

Start early. You never know when the TA will disappear the night before the assignment is due. The road to a correctly functioning simulator is riddled with obscure error messages.

Set up RSA authentication on chianti to save yourself some keystrokes. HowTo

Start early.

Only use the throughput tests under simulation -- everything else will take WAY too long.

Be aware: if Ruby isn't loaded, your transactions aren't atomic.

Did we mention start early?

If you refuse to start early on principle, at least do Problem 1 reasonably early...

What to Hand In

Please turn this homework in on paper at the beginning of lecture.

Your table from Problem 1.

Printouts of your critical sections (transactions) from Problem 2, annotated to indicate their purpose (if it is not obvious).

The description requested in Problem 3 (with your annotations, if you wish).

Your results from Problem 4.

Answers to questions in Problem 5.

Important: Include your name on EVERY page.


Page last modified: Tuesday, 23-Oct-2007 10:51:22 CDT
Feedback or content questions: send email to Email Address of Mark Hill
Technical or accessibility issues: lab@cs.wisc.edu
Copyright © 2002, 2003 The Board of Regents of the University of Wisconsin System.
