Homework 4 // Due at 11AM Friday, November 5th (76 points)

You should do this assignment on your own, although you are welcome to talk to classmates in person or on Piazza about any issues you may have encountered. The standard late assignment policy applies -- you may submit up to 1 day late with a 10% penalty.

The purpose of this assignment is to give you experience with running gem5 in CHTC, a large-scale computing system on campus that uses the HTCondor batch submission service. In Homework 2, the input sizes and program you ran were intentionally kept small to make it easy to complete the simulations in a short amount of time. However, these input sizes are not realistic for the kinds of tests you'd need to run to submit a conference paper. Unfortunately, running all of your simulations on a single CPU (which you likely did for Homework 2) is not practical when each simulation takes multiple hours. This is why architects often turn to batch submission software, such as HTCondor (and large-scale computing systems like CHTC), to run many jobs in parallel! This assignment was created by Kyle Roarty and Matt Sinclair. Thus, although we have tested it and believe it to be bug-free, please contact us if you find anything amiss.

Accordingly, the goal of this assignment is to take a setup similar to that of HW2 and measure how it performs for much larger input sizes, more applications, and multiple CPU configurations on CHTC. By carrying out this assignment in gem5 + CHTC, you will be able to (a) study how each application's performance varies as the input size and CPU configuration vary and (b) demonstrate how to get your simulator runs to execute in the CHTC submission software. As before, I recommend using the stable branch of gem5 -- note that on the tutorial page we explain how to automate most of this with a Docker container we've created for you. You are welcome to do all of this outside of the Docker container, but you will need to work through some additional setup issues.

Note that while it is technically possible to complete most of this assignment without CHTC (e.g., by running all of the simulations sequentially on a single CPU), I've sized the inputs and configurations so that this approach would take far too long to be practical.

gem5 in CHTC Tutorial: We have created a tutorial on how to run gem5 in CHTC. I strongly recommend going through it before running the experiments for this assignment.

If you are unfamiliar with CHTC, I strongly recommend watching some of the tutorial videos the CHTC staff has put together (see Canvas).

System Setup: For all experiments, you should assume your system has a single CPU and uses a 2-level cache hierarchy like you created in Homework 1. The L1 caches (both I$ and D$) should be 8-way set associative and 16 KB, while the L2 cache should be unified, 16-way set associative, and 128 KB. For main memory, you should use HBM_1000_4H_1x64 as in Homework 1.

Applications: The applications we'll use to test our simulations are all based on the DAXPY (double precision aX + Y) program you ran for Homework 2. However, note that the code below differs in one important respect: the input size (N) is now taken from a command-line argument. Moreover, in addition to DAXPY, you will also be responsible for modifying the DAXPY code to create several other variants:

  • DAX: double precision aX
  • IAXPY: integer precision aX + Y. For IAXPY and IAX, use alpha = 2 and initialize X and Y with rand() instead of the floating-point random number generator.
  • IAX: integer precision aX (a sketch of this variant appears after the DAXPY listing below)
  • SAXPY: single (float) precision aX + Y
  • SAX: single (float) precision aX

Similar to Homework 2, the following code implements DAXPY in C++14.

    #include <cstdio>
    #include <cstdlib>
    #include <random>

    int main(int argc, char * argv[])
    {
      // The input size N is now read from the command line
      if (argc < 2)
      {
        fprintf(stderr, "usage: %s N\n", argv[0]);
        return 1;
      }
      const int N = atoi(argv[1]);
      double X[N], Y[N], alpha = 0.5;
      std::random_device rd; std::mt19937 gen(rd());
      std::uniform_real_distribution<> dis(1, 2);
      for (int i = 0; i < N; ++i)
      {
        X[i] = dis(gen);
        Y[i] = dis(gen);
      }

      // Start of daxpy loop
      for (int i = 0; i < N; ++i)
      {
        Y[i] = alpha * X[i] + Y[i];
      }
      // End of daxpy loop

      double sum = 0;
      for (int i = 0; i < N; ++i)
      {
        sum += Y[i];
      }
      printf("%lf\n", sum);
      return 0;
    }

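For illustration only, here is one way the IAX variant might look. This sketch is not provided code: storing the result in Y and bounding rand() to keep alpha * X[i] within int range are our own assumptions, and your hand-in versions will also need the m5ops calls discussed in Task 1 below.

    // Hypothetical sketch of iax.cpp: integer precision aX with alpha = 2 and
    // rand() initialization, mirroring the structure of the DAXPY listing above.
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char * argv[])
    {
      const int N = atoi(argv[1]);
      int X[N], Y[N], alpha = 2;
      for (int i = 0; i < N; ++i)
      {
        // rand() replaces the <random> generator; the modulus is our own
        // assumption, added only to keep alpha * X[i] from overflowing an int
        X[i] = rand() % 1000;
        Y[i] = rand() % 1000;
      }

      // Start of iax loop
      for (int i = 0; i < N; ++i)
      {
        Y[i] = alpha * X[i];
      }
      // End of iax loop

      long sum = 0;
      for (int i = 0; i < N; ++i)
      {
        sum += Y[i];
      }
      printf("%ld\n", sum);
      return 0;
    }
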
1. Your first task is to compile each application and simulate it with gem5 using the MinorCPU (in CHTC). For each application you should use an input size of 65536. This will also require you to learn how to compile your applications using HTCondor (see tutorial). In your report, for the region of interest only, report the breakdown of instructions across the different op classes for each application and provide a brief analysis of the breakdown across all 6 applications. For this, as before, grep for the appropriate stats class for MinorCPU in the stats.txt file. Note: the tutorial intentionally does not cover how to get the m5ops working -- you should spend some time thinking about how to apply what you learned in the tutorial and HW2 to get the m5ops working (your submission must use HTCondor to compile with the m5ops, not compile them on the submit node). When compiling with the m5ops, if you get a failure about TERM not being set, add the following line to your HTCondor sub file:

environment = "TERM=xterm-256color"
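
For context, this line sits alongside the other commands in the submit file. The sketch below is a hypothetical example rather than the tutorial's actual file: the executable name, log/output/error names, and resource requests are placeholders you should adapt from the tutorial.

    # Hypothetical HTCondor submit file for an m5ops compile job; all names and
    # resource values are placeholders, not the tutorial's actual settings.
    universe                = vanilla
    executable              = compile_m5ops.sh
    log                     = compile_m5ops.log
    output                  = compile_m5ops.out
    error                   = compile_m5ops.err
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    request_cpus            = 1
    request_memory          = 4GB
    request_disk            = 8GB
    environment             = "TERM=xterm-256color"
    queue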

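On the instrumentation side (as opposed to the HTCondor compilation step, which is left for you to work out), the hand-in list below names m5_dump_reset_stats() as the pseudo-instruction to use around the region of interest. The sketch below shows one hedged way to mark the ROI in DAXPY; it assumes gem5's include/ directory is on the compiler's include path and that the binary is linked against the libm5.a built under gem5's util/m5 (the exact header path and build steps can vary by gem5 version -- see HW2 and your gem5 tree).

    // Sketch only: DAXPY with the ROI wrapped in m5ops calls. Assumes the gem5
    // include path and libm5.a are supplied at compile/link time (see HW2 and
    // the tutorial); the header path may differ across gem5 versions.
    #include <cstdio>
    #include <cstdlib>
    #include <random>
    #include <gem5/m5ops.h>

    int main(int argc, char * argv[])
    {
      const int N = atoi(argv[1]);
      double X[N], Y[N], alpha = 0.5;
      std::random_device rd; std::mt19937 gen(rd());
      std::uniform_real_distribution<> dis(1, 2);
      for (int i = 0; i < N; ++i)
      {
        X[i] = dis(gen);
        Y[i] = dis(gen);
      }

      m5_dump_reset_stats(0, 0);   // dump and reset stats at the start of the ROI
      // Start of daxpy loop
      for (int i = 0; i < N; ++i)
      {
        Y[i] = alpha * X[i] + Y[i];
      }
      // End of daxpy loop
      m5_dump_reset_stats(0, 0);   // dump and reset stats again at the end of the ROI

      double sum = 0;
      for (int i = 0; i < N; ++i)
      {
        sum += Y[i];
      }
      printf("%lf\n", sum);
      return 0;
    }
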
2. Next, we want to compare MinorCPU to O3CPU (see tutorial for compilation), a more complex, out-of-order processor that is much more representative of modern processors. Thus, in this step, you should simulate each application with O3CPU in gem5, in CHTC. Note that this will require you to compile gem5 with O3CPU (in CHTC) if you haven't already. In your report, again for the region of interest only, compare and contrast how many cycles MinorCPU and O3CPU take to execute each of the 6 applications. Why do you think one CPU is faster than the other? Provide statistical evidence based on your simulation results.

3. In your report, for the O3CPU runs you obtained in step 2, compare the performance of the region of interest across all 6 applications. Which applications perform the best? Why do you think they perform the best? Again, provide statistical evidence from your simulation results.

What to Hand In

  1. Create an archive (.zip, .gz, or .tgz) of the following files:
    1. cpp files: A file for each of the 6 applications you used for testing. Each file should include the pseudo-instructions (m5_dump_reset_stats()) needed for problems 1-3. Filenames:
      • daxpy.cpp
      • dax.cpp
      • iaxpy.cpp
      • iax.cpp
      • saxpy.cpp
      • sax.cpp
    2. Any Python files you used to run your simulations.
    3. stats.txt and config.ini files for all the simulations, appropriately named to convey which file is from which run.
    4. The CHTC submission and shell scripts you used for all of the simulations and compilations, appropriately named to convey which experiment(s) you ran with them.
  2. Additionally, separate from the above archive, create a file named report.pdf that contains a 2-3 page report (single spaced, 10 point font) with answers to the above questions.
  3. Submit your archive and report to Canvas.

Grading Breakdown

Total Points: 76

  1. CHTC Submission (.sub) and shell (.sh) scripts per application run (12 points total): Each sub/sh file (for compiling the application and for launching the job that runs the application in gem5 with m5ops) is worth 1/2 point if submitted, 0 otherwise. NOTE: There are ways to write a single .sub/.sh file pair that launches all of your jobs. The tutorial does not cover that, to keep things simple, but if you prefer to do that instead, you will still receive all of the points as long as your scripts do everything needed to launch all of the jobs. See details on how to do this here.
  2. CHTC Submission (.sub) and shell (.sh) scripts for compiling m5ops (1 point total): The sub/sh file pair for compiling m5ops is worth 1 point if submitted, 0 otherwise.
  3. C++ files (12 points total):
    • daxpy.cpp (2 points)
    • dax.cpp (2 points)
    • iaxpy.cpp (2 points)
    • iax.cpp (2 points)
    • saxpy.cpp (2 points)
    • sax.cpp (2 points)
  4. CHTC Submission (.sub) and shell (.sh) scripts for compiling gem5 (2 points): 1 point each for these scripts if submitted, 0 otherwise.
  5. Stats and config files (24 points total): Each stats.txt and config.ini file is worth 1 point if submitted, 0 otherwise (24 files total: a stats.txt and config.ini for each of the 6 applications on each CPU, i.e., 12 files for MinorCPU and 12 for O3CPU).
  6. Python script(s) to run gem5 (4 points): if possible, please name the main script hw4.py.
  7. Report (21 points): Each of the 3 questions is worth 7 points. Partial credit will be given for answers that do not fully answer the question.

Additional Resources: