Homework 3 // Due at 1PM Friday, October 2nd (48 points)
You should do this assignment on your own, although you are welcome to talk to classmates in person or on Piazza about any issues you may have encountered. The standard late assignment policy applies -- you may submit up to 1 day late with a 10% penalty.
The purpose of this assignment is to give you experience with pipelined CPUs. You will simulate a given program with the TimingSimple CPU to understand the program's instruction mix. Then, you will simulate the same program with a pipelined in-order CPU to understand how the latency and bandwidth of different parts of the pipeline affect performance. You will also be exposed to the pseudo-instructions that gem5 uses to carry out experiment-support functions such as dumping and resetting statistics. This homework is based on exercise 3.6 of CA:AQA 3rd edition (the former textbook for this course) and was developed in part by Jason Lowe-Power, Nilay Vaish, and David Wood, then modernized by me (Matt Sinclair) and Jason Lowe-Power.
1. The DAXPY loop (double-precision aX + Y) is an oft-used operation in programs that work with matrices and vectors. The following code implements DAXPY in C++14.
#include <cstdio>
#include <random>

int main()
{
    const int N = 1000;
    double X[N], Y[N], alpha = 0.5;
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(1, 2);
    for (int i = 0; i < N; ++i)
    {
        X[i] = dis(gen);
        Y[i] = dis(gen);
    }
    // Start of daxpy loop
    for (int i = 0; i < N; ++i)
    {
        Y[i] = alpha * X[i] + Y[i];
    }
    // End of daxpy loop
    double sum = 0;
    for (int i = 0; i < N; ++i)
    {
        sum += Y[i];
    }
    printf("%lf\n", sum);
    return 0;
}
Your first task is to compile this code statically and simulate it with gem5 using the TimingSimple CPU. In your report, give the breakdown of instructions across the different op classes and provide a brief analysis of that breakdown. For this, grep for op_class in the file stats.txt.
2. Generate the assembly code for the DAXPY program above by using the -S and -O3 options when compiling with g++. As you can see from the assembly code, instructions that are not central to the actual task of the program (computing aX + Y) will also be simulated. This includes the instructions for generating the vectors X and Y, summing the elements of Y, and printing the sum. When I compiled the code with -S, I got about 320 lines of assembly with -O2 and about 500 lines with -O3, with only about 15-20 lines for the actual DAXPY loop.
Usually, while carrying out experiments to evaluate a design, one would like to look only at statistics for the portion of the code that is most important. This part of the code is also known as the region of interest (ROI). To look only at the region of interest, programs are typically annotated so that the simulator, on reaching the beginning of an annotated portion of the code, carries out actions such as creating a checkpoint or dumping and resetting statistical variables. This ensures that our stats are representative of the region of the code we care about, instead of being mixed with the stats for parts we are not focused on (e.g., generating the vectors in DAXPY).
To learn how to reset the stats in gem5, you will edit the C++ code from the first part to dump and reset stats just before the start of the DAXPY loop and just after it. For this, include the file m5op.h in the program (you will find this file in the include/gem5 directory of the gem5 repository). Use the function m5_dump_reset_stats() from this file in your program. This function dumps the statistical variables and then resets them. You can provide 0 as the value for both the delay and the period arguments. If you want to learn more about m5ops, here is a good place to start (NOTE: the linked documentation for m5ops suggests using them in the "old" way that is no longer supported, so caveat emptor).
To provide the definition of m5_dump_reset_stats(), go to the directory $GEM5_ROOT/util/m5/src/x86 and edit the SConsopts file in the following way:
--- a/util/m5/src/x86/SConsopts
+++ b/util/m5/src/x86/SConsopts
@@ -27,7 +27,7 @@ Import('*')
env['VARIANT'] = 'x86'
get_variant_opt('CROSS_COMPILE', '')
-env.Append(CFLAGS='-DM5OP_ADDR=0xFFFF0000')
+#env.Append(CFLAGS='-DM5OP_ADDR=0xFFFF0000')
env['CALL_TYPE']['inst'].impl('m5op.S')
env['CALL_TYPE']['addr'].impl('m5op_addr.S', default=True)
Execute the command 'scons build/x86/out/m5' in the $GEM5_ROOT/util/m5/ directory. This will create an object file named m5op.o (in $GEM5_ROOT/util/m5/build/x86/x86/) and another object file named m5_mmap.o (in $GEM5_ROOT/util/m5/build/x86/). Link these files with the DAXPY program (compile with g++). Now simulate the program again with the TimingSimple CPU. This time you should see three sets of statistics in the file stats.txt. In your report, give the breakdown of instructions among the different op classes for the three parts of the program. Also provide the fragment of the generated assembly code that starts with one call to m5_dump_reset_stats() and ends with the other, with the main DAXPY loop in between.
More information on m5ops, for those who are interested: technically, the above change is only necessary for the KVM CPU model, which we are not using for this assignment. The change matters for KVM because the M5OP_ADDR line tells the (guest) binary to access the m5ops through memory-mapped I/O instead of magic instructions (the m5ops are known as magic instructions because they are specific, hardcoded instructions that the simulator looks for, with "magic" behavior such as resetting counters when called). Since the KVM CPU runs parts of the simulation on real hardware, it does not know about magic instructions. Hence, memory-mapped I/O is necessary to get m5ops working with it.
Alternative: If you want to link to a static version of the m5ops library instead, you can do the following:
cd gem5/util/m5 && scons build/x86/out/libm5.a
make
This essentially replaces the step that creates m5op.o and m5_mmap.o; link against libm5.a instead.
3. As the tutorial with Homework 1 discussed, gem5 supports several different types of CPUs: atomic, TimingSimple, out-of-order, in-order, and KVM. Let's talk about the timing and the in-order CPUs. The TimingSimple CPU executes each arithmetic instruction in a single cycle but requires multiple cycles for memory accesses. It is also not pipelined, so only a single instruction is in flight at any time. The in-order CPU (also known as MinorCPU) executes instructions in a pipelined fashion with the following pipe stages: fetch1, fetch2, decode, and execute. Remember, as discussed in Homework 1, you must add MinorCPU to the command line to get it to compile.
If you did not already do so for Homework 1, take a look at the file MinorCPU.py. In the definition of MinorFU, the class for functional units, two quantities are defined: opLat and issueLat. From the comments provided in the file, understand how these two parameters are used. Also note the different functional units that are instantiated, as defined in the class MinorDefaultFUPool.
Assume that the issueLat and the opLat of the FloatSimdFU can each vary from 1 to 6 cycles and that they always sum to 7 cycles (e.g., issueLat = 1, opLat = 6). Each unit decrease in opLat costs a unit increase in issueLat (e.g., issueLat = 2, opLat = 5). In your report, answer: which design of the FloatSimd functional unit would you prefer? Provide statistical evidence obtained through simulations of the annotated portion of the code.
You can find a skeleton file that extends the minor CPU here: http://pages.cs.wisc.edu/~sinclair/courses/cs752/fall2020/handouts/hw/hw3/cpu.py (You can also get this file, if you are logged into a CSL machine, from: /u/s/i/sinclair/public/html-s/courses/cs752/fall2020/handouts/hw/hw3/cpu.py). If you use this file, you will have to modify your config scripts to work with it. Also, you'll have to modify this file to support the next part.
4. By default, the Minor CPU has two integer functional units, as defined in the file MinorCPU.py (ignore the multiplication and division units). Assume our original Minor CPU design requires 2 cycles for integer operations and 4 cycles for floating-point operations. In our upcoming Minor CPU, we can halve either of these latencies. In your report, answer: which one should we go for? Provide statistical evidence obtained through simulations.
What to Hand In
- Create an archive (.zip, .gz, or .tgz) of the following files:
  - A file named daxpy.cpp which is used for testing. This file should also include the pseudo-instructions (m5_dump_reset_stats()) as asked in part 2. Also provide a file daxpy.s with the fragment of the generated assembly code as asked for in part 2.
  - Any Python files you used to run your simulations.
  - stats.txt and config.ini files for all the simulations, appropriately named to convey which file is from which run.
  - The Makefile you used to compile your benchmark.
- Additionally, separate from the above archive, create a file named report.pdf that contains a short report (400 words) with answers to the above questions.
- Submit your archive and report to Canvas.
Grading Breakdown
Total Points: 48
- Stats files (20 points total): each stats file is worth 1 point if it is submitted, 0 otherwise.
- daxpy.cpp (2 points)
- cpu.py (4 points) (or equivalent script(s) if you chose not to modify this one)
- Makefile (2 points)
- Report (20 points): each of the 4 questions is worth 5 points. Partial credit will be given for answers that do not fully answer the question.