Problem 1 (40 points)
H&P 4th ed., Chapter 2, Case Study 1, except use the following code,
written in the notation from the lecture notes:
loop:   LD    F2, 0(R1)
        LD    F4, 0(R2)
        MULD  F2, F0, F6       # F6 is dest reg.
        ADDD  F6, F4, F6       # F6 is dest reg.
        LD    F4, 100(R3)
        ADDD  F6, F4, F2       # F2 is dest reg.
        SD    F2, 20000(R1)
        ADD   R1, #8, R1       # R1 is dest reg.
        ADD   R2, #8, R2       # R2 is dest reg.
        BLT   R1, R4, loop
And use the extra latencies in this table instead:
Latencies beyond a single cycle
    Memory LD          +5
    Memory ST          +1
    Integer ADD/SUB    +0
    Branches           +2
    ADDD               +2
    MULTD              +5
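One way to read the table, sketched in C below, is that a producer's result becomes usable 1 + (extra latency) cycles after the producer issues. This is an assumption about the intended timing model, not something the table states, so check it against the lecture notes. The mnemonic keys and both function names are hypothetical, and the sketch also assumes that MULD in the code listing is the MULTD row of the table.

```c
#include <stddef.h>
#include <string.h>

/* Extra latencies from the table (cycles beyond the first).
 * Keys use the mnemonics from the code listing; treating MULD as the
 * table's MULTD row is an assumption. */
struct latency {
    const char *op;
    int extra;
};

const struct latency LATENCIES[] = {
    {"LD", 5}, {"SD", 1}, {"ADD", 0}, {"BLT", 2}, {"ADDD", 2}, {"MULD", 5},
};

/* Returns the extra latency for an opcode, or -1 if it is not listed. */
int extra_latency(const char *op) {
    size_t count = sizeof(LATENCIES) / sizeof(LATENCIES[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(LATENCIES[i].op, op) == 0) {
            return LATENCIES[i].extra;
        }
    }
    return -1;
}

/* ASSUMED model: a dependent instruction can issue no earlier than
 * producer_issue_cycle + 1 + extra latency of the producer.
 * Returns -1 for an opcode that is not in the table. */
int earliest_dependent_issue(int producer_issue_cycle, const char *producer_op) {
    int extra = extra_latency(producer_op);
    if (extra < 0) {
        return -1;
    }
    return producer_issue_cycle + 1 + extra;
}
```

Under that assumption, a consumer of an LD issued in cycle 1 could issue in cycle 7 at the earliest, while a consumer of an integer ADD could issue in the very next cycle.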
Answer the following questions:
- Question 2.1 (8 points)
- Question 2.3 (8 points)
- Question 2.5 (8 points)
- Question 2.8 (8 points), but use the code above, not Figure 2.38
- (8 points) Continuing the problem above, unroll the loop once, merge the two
  iterations, and schedule the instructions to eliminate as many hazards as
  possible. Are there any remaining dependences?
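As a reminder of what "unroll once and merge" means, here is a source-level sketch in C on an unrelated summation loop. It is not the assembly loop above, and both function names are hypothetical. The unrolled version does two elements per trip through the loop, so it executes half as many branches and index updates, and it uses separate accumulators so the two copies of the body do not depend on each other.

```c
#include <stddef.h>

/* Original loop: one element per iteration. */
long sum_rolled(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Unrolled once: two elements per iteration, one branch per pair.
 * Separate accumulators keep the two merged bodies independent. */
long sum_unrolled_once(const int *a, size_t n) {
    long sum0 = 0;
    long sum1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        sum0 += a[i];
        sum1 += a[i + 1];
    }
    if (i < n) {
        sum0 += a[i]; /* leftover element when n is odd */
    }
    return sum0 + sum1;
}
```

The same idea applies at the assembly level: the second copy of the body needs its own registers and adjusted offsets before the instructions can be interleaved.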
Problem 2 (30 points)
In this problem you will examine how the instruction set affects the compiler and the performance optimizations it can
make. Consider the short program below:
#include <stdio.h>

int main(void) {
    int i, C;
    int A[50], B[50];

    C = 0;
    for (i = 0; i < 50; i++) {
        C += A[i] * B[i];
    }
    return(C);
}
Type in this code and name this program dotp.c. Now compile it on pinot.cs.wisc.edu. You must use only this machine for this problem. Compile this program with gcc using the following command:
sh> gcc -O0 -S dotp.c
This will generate a file called dotp.s, which contains the assembly language source for the compiled program. You can skim this file to familiarize yourself with the Sun SPARC ISA. The main features of the ISA are:
- Every instruction specifies operands first, destination last: opcode src1, src2, dst
- Registers are designated as %o and %g. Both are general purpose registers.
- Appendix J describes the SPARC ISA
We want to examine what types of optimization the compiler is capable of. By using -O0 we disabled most optimizations. We will now enable all optimizations and analyze the code. Compile this program with gcc using the following command:
sh> gcc -O4 -S dotp.c
Now look at dotp.s and submit the following:
- Add your comments to the assembly code to explain what each line is doing, and turn in an annotated printout of the assembly listing. (20 points)
- How many instructions are executed dynamically while executing this optimized program? (10 points)
Problem 3: Introduction to SimpleScalar (30 points)
(This is an important problem. The basic aim is to introduce you to SimpleScalar.)
Using a CS Unix/Linux machine, download and install the SimpleScalar v3.0 source code from http://www.simplescalar.com/.
Descriptions of the Integer Benchmarks and Floating Point Benchmarks can be found at the Standard Performance Evaluation Corporation's website. To run a benchmark, its command line arguments must be specified after the executable's name. The command line options necessary for each benchmark are at /unsup/spec2000/benchspec/benchmark.cmdlines. From there, the associated Alpha executables and data files for the integer benchmarks and floating point benchmarks are in the CINT2000 and CFP2000 directories respectively. You will perform a simple characterization of gcc using sim-outorder.
Use the following basic configuration:
Two Level Cache:
L1 instruction cache: 16K 2-way set associative with 64 byte lines.
L1 data cache: 32K 4-way set associative with 32 byte lines.
L2 unified cache: 128K 8-way set associative with 64 byte lines.
Branch prediction: use the default branch prediction settings.
Submit the following: report the IPC and the cache hierarchy miss rates when
- the simulator is run in in-order mode using LRU replacement policy for the cache hierarchy
- the simulator is run in out-of-order mode using LRU replacement policy for the cache hierarchy
- the simulator is run in out-of-order mode, changing the cache hierarchy to the following setting:
Two Level Cache using LRU replacement policy:
L1 instruction cache: 8K 4-way set associative with 64 byte lines.
L1 data cache: 16K 4-way set associative with 32 byte lines.
L2 unified cache: 128K 4-way set associative with 64 byte lines.
For all of the above tests, fast-forward 100M instructions, then simulate the next 50M instructions. Use the cc1_base executable and the integrate.i input file for gcc.
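The cache sizes above are total capacities, while sim-outorder's cache options are, as far as I recall, given as a number of sets in a string of the form <name>:<nsets>:<bsize>:<assoc>:<repl>, with l selecting LRU. Treat that format as an assumption and confirm it against the simulator's own help output. The C sketch below does the conversion; both function names are hypothetical.

```c
#include <stdio.h>

/* Number of sets = capacity / (line size * associativity). */
unsigned cache_sets(unsigned size_bytes, unsigned line_bytes, unsigned assoc) {
    return size_bytes / (line_bytes * assoc);
}

/* Writes "<name>:<nsets>:<bsize>:<assoc>:<repl>" into buf.
 * The string layout is assumed from SimpleScalar's cache options;
 * verify it before relying on it. Returns snprintf's result. */
int cache_config(char *buf, size_t len, const char *name,
                 unsigned size_bytes, unsigned line_bytes,
                 unsigned assoc, char repl) {
    return snprintf(buf, len, "%s:%u:%u:%u:%c", name,
                    cache_sets(size_bytes, line_bytes, assoc),
                    line_bytes, assoc, repl);
}
```

For the basic configuration this gives 128 sets for the L1 instruction cache, 256 for the L1 data cache, and 256 for the L2. For the modified hierarchy it gives 32, 128, and 512 sets respectively.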