UW-Madison
Computer Sciences Dept.

CS/ECE 752 Advanced Computer Architecture I Fall 2008 Section 1
Instructor: David A. Wood; T.A.: Khai Tran
URL: http://www.cs.wisc.edu/~david/courses/cs752/Fall2008/

Problem 1 (40 points)

H&P 4th ed., Chapter 2, Case Study 1, except use the following code, written in the notation from the lecture notes (destination register last):

loop: LD F2, 0(R1)
  LD F4, 0(R2)
  MULD F2, F0, F6 # F6 is dest reg.
  ADDD F6, F4, F6 # F6 is dest reg.
  LD F4, 100(R3)
  ADDD F6, F4, F2 # F2 is dest reg.
  SD F2, 20000(R1)
  ADD R1, #8, R1 # R1 is dest reg.
  ADD R2, #8, R2 # R2 is dest reg.
  BLT R1, R4, loop

Use the extra latencies in this table instead of those in the textbook:

Latencies beyond a single cycle
Instruction type                  Extra cycles
Memory LD                         +5
Memory ST                         +1
Integer ADD/SUB                   +0
Branches                          +2
ADDD                              +2
MULTD (MULD in the code above)    +5

Answer the following questions:

  1. Question 2.1 (8 points)
  2. Question 2.3 (8 points)
  3. Question 2.5 (8 points)
  4. Question 2.8 (8 points), but use the code above, not Figure 2.38
  5. (8 points) Continuing the problem above, unroll the loop once, merge the two iterations, and schedule the instructions to eliminate as many hazards as possible. Are there any remaining dependences?

Problem 2 (30 points)

In this problem you will examine the impact of the instruction set on the compiler and performance optimizations the compiler can make. Consider the short program below:


#include <stdio.h>
int main(void) {
        int i,C;
        int A[50], B[50];
        C = 0;
        for (i = 0; i < 50; i++) {
                C += A[i] * B[i];
        }
        return(C);
}

Type in this code and save it as dotp.c. Compile it on pinot.cs.wisc.edu; you must use only this machine for this problem. Compile the program with gcc using the following command:

sh> gcc -O0 -S dotp.c

This generates a file called dotp.s, which contains the assembly language source for the compiled program. Skim this file to familiarize yourself with the Sun Sparc ISA. The main features of the ISA are:

  • Every instruction specifies operands first, destination last: opcode src1, src2, dst
  • Registers are designated as %o and %g. Both are general purpose registers.
  • Appendix J describes the Sparc ISA

We want to examine what types of optimization the compiler is capable of. By using -O0 we disabled most optimizations. We will now enable all optimizations and analyze the code. Compile this program with gcc using the following command:

sh> gcc -O4 -S dotp.c

Now look at dotp.s and submit the following:

  • Add comments to the assembly code explaining what each line does, and turn in an annotated printout of the assembly listing. (20 points)
  • How many instructions are executed dynamically while executing this optimized program? (10 points)

Problem 3 (30 points): Introduction to Simplescalar

(This is an important problem. The basic aim is to introduce you to simplescalar.)

Using a CS Unix/Linux machine, download and install the Simplescalar v3.0 source code from http://www.simplescalar.com/. Descriptions of the integer benchmarks and floating point benchmarks can be found at the Standard Performance Evaluation Corporation's website. To run a benchmark, its command line arguments must be specified after the executable's name. The command line options necessary for each benchmark are at /unsup/spec2000/benchspec/benchmark.cmdlines. From there, the associated Alpha executables and data files for the integer benchmarks and floating point benchmarks are in the CINT2000 and CFP2000 directories, respectively. You will perform a simple characterization of gcc using sim-outorder.

Use the following basic configuration (two-level cache):

    L1 instruction cache: 16K 2-way set associative with 64 byte lines.
    L1 data cache: 32K 4-way set associative with 32 byte lines.
    L2 unified cache: 128K 8-way set associative with 64 byte lines.
    Branch prediction: use the default branch prediction settings.

Submit the following: report the IPC and the miss rate of each cache in the hierarchy when:

  1. the simulator is run in in-order mode using LRU replacement policy for the cache hierarchy
  2. the simulator is run in out-of-order mode using LRU replacement policy for the cache hierarchy
  3. the simulator is run in out-of-order mode, changing the cache hierarchy to the following setting:
      Two Level Cache using LRU replacement policy:
      L1 instruction cache: 8K 4-way set associative with 64 byte lines.
      L1 data cache: 16K 4-way set associative with 32 byte lines.
      L2 unified cache: 128K 4-way set associative with 64 byte lines.

For all of the above tests, fast-forward 100M instructions, then simulate the next 50M instructions. Use the cc1_base executable and the integrate.i input file for gcc.

 