University of Wisconsin - Madison

CS/ECE 752: Advanced Computer Architecture I

Fall 1999 Offering

Instructor Mark D. Hill and Teaching Assistant Collin McCurdy

Homework #3 Solutions


Problem 1: H&P 3.1

For these problems I was mainly interested in whether you got the pipeline for one iteration of the loop correct...

a) Without forwarding, the pipeline state for one iteration looks something like this:

                                        1 1 1 1 1 1 1 1 | 1 1
                      1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 | 8 9
     lw   r1,0(r2)    F D X M W                         |
     addi r1,r1,#1      F s s D X M W                   |
     sw   r1,0(r2)            F s s D X M W             |
     addi r2,r2,#4                  F D X M W           |
     sub  r4,r3,r2                    F s s D X M W     |
     bnz  r4,loop                           F s s D X M | W
     --------------------------------------------------------------
     lw   r1,0(r2)                                F s s | F D X M W

Here F, D, X, M and W refer to the 5 stages of the simple DLX pipeline, and 's' means stall. As an example, the 2nd instruction ('addi') has to stall until the 1st 'lw' writes back r1 (though the decode can happen in the same cycle as the writeback, since H&P assume the register file is written in the first half of the cycle and read in the second half).

Since there are 99 iterations of the loop, and each takes 17 cycles, except for the last which takes 18, the whole thing should take 99 * 17 + 1 = 1684 cycles.

If you assumed that the branch target was known after decode, then you could fetch the next iteration of the loop at cycle 16. Each iteration would then take 15 cycles, except for the last which still takes 18, so the whole thing would take 99 * 15 + 3 = 1488 cycles.
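
As a quick check of the arithmetic for both branch assumptions, here is a minimal Python sketch. The helper name total_cycles and its parameters are mine, not H&P's; it assumes every iteration but the last is overlapped by its successor, so only the last one is charged its full 18 cycles.

```python
def total_cycles(iterations: int, fetch_period: int, last_iteration: int) -> int:
    """Total cycles when a new iteration starts every `fetch_period` cycles
    and the final iteration runs for `last_iteration` cycles."""
    return (iterations - 1) * fetch_period + last_iteration

ITERATIONS = 99
LAST = 18  # the final bnz writes back in cycle 18 of its iteration

resolved_in_mem = total_cycles(ITERATIONS, 17, LAST)     # next lw fetched in cycle 18
resolved_in_decode = total_cycles(ITERATIONS, 15, LAST)  # next lw fetched in cycle 16
print(resolved_in_mem, resolved_in_decode)  # prints: 1684 1488
```

Writing the total as (iterations - 1) * period + last gives the same numbers as 99 * 17 + 1 and 99 * 15 + 3.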

b) With forwarding, the pipeline state for one iteration:

                                        1 | 1 1 1 1 1
                      1 2 3 4 5 6 7 8 9 0 | 1 2 3 4 5
     lw   r1,0(r2)    F D X M W           |
     addi r1,r1,#1      F D s X M W       |
     sw   r1,0(r2)        F s D X M W     |
     addi r2,r2,#4            F D X M W   |
     sub  r4,r3,r2              F D X M W |
     bnz  r4,loop                 F D X M | W
     ------------------------------------------------
     lw   r1,0(r2)                  F m m | F D X M W

Here 'm' represents cycles taken by misprediction. Only one data hazard stall remains, from the initial load.

Since there are 99 iterations of the loop, each taking 10 cycles, except for the last which takes 11, the new version takes 99 * 10 + 1 = 991 cycles.
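
Both diagrams can be reproduced with a small timing model. The Python sketch below is only that, a sketch: the Instr tuple, the function names and the rules are my assumptions, chosen to match the conventions in the diagrams (in-order issue, at most one instruction entering X per cycle, register file written early and read late, loads forwarding from the end of M and everything else from the end of X, branch resolved in M).

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    text: str
    dest: Optional[str]      # register written, if any
    srcs: Tuple[str, ...]    # registers read
    is_load: bool = False

LOOP = [
    Instr("lw   r1,0(r2)", "r1", ("r2",), is_load=True),
    Instr("addi r1,r1,#1", "r1", ("r1",)),
    Instr("sw   r1,0(r2)", None, ("r1", "r2")),
    Instr("addi r2,r2,#4", "r2", ("r2",)),
    Instr("sub  r4,r3,r2", "r4", ("r3", "r2")),
    Instr("bnz  r4,loop", None, ("r4",)),
]

def x_cycles(loop, forwarding):
    """Return the cycle in which each instruction occupies the X stage."""
    ready = {}    # register -> earliest cycle a consumer may be in X
    cycles = []
    prev_x = 2    # so the first instruction has F=1, D=2, X=3
    for ins in loop:
        x = prev_x + 1                       # in order, one X per cycle
        for reg in ins.srcs:
            x = max(x, ready.get(reg, 0))    # wait for in-flight producers
        cycles.append(x)
        if ins.dest is not None:
            if forwarding:
                produced = x + 1 if ins.is_load else x   # end of M or end of X
                ready[ins.dest] = produced + 1
            else:
                writeback = x + 2
                ready[ins.dest] = writeback + 1          # D overlaps W, X follows
        prev_x = x
    return cycles

def loop_cycles(loop, forwarding, iterations=99):
    per_iteration = x_cycles(loop, forwarding)[-1] + 1   # branch is in M after X
    return iterations * per_iteration + 1                # +1 for the final W

print(x_cycles(LOOP, False), loop_cycles(LOOP, False))  # [3, 6, 9, 10, 13, 16] 1684
print(x_cycles(LOOP, True), loop_cycles(LOOP, True))    # [3, 5, 6, 7, 8, 9] 991
```

The X cycles line up with the 'X' column of each instruction in the two diagrams, and the branch's M cycle (17 without forwarding, 10 with) is the iteration length.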

Problem 2: H&P 4.8

I think everyone got this: the idea was to make both implementations produce two results (from different FUs) in the same cycle. The scoreboard implementation can write the register file for both, while in the Tomasulo implementation the two results compete for the single common data bus, which is a structural hazard.

The answer book's minimal example which achieves this goal:

     multd f0, f2, f4   ; writes f0 at cycle 6
     nop
     nop
     ld    f6, 0(r1)    ; writes f6 at cycle 6

(Actually, I don't have H&P w/ me but I seem to remember that floating point ops all had a *3*-cycle execution latency...in which case, get rid of one of the 'nop's, and both will write at cycle 5.)
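
The conflict can be spelled out with a tiny Python sketch. The function name and the `buses` parameter are hypothetical, and the write cycles are simply the ones from the comments in the example above, not the output of a real timing model.

```python
from collections import defaultdict

def write_conflicts(write_cycles, buses=1):
    """Group results by the cycle in which they complete and keep only the
    cycles where more results complete than there are result buses."""
    by_cycle = defaultdict(list)
    for name, cycle in write_cycles.items():
        by_cycle[cycle].append(name)
    return {c: names for c, names in by_cycle.items() if len(names) > buses}

writes = {"multd f0": 6, "ld f6": 6}
print(write_conflicts(writes, buses=1))  # one common data bus: {6: ['multd f0', 'ld f6']}
print(write_conflicts(writes, buses=2))  # room for both results: {}
```

With one bus, one of the two results has to wait a cycle; with room for two writes, as assumed for the scoreboard, nothing stalls.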

Problem 3: MIPS R10K

a) The R10K would free p9 after L3 retires, since L3 creates a new mapping, which stays provisional until it is known that L3 wasn't issued speculatively down a wrong path (and doesn't follow an instruction that takes an exception).
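
A minimal Python sketch of this free-on-retire policy follows. Loud assumptions: r1 is taken to be the logical register mapped to p9 when L1 is renamed, the other physical register names are made up, L2 is left out, and the class and method names are mine rather than anything from the R10K documentation.

```python
class RenameSketch:
    """Toy renamer: a map table, a free list and an active list."""

    def __init__(self, initial_map, free_regs):
        self.map = dict(initial_map)   # logical -> physical
        self.free = list(free_regs)    # physical registers available for reuse
        self.active = []               # (label, old physical dest), program order

    def rename(self, label, dest, srcs):
        phys_srcs = [self.map[s] for s in srcs]   # look up sources before remapping dest
        old = self.map[dest]
        self.map[dest] = self.free.pop(0)
        self.active.append((label, old))          # remember the mapping being replaced
        return self.map[dest], phys_srcs

    def retire(self):
        label, old = self.active.pop(0)           # oldest instruction retires first
        self.free.append(old)                     # the replaced mapping is dead only now
        return label, old

r = RenameSketch({"r1": "p9", "r2": "p2", "r3": "p3", "r5": "p5"}, ["p10", "p11"])
print(r.rename("L1", "r3", ["r2", "r1"]))   # L1 reads r1 through p9
print(r.rename("L3", "r1", ["r5", "r3"]))   # L3 remaps r1; p9 becomes the old mapping
print(r.retire(), "p9" in r.free)           # L1 retires: p9 is still held
print(r.retire(), "p9" in r.free)           # L3 retires: p9 goes back on the free list
```

Because the old mapping is recorded with the instruction that replaces it, backing out of L3 only requires restoring the map entry; p9 was never handed to anyone else.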

b) The earliest p9 could be freed is after L1 *retires*, since there is no other instruction that would need the value in p9 after L1.

Some people claimed that p9 could be freed after L1 has *read* the value of p9. There are two problems with that policy:

  1. L1-L5 could have been issued speculatively, and may need to be backed out of. It would be *bad* if p9 were freed, assigned again and then written before it was discovered that the wrong path had been taken...

  2. The same argument holds for exceptions: L1 could potentially cause an exception after its operands have been read, and therefore might need to be re-executed. Again, it would be bad if p9 had been freed, assigned and written before the exception occurred.

So before we free p9, we need to know that L1 is not speculative and that neither it nor any instruction before it in the active list will cause an exception. I would argue that if there were any appreciable difference between *that* time and the time that L1 retires, the people at MIPS would have exploited that difference in *their* register freeing scheme!

c) How would you implement part (b)?

Most of the schemes that people came up with that I believed could work required compiler involvement: a compiler could easily *conservatively* determine whether a read is definitely the last read (not the same as determining all last reads...), and, if it is, convey this knowledge to the hardware through some sort of ISA extension. Then when the associated instruction retires, the physical register holding that operand could be freed. In addition to extending the ISA, this implies carrying more information in the active list about *source* registers (currently there's only information about destination regs).

There was only one proposed all-hardware solution that I was convinced would work. Many people proposed looking at the active list or even the instruction stream to determine whether there were future reads and/or writes. But they seemed to forget that in both cases the visible instructions are potentially speculative. For instance, suppose in the example the second instruction were a conditional branch to L4:

	 L1: add r3, r2, r1  ## where r3 is the destination
	 L2: bne r3, L4
	 L3: and r1, r5, r3
	 L4: or  r4, r3, r1
	 L5: xor r1, r5, r2

(I don't know if I have the MIPS syntax correct, but you get the idea.) If we speculate that the branch was untaken, the write from L3 appears both in our instruction stream and in the active list. But it's not safe to assume that the write will occur!

So the correct solution made sure that we don't look beyond branches, in effect limiting the scope of the optimization to a basic block.

You can actually do slightly better by not looking beyond *speculative* branches (there's a bit associated w/ each instruction, somewhere(!), which indicates whether it is speculative or not), but that probably doesn't buy you a whole lot in practice.
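
The basic-block restriction can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the scheme anyone actually submitted: the Op tuple and the function name are hypothetical, and the non-branch L2 in the second list is made up just to show the contrast.

```python
from typing import NamedTuple, Optional, Tuple

class Op(NamedTuple):
    label: str
    dest: Optional[str]      # register written, if any
    srcs: Tuple[str, ...]    # registers read
    is_branch: bool = False

def provably_last_read(ops, index, reg):
    """True if the read of `reg` by ops[index] is followed, within the same
    basic block, by a write to `reg` with no read of `reg` in between."""
    for op in ops[index + 1:]:
        if op.is_branch:
            return False     # never look past a branch
        if reg in op.srcs:
            return False     # a later instruction still reads the old value
        if op.dest == reg:
            return True      # overwritten before any other read
    return False             # ran out of visible instructions: stay conservative

with_branch = [
    Op("L1", "r3", ("r2", "r1")),
    Op("L2", None, ("r3",), is_branch=True),
    Op("L3", "r1", ("r5", "r3")),
]
straight_line = [with_branch[0], Op("L2", "r6", ("r3", "r2")), with_branch[2]]

print(provably_last_read(with_branch, 0, "r1"))    # False: L3's write is speculative
print(provably_last_read(straight_line, 0, "r1"))  # True: L3 overwrites r1 in the same block
```

Even when the check says True, the register still can't be freed until L1 retires, for the reasons given in part (b).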

Problem 4: Vectors

Since I didn't take many points off for this section, I've probably lost my readership by now...but I like vectors, so I'll mumble on for a bit:

a) Vector instructions will be included in future ISAs because:

a) Vector instructions will NOT be included in future ISAs because: