a) Without forwarding, the pipeline state for one iteration looks something like this:
                                    1 1 1 1 1 1 1 1 | 1 1
                  1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 | 8 9
lw r1,0(r2)       F D X M W                         |
addi r1,r1,#1       F s s D X M W                   |
sw r1,0(r2)               F s s D X M W             |
addi r2,r2,#4                   F D X M W           |
sub r4,r3,r2                      F s s D X M W     |
bnz r4,loop                             F s s D X M | W
--------------------------------------------------------------
lw r1,0(r2)                                   F s s | F D X M W
Here F, D, X, M and W refer to the 5 pipeline stages of the simple DLX pipeline, and 's' means stall. As an example, the 2nd instruction ('addi') has to stall until the 1st instruction ('lw') writes back r1. The decode can happen in the same cycle as the writeback, since the register file is assumed to be written early in the cycle and read late.
Since there are 99 iterations of the loop, and each takes 17 cycles, except for the last which takes 18, the whole thing should take 99 * 17 + 1 = 1684 cycles.
If you assumed that the branch target was known after decode, then you could fetch the next iteration of the loop at cycle 16. Each iteration would then take 15 cycles, except for the last, which still ends at cycle 18, so the whole thing would take 99 * 15 + 3 = 1488 cycles.
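Both totals follow the same pattern: every iteration costs a fixed number of cycles, and the last one pays a few extra cycles to drain the pipeline. A minimal Python sketch of that arithmetic (the function and parameter names are mine, not from the problem set):

```python
def total_cycles(iterations, cycles_per_iteration, drain_cycles):
    """Total cycles for a loop whose last iteration needs extra drain cycles."""
    return iterations * cycles_per_iteration + drain_cycles


# Branch resolved in M: the next fetch is at cycle 18, and the last
# iteration ends at cycle 18 (17 + 1).
resolved_in_mem = total_cycles(99, 17, 1)

# Branch target known after decode: the next fetch is at cycle 16, so an
# iteration costs 15 cycles, and the last one still ends at cycle 18 (15 + 3).
resolved_in_decode = total_cycles(99, 15, 3)

print(resolved_in_mem, resolved_in_decode)  # 1684 1488
```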
b) With forwarding, the pipeline state for one iteration:
                                    1 | 1 1 1 1 1
                  1 2 3 4 5 6 7 8 9 0 | 1 2 3 4 5
lw r1,0(r2)       F D X M W           |
addi r1,r1,#1       F D s X M W       |
sw r1,0(r2)           F s D X M W     |
addi r2,r2,#4             F D X M W   |
sub r4,r3,r2                F D X M W |
bnz r4,loop                   F D X M | W
------------------------------------------------
lw r1,0(r2)                     F m m | F D X M W
Here 'm' represents cycles taken by misprediction. Only one data hazard stall remains, from the initial load.
Since there are 99 iterations of the loop, each taking 10 cycles, except for the last which takes 11, the new version takes 99 * 10 + 1 = 991 cycles.
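The stage timings for one iteration, with and without forwarding, can be reproduced with a small scheduler. This is only a sketch under my own modelling assumptions, not a general DLX simulator: an instruction is fetched in the cycle its predecessor enters D, the register file is written early and read late in the same cycle, ALU results are bypassed from the end of X and load data from the end of M, and the branch is resolved in M. All names (`schedule`, `loop_cycles`, `LOOP`) are hypothetical.

```python
def schedule(program, forwarding):
    """Return a dict of stage -> cycle for each instruction of one iteration.

    program: list of (text, dest, sources, is_load) tuples.
    """
    ready = {}      # register -> first cycle in which a consumer may use it
    timeline = []
    prev = None
    for text, dest, sources, is_load in program:
        # Fetch when the previous instruction enters D; D must be free too.
        f = 1 if prev is None else prev["D"]
        d = f + 1 if prev is None else max(f + 1, prev["X"])
        operands = max((ready.get(r, 0) for r in sources), default=0)
        if forwarding:
            x = max(d + 1, operands)   # hold in D until the value is bypassed
        else:
            d = max(d, operands)       # decode no earlier than the producer's W
            x = d + 1
        stages = {"text": text, "F": f, "D": d, "X": x, "M": x + 1, "W": x + 2}
        if dest is not None:
            if forwarding:
                ready[dest] = (stages["M"] if is_load else stages["X"]) + 1
            else:
                ready[dest] = stages["W"]
        timeline.append(stages)
        prev = stages
    return timeline


LOOP = [
    ("lw r1,0(r2)",   "r1", ["r2"],       True),
    ("addi r1,r1,#1", "r1", ["r1"],       False),
    ("sw r1,0(r2)",   None, ["r1", "r2"], False),
    ("addi r2,r2,#4", "r2", ["r2"],       False),
    ("sub r4,r3,r2",  "r4", ["r3", "r2"], False),
    ("bnz r4,loop",   None, ["r4"],       False),
]


def loop_cycles(program, forwarding, iterations=99):
    branch = schedule(program, forwarding)[-1]
    # The next iteration is fetched the cycle after the branch's M stage, so
    # one iteration costs branch["M"] cycles; the last one also waits for W.
    return iterations * branch["M"] + (branch["W"] - branch["M"])


print(loop_cycles(LOOP, forwarding=False))  # 1684
print(loop_cycles(LOOP, forwarding=True))   # 991
```

Under these assumptions the branch reaches M at cycle 17 without forwarding and at cycle 10 with it, which matches the two diagrams.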
The answer book's minimal example which achieves this goal:
multd f0, f2, f4 ; writes f0 at cycle 6
nop
nop
ld f6, 0(r1) ; writes f6 at cycle 6
(Actually, I don't have H&P w/ me but I seem to remember that floating point ops all had *3* cycles execution latency...in which case, get rid of one of the 'nop's, and both will write at cycle 5.)
b) The earliest p9 could be freed is after L1 *retires*, since there is no other instruction that would need the value in p9 after L1.
Some people claimed that p9 could be freed after L1 has *read* the value of p9. There are two problems with that policy:
So before we free p9, we need to know that L1 is not speculative and that neither L1 nor any instruction before it in the active list will cause an exception. I would argue that if there were any appreciable difference between *that* time and the time that L1 retires, the people at MIPS would have exploited that difference in *their* register freeing scheme!
c) How would you implement part (b)?
Most of the schemes people came up with that I believed could work required compiler involvement: a compiler could easily *conservatively* determine whether a read is definitely the last read (not the same as determining all last reads...) and, if it is, convey this knowledge to hardware through some sort of ISA extension. Then, when the associated instruction retired, the operand could be freed. In addition to extending the ISA, this implies carrying more information in the active list about *source* registers (currently there's only information about destination regs).
There was only one proposed all-hardware solution that I was convinced would work. Many people proposed looking at the active list or even the instruction stream to determine whether there were future reads and/or writes. But they seemed to forget that in both cases the visible instructions are potentially speculative. For instance, suppose in the example the second instruction were a conditional branch to L4:
L1: add r3, r2, r1   ## where r3 is the destination
L2: bne r3, L4
L3: and r1, r5, r3
L4: or  r4, r3, r1
L5: xor r1, r5, r2
(I don't know if I have the MIPS syntax correct, but you get the idea.) If we speculate that the branch was untaken, the write from L3 appears both in our instruction stream and in the active list. But it's not safe to assume that the write will occur!
So the correct solution made sure that we don't look beyond branches, in effect limiting the scope of the optimization to a basic block.
You can actually do slightly better by not looking beyond *speculative* branches (there's a bit associated w/ each instruction, somewhere(!), which indicates whether it is speculative or not), but that probably doesn't buy you a whole lot in practice.
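The basic-block restriction can be sketched in Python as a conservative "is this the last read?" check over the instructions younger than the reader. This is an illustration of the idea only, not the R10000 hardware: the `Instr` record, the function name, and the straight-line variant of the example (made by simply dropping L2) are my assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instr:
    label: str
    dest: Optional[str]                 # logical destination register, if any
    srcs: List[str] = field(default_factory=list)
    is_branch: bool = False


def is_last_read(window, reader_index, reg):
    """Conservatively decide whether window[reader_index] is the last read of reg.

    Scans younger instructions in program order and gives up at the first
    branch, so the answer never depends on a speculative path.
    """
    for instr in window[reader_index + 1:]:
        if reg in instr.srcs:
            return False    # a later instruction still needs the value
        if instr.is_branch:
            return False    # cannot see past the end of the basic block
        if instr.dest == reg:
            return True     # overwritten before any further read
    return False            # no overwrite visible yet: keep the register


with_branch = [
    Instr("L1", "r3", ["r2", "r1"]),
    Instr("L2", None, ["r3"], is_branch=True),
    Instr("L3", "r1", ["r5", "r3"]),
    Instr("L4", "r4", ["r3", "r1"]),
    Instr("L5", "r1", ["r5", "r2"]),
]
straight_line = [i for i in with_branch if not i.is_branch]

print(is_last_read(with_branch, 0, "r1"))    # False: the branch at L2 hides L3
print(is_last_read(straight_line, 0, "r1"))  # True: L3 overwrites r1 first
```

When the check returns True, the physical register mapped to `reg` could be freed once the reading instruction retires. Every other case falls back to the normal freeing scheme.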
a) Vector instructions will be included in future ISAs because:
There *are* applications aside from scientific applications which can benefit from vectors, e.g. multimedia. Also, Asanovic mentions some success in speeding up SPECint applications, which are known to be NOT vectorizable.