CS/ECE 552: Introduction to Computer Architecture
Spring 2006
Prof. Wood
Problem Set #4

Due: Wednesday, March 8 (in class)
Approximate Weight: 10% of homework grade

You should do this assignment alone.


1. In Verilog, create a register file with internal bypassing, so that a value written in one cycle can be read during that same cycle. Do this by writing an outer "wrapper" module that instantiates your existing (unchanged) register file module; the new module adds only the bypass logic. The inputs and outputs of the outer module should be the same as those of the inner module. Submit your Verilog source and your testing results.

Hint: Use assign statements for the logic that decides whether bypassing is needed, and the conditional operator ("? :") for the muxes.
Hint: Not counting the header, your new module should be no more than about five or six lines long.
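
For generating expected values to compare against simulation results, a small behavioral model can help. The Python sketch below is only an illustration of the intended behavior, not a substitute for the Verilog: the class names, the two-read-port/one-write-port interface, the argument names, and the 8-register size are all assumptions and may not match your register file.

```python
class RegFile:
    """Plain register file (assumed interface): a read in the same cycle
    as a write returns the OLD value of the register."""

    def __init__(self, num_regs=8):
        self.regs = [0] * num_regs

    def cycle(self, read1, read2, write_reg, write_data, write_en):
        # Reads sample the stored state before this cycle's write lands.
        out1, out2 = self.regs[read1], self.regs[read2]
        if write_en:
            self.regs[write_reg] = write_data
        return out1, out2


class BypassRegFile:
    """Wrapper with the same interface; the inner register file is unchanged."""

    def __init__(self, num_regs=8):
        self.inner = RegFile(num_regs)

    def cycle(self, read1, read2, write_reg, write_data, write_en):
        out1, out2 = self.inner.cycle(read1, read2, write_reg, write_data, write_en)
        # Bypass is needed when a write is enabled and targets the register being read.
        bypass1 = write_en and read1 == write_reg
        bypass2 = write_en and read2 == write_reg
        # Each output is a 2-to-1 mux between the write data and the inner read data.
        return (write_data if bypass1 else out1,
                write_data if bypass2 else out2)


rf = BypassRegFile()
print(rf.cycle(read1=1, read2=2, write_reg=1, write_data=42, write_en=True))
```

The wrapper adds only the two comparisons and the two muxes, which mirrors the structure the hints describe.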

2. Identify all of the true, anti-, and output dependences in the following segment of MIPS assembly code:

xor    $1, $2, $3
and    $4, $5, $6
sub    $7, $4, $5
add    $5, $1, $5
or     $4, $7, $4
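
One way to cross-check a hand-derived list is to apply the three definitions mechanically. The Python sketch below does this for the sequence above, under stated assumptions: the (destination, sources) tuple encoding, the function name find_dependences, and the 0-based instruction indices are illustrative choices, and only register-to-register instructions are modeled (no loads, stores, or branches). A pair is reported only when no instruction between the two writes the register in question.

```python
def find_dependences(instrs):
    """instrs: list of (dest, [sources]) in program order.
    Returns a list of (kind, earlier_index, later_index, register)."""
    deps = []
    for j, (dest_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            dest_i, srcs_i = instrs[i]
            # Registers written strictly between instruction i and instruction j.
            between = [d for d, _ in instrs[i + 1:j]]
            if dest_i in srcs_j and dest_i not in between:
                deps.append(("true", i, j, dest_i))    # read after write
            if dest_j in srcs_i and dest_j not in between:
                deps.append(("anti", i, j, dest_j))    # write after read
            if dest_i == dest_j and dest_i not in between:
                deps.append(("output", i, j, dest_i))  # write after write
    return deps


# The five instructions above, index 0 (xor) through index 4 (or).
program = [
    ("$1", ["$2", "$3"]),  # xor
    ("$4", ["$5", "$6"]),  # and
    ("$7", ["$4", "$5"]),  # sub
    ("$5", ["$1", "$5"]),  # add
    ("$4", ["$7", "$4"]),  # or
]

for dep in find_dependences(program):
    print(dep)
```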

3. For the code above, which of the dependences will manifest themselves as hazards in the pipeline in Figure 6.17 on page 395 of COD3e? How are these hazards resolved in this pipeline? Assuming the 'xor' instruction enters fetch (F) in cycle 1, in what cycle does the 'or' instruction enter writeback (W)? Show your work in a pipeline diagram. (Assume that a register read in the same cycle as a write to that register does not return the new data.)
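
If you prefer to typeset pipeline diagrams rather than draw them by hand, a short script can lay them out. The Python sketch below assumes a five-stage pipeline labeled F, D, X, M, W (only F and W are named in this assignment; the other labels are assumptions), and assumes that a stalled instruction waits in D while the instruction behind it waits in F. The instruction names and stall counts in the usage lines are hypothetical and are not the answer to any question here; you must still determine the stalls yourself.

```python
def pipeline_diagram(names, stalls):
    """names: instruction labels in program order.
    stalls: extra cycles each instruction waits in D (assumed stall model).
    Returns (text_lines, writeback_cycles)."""
    rows, wb_cycles = [], []
    f_enter, d_enter = 1, 2  # first instruction: F in cycle 1, D in cycle 2
    for name, stall in zip(names, stalls):
        x_cycle = d_enter + 1 + stall
        row = {c: "F" for c in range(f_enter, d_enter)}
        row.update({c: "D" for c in range(d_enter, x_cycle)})
        row.update({x_cycle: "X", x_cycle + 1: "M", x_cycle + 2: "W"})
        rows.append((name, row))
        wb_cycles.append(x_cycle + 2)
        # The next instruction enters F when this one enters D,
        # and enters D when this one enters X.
        f_enter, d_enter = d_enter, x_cycle
    last = max(wb_cycles)
    lines = ["cycle " + " ".join(f"{c:>2}" for c in range(1, last + 1))]
    for name, row in rows:
        cells = " ".join(f"{row.get(c, '.'):>2}" for c in range(1, last + 1))
        lines.append(f"{name:<5} {cells}")
    return lines, wb_cycles


# Hypothetical example: the second instruction stalls for two cycles.
lines, wb = pipeline_diagram(["i1", "i2", "i3"], [0, 2, 0])
print("\n".join(lines))
print(wb)  # [5, 8, 9]
```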

4. How does your answer for question 3 change if you consider the pipeline in Figure 6.36 on page 416 of COD3e? (Assume that the register file contains internal bypassing and can read and write the same register in the same cycle and get the new data.)

5. Consider the following code sequence and the datapath in Figure 6.27 on page 404 of COD3e. Assuming the first instruction is fetched in cycle 1 and the branch is not taken, in which cycle does the 'and' instruction write its value to the register file? What if the branch IS taken? (Assume no branch prediction.) Show pipeline diagrams.

        beq    $2, $3, foo
        add    $3, $4, $5
        sub    $5, $6, $7
        or     $7, $8, $9
foo:    and    $5, $6, $7

6. Redo question 5, but extend the datapath and control to support "predict-not-taken" static branch prediction.

7. Redo question 6, but using the datapath in Figure 6.41 on page 427 of COD3e.

8. Consider the pipeline in Figure 6.27 on page 404; assume predict-not-taken for branches and assume a "Hazard detection unit" in the ID stage as shown on page 461. Can an attempt to flush and an attempt to stall occur simultaneously? If so, do they result in conflicting actions, cooperating actions, or both? If there are any cooperating actions, how do they work together? If there are any conflicting actions, which should take priority? What would you do in the design to make sure this works correctly? You may want to consider the following code sequence to help you answer this question:

        beq $1, $2, TARGET  #assume that the branch is taken
        lw  $3, 40($4)
        add $2, $3, $4
        sw  $2, 40($4)
TARGET: or  $1, $1, $2

9. Consider a pipeline where branches are predicted not-taken, and a taken branch introduces a three-cycle penalty. Suppose you are considering adding a branch delay slot to your instruction set architecture, so that taken branches would have only a two-cycle penalty. Consider the following three fragments of code:

Fragment 1:

        movei $3, 10  // End-of-loop count
        .
        .
        .
        addi $2, $2, 1
        beq $2, $3, Target
        .
        .
        .
Target: ...

Fragment 2:

        add $2, $2, $8
        beq $2, $3, Target
        lw $4, 0($7)
        .
        .
        .
Target: sub $4, $5, $6
        ...

Fragment 3:

        add $2, $2, $8
        beq $2, $3, Target
        lw $4, 0($8)
        .
        .
        .
Target: lw $5, 0($7)
        ...

Rearrange or rewrite each of the fragments so that it works correctly with a branch delay slot and maximizes performance. (The dots represent an unknown amount of other code that you can't change.) What is the average number of cycles saved or lost in each case by using the delayed-branch architecture? (Assume branches are taken 60% of the time.)
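
One possible way to set up the arithmetic is sketched below in Python. The function name avg_branch_penalty and the wasted_slots parameter are illustrative assumptions, as is the accounting itself: the average cost per branch is taken to be the taken-branch penalty weighted by the taken probability, plus the expected number of delay-slot cycles that do no useful work. Deciding how much of the slot is wasted for each fragment is the part you must work out from your rewritten code.

```python
def avg_branch_penalty(p_taken, taken_penalty, wasted_slots=0.0):
    """Average extra cycles per branch.
    wasted_slots: expected delay-slot cycles per branch that do no useful work
    (0.0 if the slot always holds a useful instruction, 1.0 if it is always a nop)."""
    return p_taken * taken_penalty + wasted_slots


baseline = avg_branch_penalty(0.6, 3)                        # no delay slot
slot_useful = avg_branch_penalty(0.6, 2)                     # slot always useful
slot_nop = avg_branch_penalty(0.6, 2, wasted_slots=1.0)      # slot always a nop

# Positive means cycles saved by the delayed branch; negative means cycles lost.
print(round(baseline - slot_useful, 2))
print(round(baseline - slot_nop, 2))
```

Under this accounting, a slot that is useful on only one of the two paths falls between these two extremes, with wasted_slots equal to the probability of the other path.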