Computer Sciences Dept.

CS/ECE 552 Introduction to Computer Architecture Fall 2006 Section 2
Instructor Mark D. Hill and T. A. Derek Hower
URL: http://www.cs.wisc.edu/~markhill/cs552/Fall2006/

Homework 4 // Due at Lecture Tues Oct 31


Problem 1

Using Verilog, design an 8-by-16-bit register file. Figure 1 gives the high-level interface. It has one write port, two read ports, three register select inputs (two for read and one for write), a write enable, a reset, and a clock input. All register state changes occur on the rising edge of the clock. As always, your basic building block must be the D flip-flop. The read ports should be purely combinational logic. Do not use tri-state logic in your design.

                         +--------------------+
                         |                    |
       ReadSel0[2:0] >---|                    |----> DataOut0[15:0]
       ReadSel1[2:0] >---|                    |
                         |                    |
       WriteSel[2:0] >---|                    |
        DataIn[15:0] >---|                    |----> DataOut1[15:0]
                         |                    |
               write >---|                    |
                 clk >---|                    |
                 rst >---|                    |
                         +--------------------+

Use the following top-level module shell exactly:

module regfile(input clk, rst, write,
               input [2:0] ReadSel0, ReadSel1, WriteSel,
               input [15:0] DataIn,
               output [15:0] DataOut0, DataOut1);

               // code here
endmodule

When the write enable is asserted (high), the register selected by WriteSel will be written with the data on the DataIn port. The write occurs on the next rising clock edge; write data cannot flow through to a read port during the same cycle. Data will always be present on the DataOut ports regardless of whether or not write is high.

The reset signal is synchronous and when asserted (active high), resets all the register values to 0.

You must use a hierarchical design. Design a 16-bit register first, and then put 8 of them together with additional logic to build the register file.

For simulation purposes, any signal that is wider than one bit should be represented as a bus going into or out of your system. For a 16-bit bus, there should not be 16 separate signals on your trace output. Make sure that every register gets read and written properly, and that each bit of each register has been both low and high at least once. A simultaneous read and write on the same register must work properly, as must a read and a write in the same cycle on different registers.
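When justifying a testing methodology, it can help to compare the simulation trace against a small behavioral reference model. The following Python sketch is one possible model of the behavior described above, not a Verilog design and not part of the assignment. The class name RegFileModel, its method names, the zero-initialized state, and the choice to let reset win over a simultaneous write are all assumptions made for illustration; the handout does not specify them.

```python
class RegFileModel:
    """Hypothetical behavioral model of the 8 x 16-bit register file."""

    def __init__(self, depth=8, width=16):
        self.depth = depth
        self.mask = (1 << width) - 1
        # Real hardware powers up with unknown contents; the model
        # starts at 0 purely for convenience.
        self.regs = [0] * depth

    def read(self, sel0, sel1):
        """Combinational read ports: (DataOut0, DataOut1) from current state."""
        return self.regs[sel0], self.regs[sel1]

    def rising_edge(self, rst, write, write_sel, data_in):
        """Update state as it would change on one rising clock edge."""
        if rst:
            # Assumption: synchronous reset takes priority over write.
            self.regs = [0] * self.depth
        elif write:
            self.regs[write_sel] = data_in & self.mask


model = RegFileModel()
model.rising_edge(rst=1, write=0, write_sel=0, data_in=0)

# While write is asserted but before the clock edge, the read port
# still shows the old value (no flow-through).
before = model.read(3, 3)
model.rising_edge(rst=0, write=1, write_sel=3, data_in=0xBEEF)
after = model.read(3, 0)
```

Here `before` holds the old contents of register 3 and `after` holds the newly written value, which mirrors the simultaneous read-and-write case the simulation must cover.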

For extra credit, you can parameterize the register file so that it can have an arbitrary width and height using the Verilog parameter feature. If you choose to implement this, set the default parameters to 8x16.

You should hand in:

  1. Electronic copies of all Verilog, DO, and/or Tcl files used in your design. Submit the files by copying them to ~cs552-2/public/dropbox/HW4/P1/<your login id>.
  2. Annotated simulation results, in the form of a simulation wave trace or script output, that shows the design working. If you implemented the extra credit, please make a note.
  3. A brief justification of your testing methodology.

Problem 2

In Verilog, create a register file that includes internal bypassing so that results written in one cycle can be read during the same cycle. Do this by writing an outer "wrapper" module that instantiates your existing (unchanged) register file module; your new module will just add the bypass logic. The list of inputs and outputs of the outer module should be the same as that of the inner module. Submit your Verilog source and your testing results.
Hint: Not counting the header, your new module should be no more than about five or six lines long.

Use the following module header exactly:

module bp_regfile(input clk, rst, write,
                  input [2:0] ReadSel0, ReadSel1, WriteSel,
                  input [15:0] DataIn,
                  output [15:0] DataOut0, DataOut1);

                  // code here
endmodule
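
The intended bypass behavior for a single read port can be stated as a small truth function. The Python sketch below is only a behavioral description for checking simulation results, not the Verilog wrapper itself. The function name bypass_read and its argument names are hypothetical, and the sketch does not model how bypassing should interact with an asserted reset, which the handout leaves unspecified.

```python
def bypass_read(read_sel, inner_out, write, write_sel, data_in):
    """Value one bypassing read port would drive (hypothetical model).

    inner_out is what the unchanged inner register file presents on
    this port for read_sel during the current cycle.
    """
    if write and read_sel == write_sel:
        # The register being read is also being written this cycle,
        # so forward the incoming data instead of the stored value.
        return data_in & 0xFFFF
    return inner_out


forwarded = bypass_read(read_sel=2, inner_out=0x1111,
                        write=1, write_sel=2, data_in=0xABCD)
not_forwarded = bypass_read(read_sel=5, inner_out=0x2222,
                            write=1, write_sel=2, data_in=0xABCD)
```

A test plan for the wrapper would exercise both outcomes on each read port, plus the case where write is low and the select inputs happen to match.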

You should hand in:

  1. Electronic copies of all Verilog, DO, and/or Tcl files used in your design. Submit the files by copying them to ~cs552-2/public/dropbox/HW4/P2/<your login id>.
  2. Annotated simulation results, in the form of a simulation wave trace or script output, that shows the design working.
  3. A brief justification of your testing methodology.

Problem 3

Indicate all of the true, anti-, and output-dependences in the following segment of MIPS assembly code:

xor    $1, $2, $3
and    $4, $5, $6
sub    $7, $4, $5
add    $5, $1, $5
or     $4, $7, $4

For the code above, which of the dependences will manifest themselves as hazards in the pipeline in Figure 6.17 on page 395 of COD3e? How are these hazards resolved in this pipeline? Assuming the 'xor' instruction enters fetch (F) in cycle 1, in what cycle does the 'or' instruction enter writeback (W)? Show your work in a pipeline diagram. (Assume that the register file cannot read and write the same register in the same cycle and get the new data.)

How does your answer change if you consider the pipeline in Figure 6.36 on page 416 of COD3e? (Assume that the register file contains internal bypassing and can read and write the same register in the same cycle and get the new data.)
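
As a reminder of how the three dependence classes are defined, the following Python sketch classifies register dependences in a short instruction list. It is an illustration under stated assumptions: the function name find_dependences and the tuple encoding are hypothetical, the three-instruction example is made up rather than taken from this problem, and the sketch reports every pairwise dependence without filtering out pairs that have an intervening write to the same register. It also ignores the special behavior of $0.

```python
def find_dependences(program):
    """Classify register dependences between instruction pairs.

    program is a list of (opcode, dest, src1, src2) tuples in program
    order. Returns (kind, earlier_index, later_index, register) tuples.
    """
    deps = []
    for i, (_, dest_i, *srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            _, dest_j, *srcs_j = program[j]
            if dest_i in srcs_j:
                # Later instruction reads what the earlier one writes.
                deps.append(("true", i, j, dest_i))
            if dest_j in srcs_i:
                # Later instruction writes what the earlier one reads.
                deps.append(("anti", i, j, dest_j))
            if dest_i == dest_j:
                # Both instructions write the same register.
                deps.append(("output", i, j, dest_i))
    return deps


# Made-up example, unrelated to the code in this problem.
example = [
    ("add", "$1", "$2", "$3"),
    ("sub", "$2", "$1", "$4"),
    ("or",  "$1", "$5", "$6"),
]
example_deps = find_dependences(example)
```

Whether a reported dependence becomes a hazard still depends on the pipeline, which is the part of the question the sketch does not address.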


Problem 4

Consider the following code sequence and the datapath in Figure 6.27 on page 404 of COD3e. Assuming the first instruction is fetched in cycle 1 and the branch is not taken, in which cycle does the 'and' instruction write its value to the register file? What if the branch IS taken? (Assume no branch prediction.) Show pipeline diagrams.

        beq    $2, $3, foo
        add    $3, $4, $5
        sub    $5, $6, $7
        or     $7, $8, $9
foo:    and    $5, $6, $7


Problem 5

Consider the pipeline in Figure 6.27 on page 404; assume predict-not-taken for branches and assume a "Hazard detection unit" in the ID stage as shown on page 461. Can an attempt to flush and an attempt to stall occur simultaneously? If so, do they result in conflicting actions and/or cooperating actions? If there are any cooperating actions, how do they work together? If there are any conflicting actions, which should take priority? What would you do in the design to make sure this works correctly? You may want to consider the following code sequence to help you answer this question:

        beq $1, $2, TARGET  #assume that the branch is taken
        lw  $3, 40($4)
        add $2, $3, $4
        sw  $2, 40($4)
TARGET: or  $1, $1, $2

Problem 6

Consider a pipeline where branches are predicted not-taken, and a taken branch introduces a three-cycle penalty. Suppose you are considering adding a delayed branch slot to your instruction set architecture, so that taken branches would have only a two-cycle penalty. Consider the following three fragments of code:

Fragment 1:


        movei $3, 10  // End-of-loop count
        .
        .
        .
        addi $2, $2, 1
        beq $2, $3, Target
        .
        .
        .
Target: ...

Fragment 2:

        add $2, $2, $8
        beq $2, $3, Target
        lw $4, 0($7)
        .
        .
        .
Target: sub $4, $5, $6
        ...

Fragment 3:

        add $2, $2, $8
        beq $2, $3, Target
        lw $4, 0($8)
        .
        .
        .
Target: lw $5, 0($7)
        ...

Re-arrange or re-write each of the fragments so that it will work correctly with a branch delay slot and maximize performance. (The dots represent an unknown amount of other code that you can't change.) What is the average number of cycles that were saved or lost in each case if you used the delayed branch architecture? (Assume branches are taken 60% of the time.)
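
One way to organize the cycle accounting is to compute the expected extra cycles per branch from the taken probability, the taken penalty, and any cycles wasted in the delay slot. The Python sketch below is a simple accounting model, not a definitive answer for any fragment. The function name expected_branch_cycles and the parameter wasted_slot_cycles are hypothetical, and the model assumes a not-taken branch costs nothing beyond whatever is wasted in the slot.

```python
def expected_branch_cycles(p_taken, taken_penalty, wasted_slot_cycles=0.0):
    """Expected extra cycles per branch execution (hypothetical model).

    wasted_slot_cycles is the average number of delay-slot cycles per
    branch that do no useful work, for example 1.0 if the slot always
    holds a nop, or 0.0 if it always holds a useful instruction.
    """
    return p_taken * taken_penalty + wasted_slot_cycles


# Original architecture: three-cycle penalty on the 60% taken branches.
baseline = expected_branch_cycles(0.6, 3)

# Delayed branch with a slot instruction that is always useful.
useful_slot = expected_branch_cycles(0.6, 2)

# Delayed branch where the slot can only be filled with a nop.
nop_slot = expected_branch_cycles(0.6, 2, wasted_slot_cycles=1.0)

saved_with_useful_slot = baseline - useful_slot
saved_with_nop_slot = baseline - nop_slot  # negative means cycles lost
```

Deciding which case applies to each fragment, including slots that are useful on only one branch outcome, is the part that requires examining the code.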

While a good idea at the time, branch delay slots are discouraged in modern processors with deep pipelines in favor of dynamic branch predictors. Why do you think this is so? Why would a branch delay instruction perform poorly in a long pipeline?
