Homework 4 // Due at Lecture Tues Oct 31
Problem 1
Using Verilog, design an 8-by-16-bit register file.
Figure 1 gives the high-level interface.
It has one write port, two read ports, three register select inputs (two for read and one for write),
a write enable, a reset, and a clock input. All register
state changes occur on the rising edge of the clock. As always, your basic building block
must be the D flip-flop. The read ports should
be entirely combinational logic. Do not use tri-state logic in your design.
                    +--------------------+
                    |                    |
  ReadSel0[2:0] >---|                    |----> DataOut0[15:0]
  ReadSel1[2:0] >---|                    |
                    |                    |
  WriteSel[2:0] >---|                    |
   DataIn[15:0] >---|                    |----> DataOut1[15:0]
                    |                    |
          write >---|                    |
            clk >---|                    |
            rst >---|                    |
                    +--------------------+
Use the following top-level module shell exactly:
module regfile(input clk, rst, write,
input [2:0] ReadSel0, ReadSel1, WriteSel,
input [15:0] DataIn,
output [15:0] DataOut0, DataOut1);
//code here
endmodule
When the write enable is asserted (high), the selected register is
written with the data from the DataIn port. The write occurs on the
next rising clock edge; write data cannot flow through to a read port
during the same cycle. Data is always present on the DataOut
ports, regardless of whether write is high.
The reset signal is synchronous and when asserted
(active high), resets all the register values to 0.
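One way to pin down the timing rules above before writing any Verilog is a small behavioral reference model. The Python sketch below is only an illustration and not part of the required design: the class name RegFileModel and its methods are hypothetical, and it assumes that rst takes priority over write when both are asserted, which the text above does not specify.

```python
class RegFileModel:
    """Behavioral sketch of the register file; a hypothetical test aid, not Verilog."""

    def __init__(self, depth=8, width=16):
        self.mask = (1 << width) - 1
        self.regs = [0] * depth  # the model starts cleared; real flip-flops need rst first

    def read(self, sel):
        # Combinational read port: sees only values already stored.
        return self.regs[sel]

    def rising_edge(self, rst=0, write=0, write_sel=0, data_in=0):
        # Every state change happens here, once per clock edge.
        if rst:
            # Assumption: rst overrides write when both are asserted.
            self.regs = [0] * len(self.regs)
        elif write:
            self.regs[write_sel] = data_in & self.mask


rf = RegFileModel()
rf.rising_edge(rst=1)
# During this cycle write=1, WriteSel=5, DataIn=0x1234 are driven, but the
# read port still shows the old contents because nothing has been clocked yet.
old_value = rf.read(5)
rf.rising_edge(write=1, write_sel=5, data_in=0x1234)
new_value = rf.read(5)
print(hex(old_value), hex(new_value))  # prints "0x0 0x1234"
```

A model like this can supply expected values for a simulation trace; it says nothing about how the flip-flops and read multiplexers should be structured.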
You must use a hierarchical design. Design a 16-bit register
first, and then put 8 of them together with additional logic to build
the register file.
For simulation purposes, any signal that is wider than one bit
should be represented as a bus going into or out of your system. For
a 16-bit bus, there should not be 16 signals on your trace output.
Make sure that every register gets read and written properly, and that
each bit of each register has been both low and high at least once. A
simultaneous read and write on the same register must work properly,
as must a case of read and write at the same cycle but on different
registers.
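The bit-coverage requirement above (every bit of every register both low and high) can be met with a pair of complementary patterns per register. The Python sketch below is one hypothetical way to generate such a write sequence and to check that it covers every bit; the function names and the choice of alternating-bit patterns are illustrative assumptions, not a required test plan.

```python
def toggle_vectors(depth=8, width=16):
    """Yield (write_sel, data_in) pairs that drive each bit of each register high and low."""
    mask = (1 << width) - 1
    pattern = sum(1 << i for i in range(1, width, 2))  # 0xAAAA when width is 16
    for sel in range(depth):
        yield sel, pattern
        yield sel, pattern ^ mask  # complement, 0x5555 when width is 16


def covers_all_bits(vectors, depth=8, width=16):
    """Return True if every bit of every register was written as both 1 and 0."""
    mask = (1 << width) - 1
    seen_high = [0] * depth
    seen_low = [0] * depth
    for sel, data in vectors:
        seen_high[sel] |= data
        seen_low[sel] |= ~data & mask
    return all(h == mask for h in seen_high) and all(l == mask for l in seen_low)


vectors = list(toggle_vectors())
print(len(vectors), covers_all_bits(vectors))  # prints "16 True"
```

Because every register receives the same two values here, this sequence alone would not expose a decoder that writes the wrong register; a register-specific value is still needed for that check.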
For extra credit, you can parameterize the register file so that it
can be an arbitrary width and height using the Verilog parameter
feature. If you choose to implement this, set the default parameters
to 8x16.
You should hand in:
- Electronic copies of all Verilog, DO, and/or Tcl files used
in your design. Submit the files by copying them to
~cs552-2/public/dropbox/HW4/P1/<your login id>.
- Annotated simulation results, in the form of a simulation
wave trace or script output, that shows the design working. If
you implemented the extra credit, please make a note.
- A brief justification of your testing methodology.
Problem 2
In Verilog, create a register file that includes
internal bypassing so that results written in one cycle can be read during
the same cycle. Do this by writing an outer "wrapper" module that instantiates
your existing (unchanged) register file module; your new module will just add
the bypass logic. The list of inputs and outputs of the outer module should be
the same as that of the inner module. Submit your Verilog source and your
testing results.
Hint: Not counting the header, your new module should be no more than about
five or six lines long.
Use the following module header exactly:
module bp_regfile(input clk, rst, write,
input [2:0] ReadSel0, ReadSel1, WriteSel,
input [15:0] DataIn,
output [15:0] DataOut0, DataOut1);
//code here
endmodule
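The read behavior expected from the wrapper can be written down as a small function for computing expected values in a testbench. The Python sketch below is a hypothetical behavioral model, not the Verilog to hand in: bypass_read and its argument names are made up for illustration, and it ignores rst, whose interaction with bypassing this problem does not specify.

```python
def bypass_read(inner_out, read_sel, write, write_sel, data_in):
    """Expected value on one read port of the wrapper.

    inner_out is what the unchanged inner register file drives for read_sel.
    """
    if write and read_sel == write_sel:
        return data_in  # same-cycle write to the register being read
    return inner_out    # otherwise pass the stored value through


# Hypothetical values: register 3 currently holds 0x0000.
print(hex(bypass_read(0x0000, 3, 1, 3, 0xBEEF)))  # prints "0xbeef"
print(hex(bypass_read(0x0000, 3, 1, 4, 0xBEEF)))  # prints "0x0"
print(hex(bypass_read(0x0000, 3, 0, 3, 0xBEEF)))  # prints "0x0"
```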
You should hand in:
- Electronic copies of all Verilog, DO, and/or Tcl files used
in your design. Submit the files by copying them to
~cs552-2/public/dropbox/HW4/P2/<your login id>.
- Annotated simulation results, in the form of a simulation
wave trace or script output, that shows the design working.
- A brief justification of your testing methodology.
Problem 3
Indicate all of the true, anti-, and output-dependences
in the following segment of MIPS assembly code:
xor $1, $2, $3
and $4, $5, $6
sub $7, $4, $5
add $5, $1, $5
or $4, $7, $4
For the code above, which of the dependences
will manifest themselves as hazards in the pipeline in Figure 6.17 on page 395
of COD3e? How are these hazards resolved in this pipeline? Assuming the 'xor'
instruction enters fetch (F) in cycle 1, in what cycle does the 'or'
instruction enter writeback (W)? Show your work in a pipeline diagram.
(Assume that the register file cannot read and write the same register
in the same cycle and get the new data.)
How does your answer change if you consider the pipeline in Figure
6.36 on page 416 of COD3e? (Assume that the register file contains
internal bypassing and can read and write the same register in
the same cycle and get the new data.)
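The three kinds of dependence named in this problem can be found mechanically by comparing destination and source registers pairwise. The Python sketch below illustrates this on a made-up four-instruction sequence (not the one above); the function names and the tuple encoding of instructions are assumptions for illustration, and a pair is reported only when no instruction between the two writes the register in question.

```python
def intervening_write(instrs, i, j, reg):
    """True if some instruction strictly between i and j writes reg."""
    return any(instrs[k][0] == reg for k in range(i + 1, j))


def find_dependences(instrs):
    """instrs: list of (dest, src1, src2) register names in program order.

    Returns a sorted list of (kind, earlier_index, later_index, register).
    """
    deps = set()
    for j in range(len(instrs)):
        dest_j, srcs_j = instrs[j][0], instrs[j][1:]
        for i in range(j):
            dest_i, srcs_i = instrs[i][0], instrs[i][1:]
            if dest_i in srcs_j and not intervening_write(instrs, i, j, dest_i):
                deps.add(("true", i, j, dest_i))    # read after write
            if dest_j in srcs_i and not intervening_write(instrs, i, j, dest_j):
                deps.add(("anti", i, j, dest_j))    # write after read
            if dest_i == dest_j and not intervening_write(instrs, i, j, dest_i):
                deps.add(("output", i, j, dest_i))  # write after write
    return sorted(deps)


# Hypothetical sequence:  add $1,$2,$3 / sub $4,$1,$5 / or $2,$4,$6 / and $1,$7,$8
example = [("$1", "$2", "$3"), ("$4", "$1", "$5"),
           ("$2", "$4", "$6"), ("$1", "$7", "$8")]
for dep in find_dependences(example):
    print(dep)
```

Whether a listed dependence becomes a hazard still depends on the pipeline: stage timing and forwarding decide that, not this register comparison.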
Problem 4
Consider the following code sequence and the
datapath in Figure 6.27 on page 404 of COD3e. Assuming the first instruction is
fetched in cycle 1 and the branch is not taken, in which cycle does the 'and'
instruction write its value to the register file? What if the branch IS taken?
(Assume no branch prediction.) Show pipeline diagrams.
      beq $2, $3, foo
      add $3, $4, $5
      sub $5, $6, $7
      or  $7, $8, $9
foo:  and $5, $6, $7
Problem 5
Consider the pipeline in Figure 6.27 on page 404; assume predict-not-taken
for branches and assume a "Hazard detection unit" in the ID stage as shown on
page 461. Can an attempt to flush and an attempt to stall occur simultaneously?
If so, do they result in conflicting actions and/or cooperating actions?
If there are any cooperating actions, how do they work together?
If there are any conflicting actions, which should take priority?
What would you do in the design to make sure this works correctly?
You may want to consider the following code sequence to help you answer this
question:
         beq $1, $2, TARGET    #assume that the branch is taken
         lw  $3, 40($4)
         add $2, $3, $4
         sw  $2, 40($4)
TARGET:  or  $1, $1, $2
Problem 6
Consider a pipeline where branches are predicted not-taken, and a taken branch
introduces a three-cycle penalty. Suppose you are considering adding a delayed branch
slot to your instruction set architecture, so that taken branches would only
have a two-cycle penalty. Consider the following three fragments of code:
Fragment 1:
         movei $3, 10          // End-of-loop count
         .
         .
         .
         addi $2, $2, 1
         beq $2, $3, Target
         .
         .
         .
Target:  ...
Fragment 2:
         add $2, $2, $8
         beq $2, $3, Target
         lw $4, 0($7)
         .
         .
         .
Target:  sub $4, $5, $6
         ...
Fragment 3:
         add $2, $2, $8
         beq $2, $3, Target
         lw $4, 0($8)
         .
         .
         .
Target:  lw $5, 0($7)
         ...
Re-arrange or re-write each of the fragments so that it will work
correctly with a branch delay slot and maximize performance. (The
dots represent an unknown amount of other code that you can't change.)
What is the average number of cycles that were saved or lost in each
case if you used the delayed branch architecture? (Assume branches
are taken 60% of the time.)
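The averaging asked for above is an expected-value calculation over taken and not-taken executions of the branch. The Python sketch below shows the arithmetic under stated assumptions: a delay slot that holds useful work costs nothing extra, a slot that does no useful work costs one cycle, and the numbers in the usage lines are hypothetical rather than the ones from this problem. The function name and its parameters are made up for illustration.

```python
def avg_branch_penalty(p_taken, taken_penalty, wasted_slot_prob=0.0):
    """Expected extra cycles per executed branch.

    p_taken          -- fraction of executions in which the branch is taken
    taken_penalty    -- cycles lost when the branch is taken
    wasted_slot_prob -- probability that the delay-slot instruction does no
                        useful work (0.0 with no delay slot or a slot filled
                        from before the branch, 1.0 for a nop)
    """
    return p_taken * taken_penalty + wasted_slot_prob * 1


# Hypothetical machine: 50% taken, 4-cycle penalty, 3 cycles with a delay slot.
baseline = avg_branch_penalty(0.5, 4)           # no delay slot
filled = avg_branch_penalty(0.5, 3)             # slot always does useful work
nop_slot = avg_branch_penalty(0.5, 3, 1.0)      # slot always holds a nop
print(baseline, filled, nop_slot)               # prints "2.0 1.5 2.5"
```

The comparison shows why the answer depends on how each fragment fills the slot: a nop in the slot can cost more than the shorter taken penalty saves.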
While a good idea at the time, branch delay slots are discouraged
in modern processors with deep pipelines in favor of dynamic branch
predictors. Why do you think this is so? Why would a branch delay
instruction perform poorly in a long pipeline?