CS/ECE 552 Spring 2010

CS/ECE 552 : Introduction to Computer Architecture
Spring 2010
Prof. Wood
Problem Set #4

Homework is due at start of class

Problems 1 - 3 MUST be done with your project group (all electronic: handin to “hw4”)

Problems 4 - 7 MUST be done ALONE (all paper)

Problem 8 must also be done ALONE (all electronic: handin to “inst_test”)

No exceptions to the above handin rules will be allowed, as this is already unduly complicated (to grade).

You must abide by the Verilog file naming conventions

All Verilog code must pass Vcheck

Each problem must be in its own directory

If a problem requires files from a different directory, then create a copy of the file in each directory.

Problem 1 - 10 Points

In Verilog, create a register file that includes internal bypassing so that results written in one cycle can be read during the same cycle. Do this by writing an outer "wrapper" module that instantiates your existing (unchanged) register file module; your new module will just add the bypass logic. The list of inputs and outputs of the outer module should be the same as that of the inner module. Submit your Verilog source and your testing results.

Call this module rf_bypass and put it in a file called rf_bypass.v
Modify rf_hier.v from problem3 so that it now instantiates rf_bypass instead of rf.
The inputs and output interface for rf_bypass.v should be identical to rf.v
Use the rf_bypass_bench.v testbench; see its usage instructions.
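
The intended read-during-write behavior can be sketched as a small behavioral model. This is a Python illustration only, not Verilog and not the required solution. The port names (read1, read2, write_reg, write_data, write_en) and the 8 x 16-bit register organization are assumptions made for the example; use the interface of your own rf.v.

```python
class RegFile:
    """Plain register file: a read returns the value stored before this cycle's write."""

    def __init__(self, num_regs=8):  # 8 registers is an assumption
        self.regs = [0] * num_regs

    def cycle(self, read1, read2, write_reg, write_data, write_en):
        # Reads see the old contents; the write takes effect for later cycles.
        out1, out2 = self.regs[read1], self.regs[read2]
        if write_en:
            self.regs[write_reg] = write_data & 0xFFFF
        return out1, out2


class RegFileBypass:
    """Wrapper with the same interface; the inner register file is unchanged."""

    def __init__(self, num_regs=8):
        self.inner = RegFile(num_regs)

    def cycle(self, read1, read2, write_reg, write_data, write_en):
        out1, out2 = self.inner.cycle(read1, read2, write_reg, write_data, write_en)
        # Bypass: if a read port addresses the register being written this
        # cycle, forward the incoming write data instead of the stale value.
        if write_en and read1 == write_reg:
            out1 = write_data & 0xFFFF
        if write_en and read2 == write_reg:
            out2 = write_data & 0xFFFF
        return out1, out2


plain = RegFile()
bypassed = RegFileBypass()
print(plain.cycle(1, 2, 1, 0xABCD, True))     # read of r1 sees the old value
print(bypassed.cycle(1, 2, 1, 0xABCD, True))  # read of r1 sees the new value
```

Note that the wrapper only adds selection logic on the read outputs; the inner module is instantiated as-is, which is the structure the problem asks for.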

What to submit: (Directory name: prob1)

Describe precisely how you augmented your hw3 register file in README.txt
Any modifications to the testbench if required. If you use the testbench provided, electronically submit the text output of the program as rf_bench.out (see 4 below). Modelsim will write the text output to a file called transcript in your project directory.
All your verilog source code.

Problem 2 – 10 Points

Synthesize your register file from homework 3

Synthesis will create the synth directory, which will include rf.syn.v, the area report, the timing report, etc.

What to submit: (Directory name: prob2)

Verilog files from hw3's register file
Add the entire synth directory
Make sure rf.syn.v, and the 4 report files are present (Make sure that in the area report no cell has an area of zero)
In the readme, fill in this info:
1. Total area
2. Worst case slack

Problem 3 – 10 Points

Synthesize your FIFO from homework 3.

Synthesis will create the synth directory, which will include fifo.syn.v, the area report, the timing report, etc.

What to submit: (Directory name: prob3)

Verilog files from hw3's fifo
Add the entire synth directory
Make sure fifo.syn.v, and the 4 report files are present (Make sure that in the area report no cell has an area of zero)
In the readme, fill in this info:
1. Total area
2. Worst case slack

#end group work#

Problem 4 – 15 Points

Consider the following code sequence and the datapath in figure 4.51 on page 362 of COD4e. Assuming the first instruction is fetched in cycle 1 and the branch is not taken, in which cycle does the 'add' instruction write its value to the register file? What if the branch IS taken? (Assume no branch prediction). Show pipeline diagrams.

          beq    $2, $1, loc
          xor    $1, $4, $3
          and    $3, $6, $7
          sub    $7, $5, $8
    loc:  add    $3, $6, $7
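
As a starting point for the pipeline diagrams, here is a minimal Python sketch that lays out an ideal five-stage pipeline with one instruction fetched per cycle. It deliberately does not model stalls or flushes, which are exactly what this problem asks you to reason about, and the instruction labels i1..i3 are placeholders rather than the code above.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_cycles(index, first_cycle=1):
    """Cycle in which instruction `index` (0-based) occupies each stage,
    assuming no stalls and no flushes."""
    return {stage: first_cycle + index + k for k, stage in enumerate(STAGES)}


def diagram(names):
    """Return a text pipeline diagram, one row per instruction."""
    total = len(names) + len(STAGES) - 1  # cycles needed to drain the pipeline
    lines = []
    for i, name in enumerate(names):
        cells = [""] * total
        for stage, cycle in stage_cycles(i).items():
            cells[cycle - 1] = stage
        lines.append(f"{name:<6}" + "".join(f"{c:<4}" for c in cells).rstrip())
    return "\n".join(lines)


print(diagram(["i1", "i2", "i3"]))
```

In this ideal case, the instruction at 0-based position i reaches WB in cycle i + 5. Any stall or flush shifts the affected rows to the right.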

Problem 5 – 15 Points

Indicate all of the true, anti-, and output-dependencies in the following segment of MIPS assembly code:

    sub    $2, $7, $3
    add    $4, $5, $6
    or     $1, $4, $5
    add    $5, $2, $5
    sw     $4, 20($1)
    xor    $4, $1, $4

For the code above, which of the dependencies will manifest themselves as hazards in the pipeline in Figure 4.41 on page 355 of COD4e? How are these hazards resolved in this pipeline? Assuming the 'sub' instruction enters fetch (F) in cycle 1, in what cycle does the 'xor' instruction enter writeback (W)? Show your work in a pipeline diagram. (Assume that the register file cannot read and write the same register in the same cycle and get the new data.)

How does your answer change if you consider the pipeline in figure 4.60, on page 375 of COD4e? (Assume that the register file contains internal bypassing and can read and write the same register in the same cycle and get the new data.)
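
The three kinds of dependence named in this problem can be stated mechanically. The Python sketch below is an illustration on a made-up three-instruction sequence, not the code above. It compares every pair of instructions by register name only, so it does not filter out pairs separated by an intervening write to the same register. The (dest, sources) encoding is an assumption for the example; a store would use None as its destination.

```python
def dependences(instrs):
    """instrs: list of (dest, sources) tuples in program order.
    Returns (kind, earlier_index, later_index, register) tuples."""
    found = []
    for j, (dst_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            dst_i, srcs_i = instrs[i]
            if dst_i is not None and dst_i in srcs_j:
                found.append(("true", i, j, dst_i))    # read after write
            if dst_j is not None and dst_j in srcs_i:
                found.append(("anti", i, j, dst_j))    # write after read
            if dst_i is not None and dst_i == dst_j:
                found.append(("output", i, j, dst_i))  # write after write
    return found


example = [
    ("$1", ("$2", "$3")),  # add $1, $2, $3
    ("$2", ("$1", "$4")),  # sub $2, $1, $4
    ("$1", ("$5", "$6")),  # or  $1, $5, $6
]
for dep in dependences(example):
    print(dep)
```

Which of these dependences become hazards depends on the pipeline, so the list this produces is only the first step of the answer.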

Problem 6 – 10 Points

Consider the pipeline in Figure 4.51 on page 362; assume predict-not-taken for branches and assume a "Hazard detection unit" in the ID stage as shown on page 379. Can an attempt to flush and an attempt to stall occur simultaneously? If so, do they result in conflicting actions and/or cooperating actions? If there are any cooperating actions, how do they work together? If there are any conflicting actions, which should take priority? What would you do in the design to make sure this works correctly? You may want to consider the following code sequence to help you answer this question:

        beq $5, $2, loc  #assume that the branch is taken
        lw  $3, 40($4)
        add $2, $3, $4
        sw  $2, 40($4)
loc:    or  $5, $5, $2

Problem 7 – 15 Points

Consider a pipeline where branches are predicted not-taken, and a taken branch introduces a three-cycle penalty. Suppose you are considering adding a delayed branch slot to your instruction set architecture, so that taken branches would only have a two-cycle penalty. Consider the following three fragments of code:

Fragment 1:

        add $5, $5, $2
        beq $5, $6, Target
        lw $4, 0($2)
        .
        .
        .
Target: lw $1, 0($7)
        ...


Fragment 2:

        add $5, $5, $2
        beq $5, $6, Target
        lw $4, 0($7)
        .
        .
        .
Target: sub $4, $8, $3
        ...


Fragment 3:

        movei $2, 21  // End-of-loop count
        .
        .
        .
        addi $4, $4, 1
        beq $4, $2, Target
        .
        .
        .
Target: ...

Re-arrange or re-write each of the fragments so that it will work correctly with a branch delay slot and maximize performance. (The dots represent an unknown amount of other code that you can't change.) What is the average number of cycles that were saved or lost in each case if you used the delayed branch architecture? (Assume branches are taken 60% of the time.)

While a good idea at the time, branch delay slots are discouraged in modern processors with deep pipelines in favor of dynamic branch predictors. Why do you think this is so? Why would a branch delay instruction perform poorly in a long pipeline?
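
One possible way to set up the averaging is sketched below in Python. The accounting is an assumption, not the official solution: a delay slot filled with useful work from before the branch costs nothing extra, while a slot filled with a nop costs one cycle on every execution of the branch. Slots filled from the target or the fall-through path need their own accounting. The helper name and the slot_overhead parameter are hypothetical.

```python
from fractions import Fraction


def expected_branch_cycles(p_taken, taken_penalty, slot_overhead=0):
    """Average extra cycles per branch: the taken penalty weighted by how
    often the branch is taken, plus any cycles always spent in the slot."""
    return p_taken * taken_penalty + slot_overhead


P_TAKEN = Fraction(60, 100)  # branches are taken 60% of the time

baseline = expected_branch_cycles(P_TAKEN, 3)                   # no delay slot
useful_slot = expected_branch_cycles(P_TAKEN, 2)                # slot does useful work
nop_slot = expected_branch_cycles(P_TAKEN, 2, slot_overhead=1)  # slot holds a nop

print("saved with a useful slot:", baseline - useful_slot)
print("saved with a nop slot:   ", baseline - nop_slot)
```

Exact fractions are used so that the averages are not blurred by floating-point rounding. A negative result means cycles were lost rather than saved.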

Problem 8 – 15 Points

(submit this problem under inst_test, instead of hw4)

Develop instruction level tests for your processor. In this problem each of you will develop a set of small programs that are meant to test whether your processor implements these instructions correctly. You will write these programs in assembly and run them on an instruction emulator to make sure what you wrote is indeed testing the right thing. The eventual goal is to run these programs on your processor's Verilog implementation and use them to test your implementation.

Each of you will be responsible for one instruction and must develop a set of simple programs for that instruction. The table below gives the assignment of instructions to students.

aarti	addi
abrown	subi
ammar	ori
asplund	andi
atishay	roli
ayoung	slli
bechard	rori
brant-ho	srai
brinsko	st
capel	ld
chanson	stu
cofell	add
diedrich	sub
emiller	or
frederic	and
frericks	rol
grigoriy	sll
halbach	ror
hang	sra
hanly	seq
hao	slt
hoese	sle
in	sco
jalal	beqz
jastrows	bnez
jatin	lbi
jimmy	slbi
jmartine	j
joel	jr
kjell	jal
klingens	jalr
langenfe	sll
markus	slt
marsh	slli
martell	bnez
michlig	ori
millican	sle
morrell	bgez
ndimick	srai
nystrom	st
ott	ror
passofar	roli
pdickey	jalr
rezny	addi
samanas	subi
schanke	add
sefiddas	sub
shourjo	jalr
soumphol	beqz
spallett	bnez
swati	sco
varun	sle
vaughn	seq
weisman	slt
wilcox	sll
wyler	sra
xiaofeng	rori

Zignego

To get you started, below are two example tests for the add instruction.

add_0.asm

lbi r1, 255
lbi r2, 255
add r3, r1, r2
halt

add_1.asm

lbi r1, 255
lbi r2, 0
add r3, r1, r2
halt

You will notice one thing. The add test uses the lbi instruction also! Your goal while writing these tests is to isolate your instruction as much as possible and minimize the use of the other instructions. Identify different corner cases and the common case for your instruction and develop a set of simple test programs.

The work flow we will follow is:

Write the test in WISC-SP10 assembly language.
Assemble it using the assembler, assemble.sh
Simulate the test in the simulator and make sure your test is doing what you thought it was doing. Use the simulator: wiscalculator

Below is a short demo:

prompt% assemble.sh add_0.asm
Created the following files
loadfile_0.img  loadfile_1.img  loadfile_2.img  loadfile_3.img  loadfile_all.img  loadfile.lst

prompt% wiscalculator loadfile_all.img

WISCalculator v1.0
Author Derek Hower (drh5@cs.wisc.edu)
Type "help" for more information

Loading program...
Executing...
lbi r1, -1
PC: 0x0002 EPC 0x0000R0 0x0000 R1 0xffff R2 0x0000 R3 0x0000 R4 0x0000 R5 0x0000 R6 0x0000 R7 0x0000
lbi r2, -1
PC: 0x0004 EPC 0x0000R0 0x0000 R1 0xffff R2 0xffff R3 0x0000 R4 0x0000 R5 0x0000 R6 0x0000 R7 0x0000
add r3, r1, r2
PC: 0x0006 EPC 0x0000R0 0x0000 R1 0xffff R2 0xffff R3 0xfffe R4 0x0000 R5 0x0000 R6 0x0000 R7 0x0000
program halted
PC: 0x0008 EPC 0x0000R0 0x0000 R1 0xffff R2 0xffff R3 0xfffe R4 0x0000 R5 0x0000 R6 0x0000 R7 0x0000
Program Finished

prompt%

The simulator will print a trace of each instruction along with the state of the registers. You should examine these to make sure that your test is indeed doing what is expected. For the st instruction you will need to examine memory also.
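
Checking traces by eye gets tedious across several tests, so a small script can help. The Python sketch below is a hypothetical helper, not part of the course tools. It assumes the register dump format shown in the demo above (register name followed by a four-digit hex value) and compares one trace line against the values you expect.

```python
import re

# Matches "R<n> 0x<4 hex digits>", as printed in the demo trace above.
REG_PATTERN = re.compile(r"R(\d)\s+0x([0-9a-fA-F]{4})")


def parse_registers(trace_line):
    """Return {register number: value} for one register dump line."""
    return {int(n): int(v, 16) for n, v in REG_PATTERN.findall(trace_line)}


def mismatches(trace_line, expected):
    """Return {register: (expected, actual)} for every register that differs."""
    actual = parse_registers(trace_line)
    return {reg: (want, actual.get(reg))
            for reg, want in expected.items() if actual.get(reg) != want}


# Register dump after "add r3, r1, r2" in the demo above.
line = ("PC: 0x0006 EPC 0x0000R0 0x0000 R1 0xffff R2 0xffff R3 0xfffe "
        "R4 0x0000 R5 0x0000 R6 0x0000 R7 0x0000")
print(mismatches(line, {1: 0xFFFF, 2: 0xFFFF, 3: 0xFFFE}))  # empty dict: all match
```

This only inspects registers; for instructions such as st you would still need to examine memory separately.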

What you need to do:

Write a set of tests for your instruction. Name them <opcode>_[0,1,2,3,4].asm
Use your discretion to decide how many tests you need
Identify corner cases. Think about possible bugs in the hardware.
Write comments in your assembly code explaining what the test is doing
The goal of this problem is to make sure you understand the ISA and develop targeted tests for the hardware. Understanding the ISA is required before building hardware for it!
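
When picking corner cases it helps to know the expected result before running the simulator. The Python sketch below is a small reference model for the add example only. It assumes 16-bit registers and that lbi sign-extends an 8-bit immediate, which is inferred from the demo trace above (lbi r1, 255 leaves 0xffff in r1); check the ISA specification before relying on it. The helper names and the operand list are illustrative.

```python
MASK16 = 0xFFFF


def sext8(imm):
    """Sign-extend the low 8 bits of imm to a 16-bit value (assumed lbi behavior)."""
    imm &= 0xFF
    return imm | 0xFF00 if imm & 0x80 else imm


def add16(a, b):
    """16-bit add with wraparound; the carry out is discarded."""
    return (a + b) & MASK16


# Candidate operand pairs: both all-ones, one zero, both zero,
# carry into the sign bit, and two negative values.
CASES = [(255, 255), (255, 0), (0, 0), (127, 1), (128, 128)]
for a, b in CASES:
    result = add16(sext8(a), sext8(b))
    print(f"lbi {a:3}, lbi {b:3} -> r3 = 0x{result:04x}")
```

The first two pairs correspond to add_0.asm and add_1.asm. Operands that are not reachable with a single lbi would need additional setup instructions, which is a trade-off against isolating the instruction under test.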

I will make all tests available to everyone, so you can use these to debug and test your Verilog implementation. One of the first things you must do after putting together your full processor is run each of these tests and test each individual instruction.

Submit under “inst_test”:

Save all your assembly files in this directory
A written explanation of what your tests do and a justification of why your set of tests is comprehensive

CS/ECE 552 Introduction to Computer Architecture Spring 2010 Section 1 Instructor David A. Wood and T. A. Tony Nowatzki URL: http://www.cs.wisc.edu/~david/courses/cs552/S10/
