Here's a solution to the midterm with some of the questions summarized:

1. Lookahead adder design

a)
G''(3) = G'(12) * P'(13) * P'(14) * P'(15) + G'(13) * P'(14) * P'(15) + G'(14) * P'(15) + G'(15)
P''(2) = P'(11) * P'(10) * P'(9) * P'(8)
C20 = G''(4) + P''(4) * C16
C32 = G'''(1) + P'''(1) * G'''(0) + P'''(1) * P'''(0) * C0
C3 = G'(2) + P'(2) * G'(1) + P'(2) * P'(1) * G'(0) + P'(2) * P'(1) * P'(0) * C0

b) Design the lookahead and addition logic for an incrementer. The B input and the generates are zero, and C0 = 1.

carry-lookahead block:
C1 = C0 * p0
C2 = C0 * p1 * p0
C3 = C0 * p2 * p1 * p0
P = p3 * p2 * p1 * p0

adder:
sum = a XOR c
p = a

c) Assume that each gate delay is t time units, and all gates are available with up to 4 inputs. How much faster than the adder would the incrementer be?

Assuming XOR gates with delay t:

Adder:
C48 is ready at t (P' and G') + 2t (P'' and G'') + 2t (P''' and G''') + 2t (carry) = 7t
C63 is ready at 7t + 2t + 2t = 11t
Sum63 is ready at 11t + 2t = 13t

Incrementer (assuming first-level propagates are ready at time = t):
C48 is ready at 0 (P') + t (P'') + t (P''') + t (carry) = 3t
C63 is ready at 3t + t + t = 5t
Sum63 is ready at 5t + 2t = 7t

2. System performance question

The base system spends 82% of the time computing and 18% of the time waiting for the disk. Integer instructions (40% of executed instructions) have a CPI of 1, floating-point instructions (30%) have a CPI of 5, and other instructions (30%) have a CPI of 2.

i.
a) The processor is replaced with one that reduces computation time by 35%.
Speedup = 1 / ((1 - 0.82) + 0.82 * 0.65) = 1.40

b) The disk is replaced with one that reduces disk wait time by 85%.
Speedup = 1 / ((1 - 0.18) + 0.18 * 0.15) = 1.18

c) The floating-point CPI is changed to 3.
Average CPI (old) = 0.40 * 1 + 0.30 * 5 + 0.30 * 2 = 2.5
Average CPI (enhanced) = 0.40 * 1 + 0.30 * 3 + 0.30 * 2 = 1.9
Speedup (computation) = 2.5 / 1.9 = 1.316
Speedup = 1 / ((1 - 0.82) + 0.82 / 1.316) = 1.245

ii.
Part a results in the best speedup.

iii.
If the disk was infinitely fast:
Speedup = 1 / (1 - 0.18) = 1.22 <= still slower than part a

If the floating-point computation was infinitely fast:
Average CPI (old) = 2.5 (from part i)
Average CPI (enhanced) = 0.40 * 1 + 0.30 * 0 + 0.30 * 2 = 1
Speedup (computation) = 2.5 / 1 = 2.5
Speedup = 1 / ((1 - 0.82) + 0.82 / 2.5) = 1.969 <= better speedup than part a

3. MIPS ISA

i. In MIPS, why is the offset of a branch instruction from the PC of the next instruction instead of the PC of the current instruction?

The PC of the next instruction can be computed in the first stage of the pipeline. By making the branch target be this value + offset, only one more arithmetic operation is necessary in the event of a branch. If the branch target were PC + offset instead of PC + 4 + offset, 4 would have to be subtracted from the incremented PC before computing the branch target.

ii. What are MIPS SET instructions and why are they useful?

SET instructions are comparison instructions. For the instruction slt r1, r2, r3: if r2 < r3, then r1 = 1, else r1 = 0. They are used to evaluate the conditions of branch instructions so that branch condition testing can be simplified (to simply testing a bit in a register), thus allowing branches to be resolved earlier in the pipeline.

iii. Why are the offsets for branch instructions and displacements for load and store instructions in the MIPS ISA limited to 16 bits?

The opcode and two register addresses use up 16 bits of the 32-bit instruction, leaving 16 bits for the offset. To support longer offsets, longer fixed-length instructions or variable-length instructions would be required.

iv. Why is the branch offset shifted left by 2 bits while the displacement for loads and stores is not shifted?

Because instructions are four bytes long, their addresses always have their two least-significant bits equal to 0, which is what the shift accomplishes. Data needs to be byte-addressable, so displacements are not shifted.

4.
Short answer questions

i. What is system balance?

System balance ensures that particular components of a system do not present a bottleneck for performance. In a balanced system the capabilities of a unit (e.g., its bandwidth) are equal to the capabilities that other units demand of it. This affects the design of the individual components so that no component is either over- or under-designed.

ii. In a typical 5 or 6 stage pipeline, the CPI might be in the range of 1.0 to 1.5. Does this mean that most instructions have a latency of 1 or 2 cycles?

No; in a pipeline, an instruction may be completed every cycle or two due to parallelism and overlap, but the instruction latency is still bound by the length of the pipeline.

iii. Why do conditional branches impact the performance of a pipelined implementation?

Conditional branches present a control hazard that can stall instruction fetch and thus create bubbles in the pipeline.

iv. Three solutions to reduce the impact of branches in a pipeline:

- Delayed branches: change the semantics of the branch to always execute the instruction immediately following the branch regardless of the branch outcome, and have the compiler insert a non-dependent instruction in this branch delay slot.
- Rudimentary static branch prediction: continue executing instructions from the not-taken path of the branch and squash them if the branch is taken.
- Move up the branch resolution point: resolve branches in the ID stage rather than in the MEM or WB stages, for example by using SET instructions followed by simple branch instructions.

5. Pipelining

Register writes occur in the first half of the cycle, reads in the second half. There is a bypass from the output of the EX/MEM latch to the inputs of the ALU.
Instruction sequence:

ADD1: r3 <- r2 + r1
LOAD: r2 <- mem(r3 + r1)
ADD2: r2 <- r2 + r1
SUB:  r4 <- r3 - r1
ADD3: r1 <- r2 + r1

a)

Cycle   IF      ID      EX      MEM     WB
1       ADD1
2       LOAD    ADD1
3       ADD2    LOAD    ADD1
4       SUB     ADD2    LOAD    ADD1
5       SUB     ADD2    bubble  LOAD    ADD1
6       SUB     ADD2    bubble  bubble  LOAD
7       ADD3    SUB     ADD2    bubble  bubble
8               ADD3    SUB     ADD2    bubble
9               ADD3    bubble  SUB     ADD2
10                      ADD3    bubble  SUB
11                              ADD3    bubble
12                                      ADD3

b) Now assume that register writes occur in the SECOND half of the clock cycle and reads occur in the first half. Draw a bypass so that no bubbles are added to the code sequence from part a.

The important thing to realize is that in this case, a bubble would be added if an instruction is in the write-back stage while another instruction that uses the value is in the register-fetch stage. This happens twice in the code sequence: with LOAD and ADD2 in cycle 6, and with ADD2 and ADD3 in cycle 9. A bypass must present the correct value to the input of the ALU when the instruction needing it is in the EX stage. But in this case, the instruction producing the value is no longer in the pipeline when the instruction using the value is in the EX stage. So the bypass must go from the output of the MEM/WB latch to the INPUT of the ID/EX latch. That way, when the instruction enters the EX stage, the bypassed value will still be in the pipeline.

c) Eliminate all bubbles by adding a bypass (independent of the modifications of part b) and by reordering the code sequence.

Both a bypass and a reordered code sequence are required to eliminate all bubbles. The first set of bubbles, after the load instruction, cannot be completely eliminated by a bypass. This is because ADD2, right after it, is dependent, so when ADD2 wants to enter the EX stage, the LOAD will be accessing memory. Therefore a non-dependent instruction must be inserted before ADD2; the only instruction that satisfies this condition is SUB.
There are still two bubbles caused by dependencies between instructions in the WB stage and those wanting to enter the EX stage. Thus a bypass must be added from the output of the MEM/WB latch to the inputs of the ALU.

Here's the reordered code sequence:

ADD1
LOAD
SUB
ADD2
ADD3

6. Datapath question

Implement a LDINC Rs, offset (Rt) instruction. This loads the value at the address given by the sum of the sign-extended offset (bits 15-0) and the value in Rt (bits 25-21) into register Rs (bits 20-16). The value of Rt is also incremented by 4.

The one modification to the datapath that is necessary is a new input on the RegDst mux that takes bits 25-21 of the instruction. This is so that the register specified by Rt can be written to.

Cycle   Functional description          Signals
-------------------------------------------------------------------------------
1       Instruction -> IR               MemRd = 1, ALUSrcA = 0, IorD = 0, IRWrite = 1,
        and PC = PC + 4                 ALUSrcB = 01, ALUOp = 00, PCWrite = 1, PCSource = 00
2       Read Rs and Rt,                 ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00,
        compute branch target           MemRd = 0, PCWrite = 0, IRWrite = 0
3       Compute mem address             ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
4       Access memory                   MemRd = 1, IorD = 1
5       Write back to Rs,               ALUSrcA = 1, ALUSrcB = 01, ALUOp = 00,
        increment Rt                    RegWrite = 1, MemToReg = 1, RegDst = 00
6       Write back Rt                   RegWrite = 1, MemToReg = 0, RegDst = 10
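The question 1 equations can be cross-checked with a small Python sketch. This is my own construction, not part of the exam: the function name is mine, and I use OR for the bit propagate (either OR or XOR works for carry purposes). It adds two 4-bit values with the C1..C3 recurrence and checks the group G/P carry-out against the rippled carry.

```python
def cla_add4(a, b, c0):
    """Add two 4-bit values with carry-in using carry-lookahead equations."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]   # generate: ai AND bi
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]   # propagate: ai OR bi
    c = [c0]
    for i in range(4):                  # Ci+1 = gi + pi * Ci; expanded, these
        c.append(g[i] | (p[i] & c[i]))  # are the C1..C3 equations of part b
    # group generate/propagate, in the same form as the G''/P'' equations
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    assert (G | (P & c0)) == c[4]       # lookahead carry-out matches ripple
    s = [((a >> i) ^ (b >> i) ^ c[i]) & 1 for i in range(4)]  # sum = a XOR b XOR c
    return sum(bit << i for i, bit in enumerate(s)) + (c[4] << 4)
```

With b = 0 and c0 = 1 this degenerates into the incrementer of part b: every generate is 0, so each carry reduces to C0 ANDed with a chain of propagates, and the sum is a XOR c.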
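The question 2 arithmetic can be verified in a few lines of Python (the helper name `amdahl` is mine; the fractions and CPIs come from the question):

```python
def amdahl(fraction, speedup):
    """Overall speedup when `fraction` of the time is sped up by `speedup`."""
    return 1 / ((1 - fraction) + fraction / speedup)

compute, disk = 0.82, 0.18

# i.a) computation time reduced by 35% -> the computing part runs 1/0.65x faster
a = amdahl(compute, 1 / 0.65)              # ~1.40
# i.b) disk wait time reduced by 85%
b = amdahl(disk, 1 / 0.15)                 # ~1.18
# i.c) floating-point CPI changed from 5 to 3
cpi_old = 0.40 * 1 + 0.30 * 5 + 0.30 * 2   # 2.5
cpi_new = 0.40 * 1 + 0.30 * 3 + 0.30 * 2   # 1.9
c = amdahl(compute, cpi_old / cpi_new)     # ~1.245
# iii) limiting cases
disk_limit = amdahl(disk, float("inf"))    # ~1.22, still below part a
fp_limit = amdahl(compute, cpi_old / 1.0)  # average CPI with free FP is 1.0 -> ~1.97
```

The two limiting cases make part ii's answer concrete: even an infinitely fast disk cannot beat part a, while infinitely fast floating point can.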
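Question 3's points about branch offsets can be made concrete with a small sketch (the helper name and the example addresses are illustrative, not from the exam): the target is the already-incremented PC plus the sign-extended 16-bit offset shifted left by 2.

```python
def branch_target(pc, offset16):
    """Target of a MIPS I-format branch whose instruction sits at `pc`."""
    if offset16 & 0x8000:               # sign-extend the 16-bit field
        offset16 -= 0x10000
    return (pc + 4) + (offset16 << 2)   # offset counts instructions, not bytes
```

An offset of all ones (-1) targets the instruction one slot before PC + 4, i.e. the branch itself, which is why tight spin loops encode their own address this way.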
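The cycle counts in question 5 can be reproduced with a simplified scheduling model of my own (it only tracks when each instruction may enter EX, assuming in-order issue, a single instruction per stage, writes in the first half of a cycle and reads in the second half, so a value written back in cycle c is readable in ID in cycle c):

```python
def ex_cycles(prog, exmem_bypass=True, memwb_bypass=False):
    """prog: list of (name, dest, srcs, is_load) in program order.
    Returns (ex_entry, wb_cycle) maps, one cycle number per instruction."""
    ex, wb, producer = {}, {}, {}
    prev_ex = 0
    for i, (name, dest, srcs, is_load) in enumerate(prog):
        cand = max(i + 3, prev_ex + 1)   # no-stall slot: IF = i+1, ID = i+2, EX = i+3
        changed = True
        while changed:                   # re-check every operand after any bump
            changed = False
            for s in srcs:
                if s not in producer:
                    continue             # register not written by this sequence
                j, j_is_load = producer[s]
                ok = cand - 1 >= wb[j]   # read the register file in the last ID cycle
                if exmem_bypass and not j_is_load and cand == ex[j] + 1:
                    ok = True            # ALU result sits in the EX/MEM latch
                if memwb_bypass and cand == ex[j] + 2:
                    ok = True            # result sits in the MEM/WB latch
                if not ok:
                    cand += 1
                    changed = True
        ex[name], wb[name] = cand, cand + 2
        producer[dest] = (name, is_load)
        prev_ex = cand
    return ex, wb

prog_a = [
    ("ADD1", "r3", ("r2", "r1"), False),
    ("LOAD", "r2", ("r3", "r1"), True),
    ("ADD2", "r2", ("r2", "r1"), False),
    ("SUB",  "r4", ("r3", "r1"), False),
    ("ADD3", "r1", ("r2", "r1"), False),
]
# part a: only the EX/MEM -> ALU bypass; ADD3 writes back in cycle 12
ex_a, wb_a = ex_cycles(prog_a)

# part c: SUB hoisted above ADD2 plus a MEM/WB -> ALU bypass; no bubbles,
# so the five instructions finish in 5 + 4 = 9 cycles
prog_c = [prog_a[0], prog_a[1], prog_a[3], prog_a[2], prog_a[4]]
ex_c, wb_c = ex_cycles(prog_c, memwb_bypass=True)
```

For prog_a the model's EX-entry cycles (3, 4, 7, 8, 10) and write-back cycles match the pipeline diagram in part a, and for the reordered sequence with the extra bypass no instruction stalls.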