Here's a solution to the midterm with some of the questions summarized:


1.  Lookahead adder design

a)  G'(7) = X7 * Y7
    G''(3) = G'(12) * P'(13) * P'(14) * P'(15) + G'(13) * P'(14) * P'(15) + G'(14) * P'(15) + G'(15)

    P''(1) = P'(4) * P'(5) * P'(6) * P'(7)

    C(48) = C(0) * P'''(0) * P'''(1) * P'''(2) + G'''(0) * P'''(1) * P'''(2) + G'''(1) * P'''(2) + G'''(2)

    C15 = C(12) * P'(12) * P'(13) * P'(14) + G'(12) * P'(13) * P'(14) + G'(13) * P'(14) + G'(14)
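
    The equations in part a can be checked against a bit-by-bit ripple carry.  The following is a
    minimal sketch, not part of the original solution.  It assumes P'(i) = X(i) + Y(i) (the solution
    only spells out G'), and the function names are made up for illustration.

```python
def first_level(x, y, n=64):
    """Per-bit generate and propagate signals.

    Assumption: P'(i) = X(i) OR Y(i); the solution only states G'(i) = X(i) * Y(i).
    """
    g = [((x >> i) & 1) & ((y >> i) & 1) for i in range(n)]
    p = [((x >> i) & 1) | ((y >> i) & 1) for i in range(n)]
    return g, p


def group(g, p):
    """Combine every four signals into one next-level generate/propagate pair."""
    G, P = [], []
    for j in range(0, len(g), 4):
        g0, g1, g2, g3 = g[j:j + 4]
        p0, p1, p2, p3 = p[j:j + 4]
        G.append((g0 & p1 & p2 & p3) | (g1 & p2 & p3) | (g2 & p3) | g3)
        P.append(p0 & p1 & p2 & p3)
    return G, P


def carry_48(x, y, c0):
    """C(48) computed from the third-level signals, as in part a."""
    g1, p1 = first_level(x, y)   # G',   P'   (64 of each)
    g2, p2 = group(g1, p1)       # G'',  P''  (16 of each)
    g3, p3 = group(g2, p2)       # G''', P''' (4 of each)
    return ((c0 & p3[0] & p3[1] & p3[2]) | (g3[0] & p3[1] & p3[2])
            | (g3[1] & p3[2]) | g3[2])


def ripple_carry(x, y, c0, k):
    """Reference model: the carry into bit k, computed one bit at a time."""
    c = c0
    for i in range(k):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        c = (xi & yi) | (c & (xi | yi))
    return c
```

    For any pair of 64-bit operands, carry_48(x, y, c0) should agree with ripple_carry(x, y, c0, 48).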

b)  Assume that each gate delay is T time units, and all gates are available with up to 5 inputs.

    C48: 7T     P' and G' are ready at T, P'' and G'' are ready at 3T,
                P''' and G''' are ready at 5T, C48 is ready at 7T

    C15: 7T     P' and G' are ready at T, P'' and G'' are ready at 3T,
                C12 is ready at 5T, C15 is ready at 7T

    S63: 13T    C48 is ready at 7T, C60 is ready at 9T, C63 is ready at 11T,
                S63 is ready at 13T

c) How much faster is the above adder than a ripple carry adder?

   Ripple carry delay = 64 * 2T = 128T.  128T / 13T = 9.85x faster
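
   The delay bookkeeping from parts b and c can be tallied in a few lines.  This is only a sketch
   of the arithmetic above; the variable names are chosen here for illustration, and each lookahead
   level is taken to cost 2T as in the solution.

```python
T = 1  # one gate delay, in arbitrary time units

first_level = 1 * T                  # P' and G'
second_level = first_level + 2 * T   # P'' and G''
third_level = second_level + 2 * T   # P''' and G'''
c48 = third_level + 2 * T            # carry into bit 48
c60 = c48 + 2 * T                    # carry into bit 60
c63 = c60 + 2 * T                    # carry into bit 63
s63 = c63 + 2 * T                    # sum bit 63

ripple = 64 * 2 * T                  # 2T per bit position, 64 bits
speedup = ripple / s63

print(f"S63 ready at {s63}T, ripple carry at {ripple}T, speedup {speedup:.2f}x")
```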


2. Short questions
  i) What is the primary advantage of fixed-sized opcodes?

     Instruction decode is faster and more efficient.  Control does not need to determine the length
     or position of the opcode in the instruction.

  ii) What is the primary disadvantage of variable-length instructions?

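      Pipelining is more difficult.  With variable-length instructions, the next PC can't be calculated
      simultaneously with instruction fetch, because the length of the current instruction is not known
      until it has been at least partially decoded.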
  
  iii) What is system balance?

        System balance ensures that particular components of a system do not present a bottleneck for
        performance.  In a balanced system the capabilities of a unit (e.g., the bandwidth) are equal
        to the capabilities that other units demand of it.  This affects the design of the individual
        components so that no component is either over- or under-designed.

  iv)  How do we introduce bubbles into a pipeline?

       Pipeline latches prior to the point where the bubble is being inserted are not clocked, thus
       stalling those stages.  Pipeline latches after the bubble point continue to be clocked, and the
       stage into which the bubble is inserted receives zeros for its control signals (which amounts
       to inserting a NOP into that stage).

  v)   What is a microprogram, and how does it differ from a normal machine-language program?

       A microprogram is an FSM implemented by hardware designers using a ROM.  It implements the control
       for a datapath; contents of addresses of a microprogram contain control signal information in order
       to implement the control for a particular stage of an instruction, as well as the address for the
       next microinstruction.  
       
       A machine-language program is implemented in software by end-users.  Unlike a microprogram, 
       instructions in a machine-language program do not contain information regarding the implementation
       of the instruction in the datapath.  To the programmer, a machine-language instruction is the
       smallest atomic unit, whereas in the hardware each machine-language instruction is implemented by
       a number of microinstructions in a microprogram.

3.  MIPS ISA

  i)    In the MIPS instruction set, for Jump instructions, only 26 bits of the target address are
	available in the instruction (the other 6 bits are the opcode).  Why didn't the designers
	choose to provide more than 26 bits of the target address in the instruction?

        Extending the target address to a full 32 bits is not necessary: since the last two bits of the PC
        are always 00, at most 30 bits would ever be needed for the target address.  But providing
        more than 26 bits of the target address would have made the jump instructions larger than 32 bits.
        This would either make pipelining more difficult (if jump instructions were larger than 32 bits and
        all other instructions were 32 bits) or would waste code space (if all instructions were
        lengthened to match the now larger jump instruction size).

  ii)   How is a 32-bit jump target address for a jump instruction calculated?
  
        The 26-bit target address is shifted left by 2 bits to produce bits 27-0 of the new PC.  Bits 31-28
        of the new PC = bits 31-28 of PC+4.
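
        As a sketch of this calculation (the function name jump_target is made up for illustration;
        pc is taken to be the address of the jump instruction itself):

```python
def jump_target(pc, target26):
    """Form the 32-bit jump target from the 26-bit field of a jump instruction.

    pc       -- address of the jump instruction
    target26 -- the 26-bit target field taken from the instruction
    """
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper = pc_plus_4 & 0xF0000000           # bits 31-28 come from PC+4
    lower = (target26 & 0x03FFFFFF) << 2     # bits 27-0: field shifted left by 2
    return upper | lower


print(hex(jump_target(0x00400018, 0x0100000)))   # prints 0x400000
```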

  iii) In MIPS, why is the offset of a branch instruction from the PC of the next instruction
        instead of the PC of the current instruction?

        The PC of the next instruction (PC + 4) is already computed in the first stage of the pipeline.
        By defining the branch target as this value + offset, only one more arithmetic operation
        is necessary in the event of a branch.  If the branch target were PC + offset instead
        of PC + 4 + offset, 4 would first have to be subtracted from the already incremented PC before
        computing the branch target.

  iv) Why is the branch offset shifted left by 2 bits while the displacement for loads and stores is
      not shifted?
        
        Because instructions are four bytes long, their addresses always have their two least-significant 
        bits = 0, which is what the shift accomplishes.  Data needs to be byte-addressable, so displacements
        are not shifted.
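
        The two address calculations from parts iii and iv can be compared side by side.  This is a
        minimal sketch with made-up function names; it assumes the 16-bit immediate is sign-extended
        in both cases, which the answers above do not state explicitly.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """Branch target = (PC + 4) + (offset << 2); the offset counts instructions."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def load_store_address(base, disp16):
    """Load/store address = base + displacement; the displacement counts bytes."""
    return (base + sign_extend16(disp16)) & 0xFFFFFFFF


print(hex(branch_target(0x1000, 3)))        # prints 0x1010
print(hex(load_store_address(0x2000, 3)))   # prints 0x2003
```

        The same immediate value 3 moves a branch by three instructions (12 bytes) past PC + 4 but
        moves a load or store by only three bytes.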


4.  System Performance
    Summary: base system spends 88% of time computing, 12% time waiting for disk.
	     Integer CPI = 1, 40% of instructions executed
	     FP CPI = 4, 30% of instructions executed
	     Other CPI = 2, 30% of instructions executed

    i) a.  Total computation time reduced by 30%
           Speedup = 1/(0.88 * 0.7 + 0.12) = 1.36

       b.  Disk wait time reduced by 80%
	   Speedup = 1/(0.88 + 0.12 * 0.2) = 1.11

       c.  FP CPI reduced to 2.
	   Old CPI = 0.4 * 1 + 0.3 * 4 + 0.3 * 2 = 2.2
	   New CPI = 0.4 * 1 + 0.3 * 2 + 0.3 * 2 = 1.6

	   Computation speedup = 2.2 / 1.6 = 1.375

	   Total speedup = 1 / (0.88 / 1.375 + 0.12) = 1.32

     ii) Modification in part a has the best speedup.

     iii)
       Part b: If disk wait time was 0: speedup = 1/0.88 = 1.14  <= not faster than part a.
       Part c: If FP CPI = 0:
	    New CPI = 0.4 * 1 + 0.3 * 2 = 1, Computation speedup = 2.2 / 1.0 = 2.2

	    Total speedup = 1 / (0.88 / 2.2 + 0.12) = 1.92  <= faster than part a.
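
    The speedups in parts i and iii all follow the same formula, so they can be recomputed with a
    short script.  This is a sketch only; the helper names speedup and cpi are made up here, and the
    instruction mix and time split are those given in the summary above.

```python
def speedup(compute_factor=1.0, disk_factor=1.0, compute=0.88, disk=0.12):
    """Overall speedup when compute and disk time are scaled by the given factors."""
    return 1.0 / (compute * compute_factor + disk * disk_factor)


def cpi(fp_cpi, int_cpi=1, other_cpi=2):
    """Average CPI for the 40% integer / 30% FP / 30% other instruction mix."""
    return 0.4 * int_cpi + 0.3 * fp_cpi + 0.3 * other_cpi


part_a = speedup(compute_factor=0.7)                # computation time reduced by 30%
part_b = speedup(disk_factor=0.2)                   # disk wait time reduced by 80%
part_c = speedup(compute_factor=cpi(2) / cpi(4))    # FP CPI reduced from 4 to 2

limit_b = speedup(disk_factor=0.0)                  # disk wait time eliminated
limit_c = speedup(compute_factor=cpi(0) / cpi(4))   # FP CPI reduced to 0

for name, value in [("a", part_a), ("b", part_b), ("c", part_c),
                    ("b limit", limit_b), ("c limit", limit_c)]:
    print(f"{name}: {value:.2f}")
```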


5.  Pipelining
    
    Register writes occur in the first half of the cycle, reads in the second half.  There are NO bypasses in the pipeline.

    instruction sequence:
    
    R2 <= MEM(R1 + R3)          LOAD
    R3 <= R2 + R4               ADD1
    R1 <= R3 - R4               SUB
    R1 <= R5 + R3               ADD2
    MEM(R4 + R6) <= R5          STORE

a)  Cycles   IF     ID     EX      MEM     WB
    1        LOAD
    2        ADD1   LOAD
    3        SUB    ADD1   LOAD
    4        SUB    ADD1   bubble  LOAD   
    5        SUB    ADD1   bubble  bubble  LOAD
    6        ADD2   SUB    ADD1    bubble  bubble
    7        ADD2   SUB    bubble  ADD1    bubble
    8        ADD2   SUB    bubble  bubble  ADD1
    9        STORE  ADD2   SUB     bubble  bubble
    10              STORE  ADD2    SUB     bubble
    11                     STORE   ADD2    SUB
    12                             STORE   ADD2
    13                                     STORE
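
    The stall pattern in this table can be reproduced with a small scheduling model.  The sketch
    below is an illustration only (the function name schedule and the tuple encoding of the
    instructions are made up here).  It assumes an in-order five-stage pipeline with no bypasses,
    where a consumer may read a register in ID during the same cycle the producer is in WB.

```python
def schedule(program, write_before_read=True):
    """Return the cycle in which each instruction is in EX.

    program           -- list of (name, destination register or None, source registers)
    write_before_read -- True if register writes happen in the first half of a
                         cycle and reads in the second half
    """
    writer_wb = {}   # register -> WB cycle of its most recent writer
    ex_cycles = []
    prev_ex = 2      # the first instruction is in IF, ID, EX in cycles 1, 2, 3
    for name, dest, sources in program:
        ex = prev_ex + 1                      # in order, one instruction per stage
        for reg in sources:
            if reg in writer_wb:
                wb = writer_wb[reg]
                # Last ID cycle in which the consumer can read the new value.
                last_id = wb if write_before_read else wb + 1
                ex = max(ex, last_id + 1)
        if dest is not None:
            writer_wb[dest] = ex + 2          # EX -> MEM -> WB
        ex_cycles.append(ex)
        prev_ex = ex
    return ex_cycles


program = [
    ("LOAD",  "R2", ["R1", "R3"]),
    ("ADD1",  "R3", ["R2", "R4"]),
    ("SUB",   "R1", ["R3", "R4"]),
    ("ADD2",  "R1", ["R5", "R3"]),
    ("STORE", None, ["R4", "R6", "R5"]),
]

ex = schedule(program)
print(ex)            # prints [3, 6, 9, 10, 11]
print(ex[-1] + 2)    # cycle in which STORE is in WB: prints 13
```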
   

b)  Now assume that register writes occur in the SECOND half of the clock cycle and reads occur
    in the first half.  Draw a bypass so that no bubbles are added to the code sequence from part a.

    The important thing to realize is that in this case, a bubble would be added if an instruction
    is in the write-back stage while another instruction that uses the value is in the register fetch
    (ID) stage.  A bypass must present the correct value to the input of the ALU when
    the instruction needing it is in the EX stage.  But in this case, the instruction producing the value
    is no longer in the pipeline when the instruction using the value is in the EX stage.  So the bypass
    must go from the output of the MEM/WB latch to the INPUT of the ID/EX latch.  That way, when the
    instruction enters the EX stage, the bypassed value will still be in the pipeline.

c)  Eliminate all bubbles by adding a bypass (independent of the modifications of part b) and by
    reordering the code sequence.
   
    Both a bypass and a reordered code sequence are required to eliminate all bubbles.  The first
    set of bubbles after the load instruction cannot be completely eliminated by a bypass.  This is
    because ADD1 right after it is dependent, so when it wants to enter the EX stage, the LOAD
    will still be accessing memory.  Therefore a non-dependent instruction must be inserted between
    LOAD and ADD1; the only instruction that satisfies this condition is STORE.

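    There still are two bubbles caused by dependencies between instructions in the EX or WB stage and those
    wanting to enter the EX stage.  (In the reordered sequence, SUB needs R3 from ADD1, which is immediately
    ahead of it, while ADD1 and ADD2 each need a value produced two instructions ahead.)  Thus bypasses
    must be added from the output of the MEM/WB and EX/MEM latches to the inputs of the ALU.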

    Here's the reordered code sequence:

    LOAD
    STORE
    ADD1
    SUB
    ADD2
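
    A quick way to check the reordering is to count bubbles under a simple forwarding model.  The
    sketch below is an illustration, not part of the original solution.  It assumes bypasses from
    both the EX/MEM and MEM/WB latches to the ALU inputs, so an ALU result can be used by the very
    next instruction and a loaded value by the instruction two slots later.  The names ex_cycles and
    bubbles are made up here.

```python
def ex_cycles(program):
    """EX cycle of each instruction, assuming EX/MEM and MEM/WB bypasses to the ALU."""
    ready = {}     # register -> earliest cycle in which a consumer may be in EX
    prev_ex = 2    # the first instruction reaches EX in cycle 3
    result = []
    for name, kind, dest, sources in program:
        ex = prev_ex + 1
        for reg in sources:
            ex = max(ex, ready.get(reg, 0))
        if dest is not None:
            # ALU results are forwarded one cycle later, loaded values two cycles later.
            ready[dest] = ex + (2 if kind == "load" else 1)
        result.append(ex)
        prev_ex = ex
    return result


def bubbles(program):
    """Number of stall cycles compared with an ideal, hazard-free pipeline."""
    return ex_cycles(program)[-1] - (len(program) + 2)


LOAD = ("LOAD", "load", "R2", ["R1", "R3"])
ADD1 = ("ADD1", "alu", "R3", ["R2", "R4"])
SUB = ("SUB", "alu", "R1", ["R3", "R4"])
ADD2 = ("ADD2", "alu", "R1", ["R5", "R3"])
STORE = ("STORE", "store", None, ["R4", "R6", "R5"])

print(bubbles([LOAD, ADD1, SUB, ADD2, STORE]))   # original order: prints 1
print(bubbles([LOAD, STORE, ADD1, SUB, ADD2]))   # reordered: prints 0
```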


6.  Datapath

    Implement a SLT instruction:

    SLT Rd, Rs, Rt.  If the contents of register Rs are less than the contents of register Rt, then 1 is
    written to Rd.  Otherwise, 0 is written to Rd.  The values "0" and "1" are zero-extended to 32 bits.

    Bits: 31-26 = opcode
	  25-21 = Register Rs
	  20-16 = Register Rt
	  15-11 = Register Rd
	  5-0   = function field

    Changes to datapath: A zero-extending unit is added that takes bit 31 of ALUout (the sign bit of the ALU
    output) as its input.  A third input is added to the MemToReg mux, which takes the output of the new
    zero-extender as its input.  The MemToReg signal is widened to 2 bits.

    Control:

    Cycle     Functional Description       Signals
    -------------------------------------------------------------------------------------------
    1         Instruction -> IR            MemRd = 1, ALUSrcA = 0, IorD = 0, IRWrite = 1
              and PC = PC + 4              ALUSrcB = 01, ALUOp = 00, PCWrite = 1, PCSource = 00

    2         Read Rs and Rt,              ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00,
              compute branch target        MemRd = 0, PCWrite = 0, IRWrite = 0

    3         Subtract Rs - Rt             ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10

    4         Write to Rd                  MemToReg = 10, RegDst = 0, RegWrite = 1
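
    The effect of cycles 3 and 4 can be modeled in a few lines.  This is a sketch with a made-up
    function name.  It follows the datapath change described above (bit 31 of ALUout, zero-extended)
    and, like that description, does not handle the case where Rs - Rt overflows.

```python
MASK32 = 0xFFFFFFFF


def slt_result(rs_value, rt_value):
    """Value written to Rd: bit 31 of (Rs - Rt), zero-extended to 32 bits.

    Operands are 32-bit two's-complement values given as unsigned Python ints.
    Overflow of the subtraction is ignored, as in the datapath description.
    """
    alu_out = (rs_value - rt_value) & MASK32   # cycle 3: ALU subtracts Rs - Rt
    sign_bit = (alu_out >> 31) & 1             # bit 31 of ALUout
    return sign_bit                            # cycle 4: zero-extended, written to Rd


print(slt_result(3, 5))            # prints 1
print(slt_result(5, 3))            # prints 0
print(slt_result(0xFFFFFFFF, 1))   # -1 < 1: prints 1
```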