Here's a solution to the midterm with some of the questions summarized:

1. Lookahead adder design

a)
   G'(7)  = X7 * Y7
   G''(3) = G'(12) * P'(13) * P'(14) * P'(15) + G'(13) * P'(14) * P'(15) + G'(14) * P'(15) + G'(15)
   P''(1) = P'(4) * P'(5) * P'(6) * P'(7)
   C(48)  = C(0) * P'''(0) * P'''(1) * P'''(2) + G'''(0) * P'''(1) * P'''(2) + G'''(1) * P'''(2) + G'''(2)
   C15    = C(12) * P'(12) * P'(13) * P'(14) + G'(12) * P'(13) * P'(14) + G'(13) * P'(14) + G'(14)

b) Assume that each gate delay is T time units, and all gates are available with up to 5 inputs.

   C48: 7T
      P' and G' are ready at T, P'' and G'' are ready at 3T, P''' and G''' are ready at 5T, C48 is ready at 7T.
   C15: 7T
      P' and G' are ready at T, P'' and G'' are ready at 3T, C12 is ready at 5T, C15 is ready at 7T.
   S63: 13T
      C48 is ready at 7T, C60 is ready at 9T, C63 is ready at 11T, S63 is ready at 13T.

c) How much faster is the above adder than a ripple carry adder?

   Ripple carry delay = 64 * 2T = 128T.
   128T / 13T = 9.85x faster

2. Short questions

i) What is the primary advantage of fixed-sized opcodes?

   Instruction decode is faster and more efficient. Control does not need to determine the length or position of the opcode in the instruction.

ii) What is the primary disadvantage of variable-length instructions?

   Pipelining is more difficult. With variable-length instructions, the next PC can't be calculated simultaneously with instruction fetch.

iii) What is system balance?

   System balance ensures that particular components of a system do not present a bottleneck for performance. In a balanced system the capabilities of a unit (e.g., the bandwidth) are equal to the capabilities that other units demand of it. This affects the design of the individual components so that no component is either over- or under-designed.

iv) How do we introduce bubbles into a pipeline?

   Pipeline latches prior to the point where the bubble is being inserted are not clocked, thus stalling those stages.
   Pipeline latches after the bubble point continue to be clocked, and the stage in which the bubble is inserted receives zeros for its control signals (thus inserting a NOP into that stage).

v) What is a microprogram, and how does it differ from a normal machine-language program?

   A microprogram is an FSM implemented by hardware designers using a ROM. It implements the control for a datapath: each address of the microprogram holds the control signals for one step of an instruction, as well as the address of the next microinstruction. A machine-language program is implemented in software by end-users. Unlike a microprogram, instructions in a machine-language program do not contain information regarding the implementation of the instruction in the datapath. To the programmer, a machine-language instruction is the smallest unit of a program, whereas in hardware each machine-language instruction is implemented by a number of microinstructions in a microprogram.

3. MIPS ISA

i) In the MIPS instruction set, for Jump instructions, only 26 bits of the target address are available in the instruction (the other 6 bits are the opcode). Why didn't the designers choose to provide more than 26 bits of the target address in the instruction?

   Extending the target address to a full 32 bits is not necessary: the last two bits of the PC are always 00, so at most 30 bits would be needed for the target address. But providing more than 26 bits of the target address would have made jump instructions larger than 32 bits. This would either make pipelining more difficult (if jump instructions were larger than 32 bits and all other instructions were 32 bits) or waste code space (if all instructions were lengthened to match the now larger jump instruction size).

ii) How is a 32-bit jump target address for a jump instruction calculated?
   The 26-bit target address is shifted left by 2 bits to produce bits 27-0 of the new PC. Bits 31-28 of the new PC = bits 31-28 of PC+4.

iii) In MIPS, why is the offset of a branch instruction from the PC of the next instruction instead of the PC of the current instruction?

   The PC of the next instruction can be computed in the first stage of the pipeline. By making the branch target this value + offset, only one more arithmetic operation is necessary in the event of a branch. If the branch target were PC + offset instead of PC + 4 + offset, 4 would first have to be subtracted from the already incremented PC before computing the branch target.

iv) Why is the branch offset shifted left by 2 bits while the displacement for loads and stores is not shifted?

   Because instructions are four bytes long, their addresses always have their two least-significant bits = 0, which is what the shift accomplishes. Data needs to be byte-addressable, so displacements are not shifted.

4. System Performance

Summary: base system spends 88% of time computing, 12% of time waiting for disk.

   Integer  CPI = 1, 40% of instructions executed
   FP       CPI = 4, 30% of instructions executed
   Other    CPI = 2, 30% of instructions executed

i)
   a. Total computation time reduced by 30%
      Speedup = 1 / (0.88 * 0.7 + 0.12) = 1.36

   b. Disk wait time reduced by 80%
      Speedup = 1 / (0.88 + 0.12 * 0.2) = 1.11

   c. FP CPI reduced to 2.
      Old CPI = 0.4 * 1 + 0.3 * 4 + 0.3 * 2 = 2.2
      New CPI = 0.4 * 1 + 0.3 * 2 + 0.3 * 2 = 1.6
      Computation speedup = 2.2 / 1.6 = 1.375
      Total speedup = 1 / (0.88 / 1.375 + 0.12) = 1.32

ii) Modification in part a has the best speedup.

iii)
   Part b: If disk wait time was 0:
      Speedup = 1 / 0.88 = 1.14, which is not faster than part a.

   Part c: If FP CPI = 0:
      New CPI = 0.4 * 1 + 0.3 * 2 = 1
      Computation speedup = 2.2 / 1.0 = 2.2
      Total speedup = 1 / (0.88 / 2.2 + 0.12) = 1.92, which is faster than part a.

5. Pipelining

Register writes occur in the first half of the cycle, reads in the second half.
There are NO bypasses in the pipeline.

Instruction sequence:

   R2 <= MEM(R1 + R3)     LOAD
   R3 <= R2 + R4          ADD1
   R1 <= R3 - R4          SUB
   R1 <= R5 + R3          ADD2
   MEM(R4 + R6) <= R5     STORE

a)
   Cycle   IF      ID      EX      MEM     WB
   ----------------------------------------------
    1      LOAD
    2      ADD1    LOAD
    3      SUB     ADD1    LOAD
    4      SUB     ADD1    bubble  LOAD
    5      SUB     ADD1    bubble  bubble  LOAD
    6      ADD2    SUB     ADD1    bubble  bubble
    7      ADD2    SUB     bubble  ADD1    bubble
    8      ADD2    SUB     bubble  bubble  ADD1
    9      STORE   ADD2    SUB     bubble  bubble
   10              STORE   ADD2    SUB     bubble
   11                      STORE   ADD2    SUB
   12                              STORE   ADD2
   13                                      STORE

b) Now assume that register writes occur in the SECOND half of the clock cycle and reads occur in the first half. Draw a bypass so that no bubbles are added to the code sequence from part a.

   The important thing to realize is that in this case, a bubble would be added if an instruction is in the write-back stage while another instruction that uses the value is in the register fetch stage. A bypass must present the correct value to the input of the ALU when the instruction needing it is in the EX stage. But in this case, the instruction producing the value is no longer in the pipeline when the instruction using the value is in the EX stage. So the bypass must go from the output of the MEM/WB latch to the INPUT of the ID/EX latch. That way, when the instruction enters the EX stage, the bypassed value will still be in the pipeline.

c) Eliminate all bubbles by adding a bypass (independent of the modifications of part b) and by reordering the code sequence.

   Both a bypass and a reordered code sequence are required to eliminate all bubbles. The first set of bubbles after the load instruction cannot be completely eliminated by a bypass. This is because ADD1 right after it is dependent, so when it wants to enter the EX stage, the LOAD will be accessing memory. Therefore a non-dependent instruction must be inserted before ADD1; the only instruction that satisfies this condition is STORE. There still are two bubbles caused by dependencies between instructions in the EXE or WB stage and those wanting to enter the EX stage.
   Thus bypasses must be added from the output of the MEM/WB and EXE/MEM latches to the inputs of the ALU.

   Here's the reordered code sequence:

      LOAD
      STORE
      ADD1
      SUB
      ADD2

6. Datapath

Implement a SLT instruction: SLT Rd, Rs, Rt. If the contents of register Rs is less than the contents of register Rt, then 1 is written to Rd. Otherwise, 0 is written to Rd. The values "0" and "1" are zero-extended to 32-bits.

Bits:
   31-26 = opcode
   25-21 = Register Rs
   20-16 = Register Rt
   15-11 = Register Rd
   5-0   = function field

Changes to datapath:

   A zero-extending unit is added that takes bit 31 of ALUout (the sign bit of the ALU output) as its input.
   A third input is added to the MemToReg mux, which takes the output of the new zero-extender as its input.
   The MemToReg signal is widened to 2 bits.

Control:

Cycle  Functional Description     Signals
-------------------------------------------------------------------------------------------
1      Instruction -> IR          MemRd = 1, ALUSrcA = 0, IorD = 0, IRWrite = 1
       and PC = PC + 4            ALUSrcB = 01, ALUOp = 00, PCWrite = 1, PCSource = 00
2      Read Rs and Rt,            ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00,
       compute branch target      MemRd = 0, PCWrite = 0, IRWrite = 0
3      Subtract Rs - Rt           ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
4      Write to Rd                MemToReg = 10, RegDst = 0, RegWrite = 1
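
As a rough check of the SLT datapath change in problem 6, here is a minimal Python sketch of what cycles 3 and 4 compute: a 32-bit subtract followed by zero-extending bit 31 of the result. The function names (to_signed32, slt_datapath, slt_reference) are made up for illustration and are not part of the exam. The sketch assumes the ALU wraps to 32 bits and that no overflow correction is applied, which is all the solution above describes.

```python
MASK32 = 0xFFFFFFFF  # keep values to 32 bits, as a register would


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def slt_datapath(rs, rt):
    """Model the described datapath (assumption: plain 32-bit wraparound)."""
    alu_out = (rs - rt) & MASK32   # cycle 3: ALU computes Rs - Rt
    return (alu_out >> 31) & 1     # cycle 4: bit 31, zero-extended, goes to Rd


def slt_reference(rs, rt):
    """What SLT is specified to return: 1 if Rs < Rt as signed values."""
    return 1 if to_signed32(rs) < to_signed32(rt) else 0


if __name__ == "__main__":
    for rs, rt in [(3, 5), (5, 3), (7, 7), (-4 & MASK32, 2), (0x80000000, 1)]:
        print(hex(rs), hex(rt), slt_datapath(rs, rt), slt_reference(rs, rt))
```

The two functions agree whenever Rs - Rt does not overflow. In the last test pair (Rs = 0x80000000, Rt = 1) the subtraction overflows, so the sign bit alone reports 0 while the signed comparison is true; the exam solution does not address that case.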
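
The speedups in problem 4 can also be re-derived mechanically. The sketch below is only a convenience for checking the arithmetic; the names (COMPUTE_FRAC, DISK_FRAC, MIX, average_cpi, speedup, fp_speedup) are hypothetical, and the model assumes, as the solution does, that total time is just compute time plus disk wait time.

```python
COMPUTE_FRAC = 0.88  # fraction of base time spent computing
DISK_FRAC = 0.12     # fraction of base time spent waiting for disk

# Instruction mix from the problem summary: class -> (fraction, CPI)
MIX = {"int": (0.40, 1), "fp": (0.30, 4), "other": (0.30, 2)}


def average_cpi(mix):
    """Weighted-average CPI over the instruction mix."""
    return sum(frac * cpi for frac, cpi in mix.values())


def speedup(compute_scale=1.0, disk_scale=1.0):
    """Overall speedup when each time component is scaled by a factor."""
    return 1.0 / (COMPUTE_FRAC * compute_scale + DISK_FRAC * disk_scale)


def fp_speedup(new_fp_cpi):
    """Overall speedup when only the FP CPI changes."""
    new_mix = dict(MIX, fp=(MIX["fp"][0], new_fp_cpi))
    return speedup(compute_scale=average_cpi(new_mix) / average_cpi(MIX))


if __name__ == "__main__":
    print(f"a: {speedup(compute_scale=0.7):.2f}")        # a: 1.36
    print(f"b: {speedup(disk_scale=0.2):.2f}")           # b: 1.11
    print(f"c: {fp_speedup(2):.2f}")                     # c: 1.32
    print(f"b limit: {speedup(disk_scale=0.0):.2f}")     # b limit: 1.14
    print(f"c limit: {fp_speedup(0):.2f}")               # c limit: 1.92
```

Expressing each change as a scale factor on its time component keeps parts a, b, and c on the same formula, so the limiting cases in part iii are just the same calls with a factor of zero.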