Here's a solution to the midterm with some of the questions summarized:

1. Lookahead adder design

a)
G''(3) = G'(12) * P'(13) * P'(14) * P'(15) + G'(13) * P'(14) * P'(15) + G'(14) * P'(15) + G'(15)
P''(2) = P'(11) * P'(10) * P'(9) * P'(8)
C20 = G''(4) + P''(4) * C16
C32 = G'''(1) + P'''(1) * G'''(0) + P'''(1) * P'''(0) * C0
C3 = G'(2) + P'(2) * G'(1) + P'(2) * P'(1) * G'(0) + P'(2) * P'(1) * P'(0) * C0

b) Design the lookahead and addition logic for an incrementer. The B input and the generates are zero, and C0 = 1.

carry-lookahead block:
C1 = C0 * p0
C2 = C0 * p1 * p0
C3 = C0 * p2 * p1 * p0
P = p3 * p2 * p1 * p0

adder:
sum = a XOR c
p = a

c) Assume that each gate delay is t time units, and all gates are available with up to 4 inputs. How much faster than the adder would the incrementer be?

Assuming XOR gates with delay t:

Adder:
C48 is ready at t (P' and G') + 2t (P'' and G'') + 2t (P''' and G''') + 2t (carry) = 7t
C63 is ready at 7t + 2t + 2t = 11t
Sum63 is ready at 11t + 2t = 13t

Incrementer (assuming first-level propagates are ready at time = t):
C48 is ready at 0 (P') + t (P'') + t (P''') + t (carry) = 3t
C63 is ready at 3t + t + t = 5t
Sum63 is ready at 5t + 2t = 7t

2. System performance question

The base system spends 82% of the time computing and 18% of the time waiting for the disk. Integer instructions (40% of executed instructions) have a CPI of 1, floating-point instructions (30%) have a CPI of 5, and other instructions (30%) have a CPI of 2.

i.
a) The processor is replaced with one that reduces computation time by 35%.
Speedup = 1 / ((1 - 0.82) + 0.82 * 0.65) = 1.40

b) The disk is replaced with one that reduces disk wait time by 85%.
Speedup = 1 / ((1 - 0.18) + 0.18 * 0.15) = 1.18

c) The floating-point CPI is changed to 3.
Average CPI (old) = 0.40 * 1 + 0.30 * 5 + 0.30 * 2 = 2.5
Average CPI (enhanced) = 0.40 * 1 + 0.30 * 3 + 0.30 * 2 = 1.9
Speedup (computation) = 2.5 / 1.9 = 1.316
Speedup = 1 / ((1 - 0.82) + 0.82 / 1.316) = 1.245

ii.
Part a results in the best speedup.

iii.
If the disk was infinitely fast:
Speedup = 1 / (1 - 0.18) = 1.22 <= still slower than part a

If the floating-point computation was infinitely fast:
Average CPI (old) = 2.5 (from part i)
Average CPI (enhanced) = 0.40 * 1 + 0.30 * 0 + 0.30 * 2 = 1
Speedup (computation) = 2.5 / 1 = 2.5
Speedup = 1 / ((1 - 0.82) + 0.82 / 2.5) = 1.969 <= better speedup than part a

3. MIPS ISA

i. In MIPS, why is the offset of a branch instruction from the PC of the next instruction instead of the PC of the current instruction?

The PC of the next instruction can be computed in the first stage of the pipeline. By making the branch target be this value + offset, only one more arithmetic operation is necessary in the event of a branch. If the branch target were PC + offset instead of PC + 4 + offset, 4 would have to be subtracted from the incremented PC before computing the branch target.

ii. What are MIPS SET instructions and why are they useful?

SET instructions are comparison instructions. For the instruction slt r1, r2, r3: if r2 < r3, then r1 = 1, else r1 = 0. They are used to evaluate the conditions of branch instructions so that branch condition testing can be simplified (to simply testing a bit in a register), thus allowing branches to be resolved earlier in the pipeline.

iii. Why are the offsets for branch instructions and displacements for load and store instructions in the MIPS ISA limited to 16 bits?

The opcode and two register addresses use up 16 bits of the 32-bit instruction, leaving 16 bits for the offset. To support longer offsets, longer fixed-length instructions or variable-length instructions would be required.

iv. Why is the branch offset shifted left by 2 bits while the displacement for loads and stores is not shifted?

Because instructions are four bytes long, their addresses always have their two least-significant bits equal to 0, which is what the shift accomplishes. Data needs to be byte-addressable, so displacements are not shifted.

4.
Short answer questions

i. What is system balance?

System balance ensures that particular components of a system do not present a bottleneck for performance. In a balanced system the capabilities of a unit (e.g., its bandwidth) are equal to the capabilities that other units demand of it. This affects the design of the individual components so that no component is either over- or under-designed.

ii. In a typical 5 or 6 stage pipeline, the CPI might be in the range of 1.0 to 1.5. Does this mean that most instructions have a latency of 1 or 2 cycles?

No; in a pipeline, an instruction may be completed every cycle or two due to parallelism and overlap, but the instruction latency is still bound by the length of the pipeline.

iii. Why do conditional branches impact the performance of a pipelined implementation?

Conditional branches present a control hazard that can stall instruction fetch and thus create bubbles in the pipeline.

iv. Three solutions to reduce the impact of branches in a pipeline:

- Delayed branches: change the semantics of the branch to always execute the instruction immediately following the branch regardless of the branch outcome, and have the compiler insert a non-dependent instruction in this branch delay slot.
- Rudimentary static branch prediction: continue executing instructions from the not-taken path of the branch and squash them if the branch is taken.
- Move up the branch resolution point: resolve branches in the ID stage rather than in the MEM or WB stages, for example by using SET instructions followed by simple branch instructions.

5. Pipelining

Register writes occur in the first half of the cycle, reads in the second half. There is a bypass from the output of the EX/MEM latch to the inputs of the ALU.
Instruction sequence:

ADD1: r3 <- r2 + r1
LOAD: r2 <- mem(r3 + r1)
ADD2: r2 <- r2 + r1
SUB:  r4 <- r3 - r1
ADD3: r1 <- r2 + r1

a)

Cycle   IF      ID      EX      MEM     WB
1       ADD1
2       LOAD    ADD1
3       ADD2    LOAD    ADD1
4       SUB     ADD2    LOAD    ADD1
5       SUB     ADD2    bubble  LOAD    ADD1
6       SUB     ADD2    bubble  bubble  LOAD
7       ADD3    SUB     ADD2    bubble  bubble
8               ADD3    SUB     ADD2    bubble
9               ADD3    bubble  SUB     ADD2
10                      ADD3    bubble  SUB
11                              ADD3    bubble
12                                      ADD3

b) Now assume that register writes occur in the SECOND half of the clock cycle and reads occur in the first half. Draw a bypass so that no bubbles are added to the code sequence from part a.

The important thing to realize is that in this case, a bubble would be added if an instruction is in the write-back stage while another instruction that uses the value is in the register-fetch stage. This happens twice in the code sequence: with LOAD and ADD2 in cycle 6, and with ADD2 and ADD3 in cycle 9. A bypass must present the correct value to the input of the ALU when the instruction needing it is in the EX stage. But in this case, the instruction producing the value is no longer in the pipeline when the instruction using the value is in the EX stage. So the bypass must go from the output of the MEM/WB latch to the INPUT of the ID/EX latch. That way, when the instruction enters the EX stage, the bypassed value will still be in the pipeline.

c) Eliminate all bubbles by adding a bypass (independent of the modifications of part b) and by reordering the code sequence.

Both a bypass and a reordered code sequence are required to eliminate all bubbles. The first set of bubbles, after the load instruction, cannot be completely eliminated by a bypass. This is because ADD2, right after it, is dependent, so when ADD2 wants to enter the EX stage, the LOAD will be accessing memory. Therefore a non-dependent instruction must be inserted before ADD2; the only instruction that satisfies this condition is SUB.
There are still two bubbles caused by dependencies between instructions in the WB stage and those wanting to enter the EX stage. Thus a bypass must be added from the output of the MEM/WB latch to the inputs of the ALU.

Here's the reordered code sequence:

ADD1
LOAD
SUB
ADD2
ADD3

6. Datapath question

Implement a LDINC Rs, offset (Rt) instruction. This loads the value at the address given by the sum of the sign-extended offset (bits 15-0) and the value in Rt (bits 25-21) into register Rs (bits 20-16). The value of Rt is also incremented by 4.

The one modification to the datapath that is necessary is a new input on the RegDst mux that takes bits 25-21 of the instruction. This is so that the register specified by Rt can be written to.

Cycle   Functional description          Signals
-------------------------------------------------------------------------------
1       Instruction -> IR               MemRd = 1, ALUSrcA = 0, IorD = 0, IRWrite = 1,
        and PC = PC + 4                 ALUSrcB = 01, ALUOp = 00, PCWrite = 1, PCSource = 00
2       Read Rs and Rt,                 ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00,
        compute branch target           MemRd = 0, PCWrite = 0, IRWrite = 0
3       Compute mem address             ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
4       Access memory                   MemRd = 1, IorD = 1
5       Write back to Rs,               ALUSrcA = 1, ALUSrcB = 01, ALUOp = 00,
        increment Rt                    RegWrite = 1, MemToReg = 1, RegDst = 00
6       Write back Rt                   RegWrite = 1, MemToReg = 0, RegDst = 10
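The question 1 equations can be cross-checked with a small Python sketch. This is my own construction, not part of the exam: the function name is mine, and I use OR for the bit propagate (either OR or XOR works for carry purposes). It adds two 4-bit values with the C1..C3 recurrence and checks the group G/P carry-out against the rippled carry.

```python
def cla_add4(a, b, c0):
    """Add two 4-bit values with carry-in using carry-lookahead equations."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]   # generate: ai AND bi
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]   # propagate: ai OR bi
    c = [c0]
    for i in range(4):                  # Ci+1 = gi + pi * Ci; expanded, these
        c.append(g[i] | (p[i] & c[i]))  # are the C1..C3 equations of part b
    # group generate/propagate, in the same form as the G''/P'' equations
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    assert (G | (P & c0)) == c[4]       # lookahead carry-out matches ripple
    s = [((a >> i) ^ (b >> i) ^ c[i]) & 1 for i in range(4)]  # sum = a XOR b XOR c
    return sum(bit << i for i, bit in enumerate(s)) + (c[4] << 4)
```

With b = 0 and c0 = 1 this degenerates into the incrementer of part b: every generate is 0, so each carry reduces to C0 ANDed with a chain of propagates, and the sum is a XOR c.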
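The question 2 arithmetic can be verified in a few lines of Python (the helper name `amdahl` is mine; the fractions and CPIs come from the question):

```python
def amdahl(fraction, speedup):
    """Overall speedup when `fraction` of the time is sped up by `speedup`."""
    return 1 / ((1 - fraction) + fraction / speedup)

compute, disk = 0.82, 0.18

# i.a) computation time reduced by 35% -> the computing part runs 1/0.65x faster
a = amdahl(compute, 1 / 0.65)              # ~1.40
# i.b) disk wait time reduced by 85%
b = amdahl(disk, 1 / 0.15)                 # ~1.18
# i.c) floating-point CPI changed from 5 to 3
cpi_old = 0.40 * 1 + 0.30 * 5 + 0.30 * 2   # 2.5
cpi_new = 0.40 * 1 + 0.30 * 3 + 0.30 * 2   # 1.9
c = amdahl(compute, cpi_old / cpi_new)     # ~1.245
# iii) limiting cases
disk_limit = amdahl(disk, float("inf"))    # ~1.22, still below part a
fp_limit = amdahl(compute, cpi_old / 1.0)  # average CPI with free FP is 1.0 -> ~1.97
```

The two limiting cases make part ii's answer concrete: even an infinitely fast disk cannot beat part a, while infinitely fast floating point can.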
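Question 3's points about branch offsets can be made concrete with a small sketch (the helper name and the example addresses are illustrative, not from the exam): the target is the already-incremented PC plus the sign-extended 16-bit offset shifted left by 2.

```python
def branch_target(pc, offset16):
    """Target of a MIPS I-format branch whose instruction sits at `pc`."""
    if offset16 & 0x8000:               # sign-extend the 16-bit field
        offset16 -= 0x10000
    return (pc + 4) + (offset16 << 2)   # offset counts instructions, not bytes
```

An offset of all ones (-1) targets the instruction one slot before PC + 4, i.e. the branch itself, which is why tight spin loops encode their own address this way.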
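The cycle counts in question 5 can be reproduced with a simplified scheduling model of my own (it only tracks when each instruction may enter EX, assuming in-order issue, a single instruction per stage, writes in the first half of a cycle and reads in the second half, so a value written back in cycle c is readable in ID in cycle c):

```python
def ex_cycles(prog, exmem_bypass=True, memwb_bypass=False):
    """prog: list of (name, dest, srcs, is_load) in program order.
    Returns (ex_entry, wb_cycle) maps, one cycle number per instruction."""
    ex, wb, producer = {}, {}, {}
    prev_ex = 0
    for i, (name, dest, srcs, is_load) in enumerate(prog):
        cand = max(i + 3, prev_ex + 1)   # no-stall slot: IF = i+1, ID = i+2, EX = i+3
        changed = True
        while changed:                   # re-check every operand after any bump
            changed = False
            for s in srcs:
                if s not in producer:
                    continue             # register not written by this sequence
                j, j_is_load = producer[s]
                ok = cand - 1 >= wb[j]   # read the register file in the last ID cycle
                if exmem_bypass and not j_is_load and cand == ex[j] + 1:
                    ok = True            # ALU result sits in the EX/MEM latch
                if memwb_bypass and cand == ex[j] + 2:
                    ok = True            # result sits in the MEM/WB latch
                if not ok:
                    cand += 1
                    changed = True
        ex[name], wb[name] = cand, cand + 2
        producer[dest] = (name, is_load)
        prev_ex = cand
    return ex, wb

prog_a = [
    ("ADD1", "r3", ("r2", "r1"), False),
    ("LOAD", "r2", ("r3", "r1"), True),
    ("ADD2", "r2", ("r2", "r1"), False),
    ("SUB",  "r4", ("r3", "r1"), False),
    ("ADD3", "r1", ("r2", "r1"), False),
]
# part a: only the EX/MEM -> ALU bypass; ADD3 writes back in cycle 12
ex_a, wb_a = ex_cycles(prog_a)

# part c: SUB hoisted above ADD2 plus a MEM/WB -> ALU bypass; no bubbles,
# so the five instructions finish in 5 + 4 = 9 cycles
prog_c = [prog_a[0], prog_a[1], prog_a[3], prog_a[2], prog_a[4]]
ex_c, wb_c = ex_cycles(prog_c, memwb_bypass=True)
```

For prog_a the model's EX-entry cycles (3, 4, 7, 8, 10) and write-back cycles match the pipeline diagram in part a, and for the reordered sequence with the extra bypass no instruction stalls.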