Title: Complexity-Effective Superscalar Processors
Authors: Subbarao Palacharla, Norman P. Jouppi and J. E. Smith
Conf: ISCA 1997

Context of the paper, Motivation/Problems looked at, Overview of mechanisms proposed, Trade-offs, Interesting results/take-away points

Context: Early superscalar processors; performance trade-off between hardware complexity and clock speed.

Motivation:
1) Brainiacs vs. speed demons. Two possibly conflicting goals: maximize instructions in flight (issue window) and maximize clock frequency. Study mechanisms that lead to increased ILP and their impact on the clock, i.e. look at mechanisms on the critical path.
2) Analyze hardware structures at the micro-architectural level. Characterize complexity w.r.t. implementation parameters (underlying technology) and micro-architectural parameters (window size, issue width).

Details: Paper looks at instruction dispatch and issue logic, and data bypass logic. The logic associated with these is likely to be a key limiter of clock speed. (Still true?) Considers a baseline superscalar model without reorder buffers.

Proposals: Dependence-based architecture that groups dependent instructions rather than independent ones. Not the major contribution of the paper.

Complexity Analysis:
Basic structures looked at:
Insn Dispatch - Register rename logic
Insn Issue - Wakeup logic, Selection logic
Data Bypass - Bypass logic

Methodology: First, representative CMOS circuits for these hardware structures are selected. Second, the circuits are optimized for speed.

Register Rename logic: RAM based or CAM based. CAM is less scalable because the number of CAM entries (= no. of physical regs) increases with issue width (??). Window size is not a factor in rename delay; issue width affects the delay through its impact on wire lengths.
** Wire delays will become increasingly important as feature sizes are reduced **

Wakeup logic: 2*IssueWidth comparators per instruction in the issue window (two source operands, each compared against every result tag broadcast in a cycle). Issue width has a greater impact on the delay than window size.
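The comparator count for wakeup can be checked with a small behavioral sketch. This is only an illustration of the counting argument, not the circuit from the paper: the names `WindowEntry`, `wakeup` and `issuable` are made up for these notes, and it assumes every entry has exactly two source-operand slots and that `issue_width` result tags are broadcast per cycle.

```python
from dataclasses import dataclass, field


@dataclass
class WindowEntry:
    """One issue-window slot with two source-operand tags (None = operand not needed)."""
    src_tags: tuple
    ready: list = field(init=False)

    def __post_init__(self):
        # An operand slot with no tag has nothing to wait for.
        self.ready = [tag is None for tag in self.src_tags]


def wakeup(window, result_tags):
    """Broadcast result tags to every entry; return how many tag comparisons were made."""
    comparisons = 0
    for entry in window:
        for i, src in enumerate(entry.src_tags):
            for tag in result_tags:
                comparisons += 1  # models one comparator per (operand slot, result bus)
                if src is not None and src == tag:
                    entry.ready[i] = True
    return comparisons


def issuable(window):
    """Entries whose operands are all ready, i.e. candidates for selection."""
    return [e for e in window if all(e.ready)]


# Hypothetical example: issue width 4, three instructions in the window.
issue_width = 4
window = [WindowEntry((1, 2)), WindowEntry((3, None)), WindowEntry((5, 6))]
n_comparisons = wakeup(window, result_tags=[1, 2, 3, 9])
candidates = issuable(window)
```

With these inputs the comparison count is 2 * issue_width * len(window), which is the per-entry 2*IssueWidth figure multiplied by the window size.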
Selection logic: Tree-based scheme. Delay increases logarithmically with window size. Total delay scales well with feature size.

Data Bypass logic: Number of bypass paths grows quadratically with issue width. Bypass delay also grows quadratically with issue width.

Complexity-Effective Microarchitectures: Nice idea. FIFOs containing dependent instructions. Determine dependencies early. Lose out on performance (IPC) but get a better clock speed.

Clustering Dependence-based Microarchitecture:
- Single Window, Execution-Driven Steering
- Two Windows, Dispatch-Driven Steering
- Two Windows, Random Steering

Take-away points:
1) Window logic and data bypass are on the critical path. As you improve these, other structures could become critical.
2) Design complexity-effective structures, i.e. those that give good ILP while facilitating a faster clock.
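To make the "FIFOs containing dependent instructions" idea concrete, here is a rough sketch of dispatch-time steering. It is a simplified heuristic written for these notes, not the exact rule from the paper: the function name `steer` and the instruction encoding `(dest, srcs)` are assumptions, registers are assumed to be already renamed (each `dest` is unique), and FIFOs are never drained, so it only shows how dependence chains end up in the same FIFO.

```python
def steer(instructions, num_fifos):
    """Steer instructions (in program order) into FIFOs of dependent chains.

    instructions: list of (dest, srcs) tuples, with srcs a tuple of register names.
    Returns (fifos, stalled_at): fifos holds instruction indices per FIFO;
    stalled_at is the index of the first instruction with no usable FIFO, or None.
    """
    fifos = [[] for _ in range(num_fifos)]
    tail_dest = [None] * num_fifos  # register produced by the tail of each FIFO

    for idx, (dest, srcs) in enumerate(instructions):
        # Prefer a FIFO whose tail instruction produces one of our sources,
        # so the new instruction queues up directly behind its producer.
        target = next(
            (f for f in range(num_fifos)
             if tail_dest[f] is not None and tail_dest[f] in srcs),
            None,
        )
        if target is None:
            # No producer at any tail: start a new chain in an empty FIFO.
            target = next((f for f in range(num_fifos) if not fifos[f]), None)
        if target is None:
            return fifos, idx  # dispatch would stall here in this simplified model
        fifos[target].append(idx)
        tail_dest[target] = dest

    return fifos, None


# Hypothetical instruction stream using renamed registers.
program = [
    ("r1", ()),            # i0: independent
    ("r2", ("r1",)),       # i1: depends on i0
    ("r3", ()),            # i2: independent
    ("r4", ("r2", "r3")),  # i3: depends on i1 and i2
    ("r5", ("r3",)),       # i4: depends on i2
]
fifos, stalled_at = steer(program, num_fifos=3)
```

Because each FIFO holds a chain of dependent instructions, only the instruction at the head of each FIFO needs to be considered for issue, which is where the savings in window logic come from.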