Title: Instruction-Level Parallel Processing: History, Overview and Perspective
Authors: Bob Rau and Joseph Fisher
Journal: Journal of Supercomputing, 1993.

What's it about? A high-level overview of the field of ILP and its development over the 60s, 70s and 80s.

What's ILP? A family of processor and compiler design techniques for executing multiple instructions simultaneously in hardware. The paper puts more emphasis on the compiler-side techniques.

Structure of paper
- Why ILP, and the history of its development
- Taxonomy of ILP architectures
- Hardware to support ILP
- Compiler techniques for ILP
- Available ILP

How does ILP differ from other forms of parallel processing like vector/array processors or massively parallel computation? It's invisible to the user.

History and Perspective
- Horizontal microcode - idea similar to CISC. Vertical microcode - RISC.
- Writable memory caught up with read-only memory stores, which made a separate control store unnecessary.
- More transistors on chip => you could do more things on chip.
- CDC 6600 and IBM 360 were pioneers in exploiting ILP in hardware.
- Culler, Multiflow, Cydrome - VLIW.

Taxonomy of ILP
- Superscalar processors: hardware finds parallelism dynamically (scoreboarding, Tomasulo's algorithm, etc.).
- Superpipelining: make the machine deeper instead of wider and increase the clock speed for the same performance.
- Compiler-assisted: the compiler rearranges code to make the hardware's job easier. Ex: the branch-delay slot.
- Dataflow machines: dependencies are explicitly encoded. Computation within a single block doesn't provide enough parallelism.
- VLIW: hardware provides a wide machine; it is the compiler's job to schedule those resources.

HARDWARE support for ILP
- Two approaches: wider (more functional units) vs deeper (more pipelining).
- Overlapped execution: pipelining, superpipelining.
- Single issue vs multiple issue; in-order vs out-of-order; static vs dynamic scheduling.
- Speculation with branch prediction to overcome the parallelism limits placed by basic blocks.
SOFTWARE support for ILP

Scheduling - the design space:
- Control-flow graph: single vs multiple basic blocks; acyclic vs cyclic CFG.
- Underlying machine: finite vs unbounded resources; unit latency vs multi-cycle latencies; simple vs complex resource-usage patterns.
- One pass vs multiple passes.

Local scheduling - locally optimal; synonymous with local code compaction.

Global acyclic scheduling - the sum of local optima != the global optimum. Need to consider the execution frequency of basic blocks.
- Trace scheduling - goal: reduce the length of those sequences of basic blocks frequently executed by the program; requires off-trace code motion. Variant: superblock scheduling.

Cyclic scheduling - software pipelining.

Register allocation - before or after scheduling?

Available ILP
- Limit studies and their shortcomings: they neglect the effect of compiler optimizations on ILP, which leads to pessimistic estimates of the maximum ILP available.

Central question: is there enough ILP in programs? If yes, what must the compiler and hardware do to exploit it?

"Computer architecture is a contract between the class of programs that are written for that architecture and the set of processor implementations of that architecture."

Easier read: VLIW Processors and Trace Scheduling - http://www.cs.utah.edu/~mbinu/coursework/686_vliw/
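The local (single basic block) scheduling idea above can be sketched as a greedy list scheduler: given a dependence DAG, operation latencies, and a machine issue width, issue as many ready operations as possible each cycle. The DAG, the latencies, and the two-wide machine are made-up illustrations; real schedulers also model complex resource-usage patterns, as the notes mention.

```python
def list_schedule(deps, latency, issue_width=2):
    """deps: op -> list of predecessor ops. Returns op -> issue cycle."""
    remaining = {op: len(preds) for op, preds in deps.items()}
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)
    ready_at = {op: 0 for op in deps}  # earliest cycle each op may issue
    schedule, cycle = {}, 0
    while len(schedule) < len(deps):
        # Ops whose predecessors have all issued and whose results are available.
        ready = sorted(op for op in deps
                       if op not in schedule
                       and remaining[op] == 0
                       and ready_at[op] <= cycle)
        for op in ready[:issue_width]:  # issue up to issue_width ops this cycle
            schedule[op] = cycle
            for s in succs[op]:
                remaining[s] -= 1
                ready_at[s] = max(ready_at[s], cycle + latency[op])
        cycle += 1
    return schedule

# Two loads (latency 2) feed an add (latency 1), which feeds a store.
deps = {"a": [], "b": [], "add": ["a", "b"], "st": ["add"]}
latency = {"a": 2, "b": 2, "add": 1, "st": 1}
sched = list_schedule(deps, latency)
```

On a two-wide machine both loads issue in cycle 0, the add must wait for the load latency and issues in cycle 2, and the store follows in cycle 3 - a 4-cycle schedule for the block. This also illustrates why a single basic block offers limited parallelism: cycle 1 issues nothing, which is exactly the gap that global techniques like trace scheduling try to fill with operations from other blocks.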