Title: Instruction-Level Parallel Processing: History, Overview and Perspective
Authors: Bob Rau and Joseph Fisher
Journal: Journal of Supercomputing, 1993.

What's it about? A high-level overview of the field of ILP and its development over the 60s, 70s and 80s.

What's ILP? A family of processor and compiler design techniques for executing multiple instructions simultaneously in hardware. The paper puts more emphasis on the compiler-side techniques.

Structure of paper
- Why ILP, and the history of its development
- Taxonomy of ILP architectures
- Hardware to support ILP
- Compiler techniques for ILP
- Available ILP

How does ILP differ from other forms of parallel processing like vector/array processors or massively parallel computation? It's invisible to the user.

History and Perspective
- Horizontal microcode - idea similar to CISC. Vertical microcode - RISC.
- Writable memory caught up with read-only memory stores, which made a separate control store unnecessary.
- More transistors on chip => you could do more things on chip.
- CDC 6600 and IBM 360 were pioneers in exploiting ILP in hardware.
- Culler, Multiflow, Cydrome - VLIW.

Taxonomy of ILP
- Superscalar processors: hardware finds parallelism dynamically (scoreboarding, Tomasulo's algorithm, etc.).
- Superpipelining: make the machine deeper instead of wider and increase the clock speed for the same performance.
- Compiler-assisted: the compiler rearranges code to make the hardware's job easier. Ex: the branch-delay slot.
- Dataflow machines: dependencies are explicitly encoded. Computation within a single block doesn't provide enough parallelism.
- VLIW: hardware provides a wide machine; it is the compiler's job to schedule those resources.

HARDWARE support for ILP
- Two approaches: wider (more functional units) vs deeper (more pipelining).
- Overlapped execution: pipelining, superpipelining.
- Single issue vs multiple issue; in-order vs out-of-order; static vs dynamic scheduling.
- Speculation with branch prediction to overcome the parallelism limits placed by basic blocks.
SOFTWARE support for ILP

Scheduling - the design space:
- Control-flow graph: single vs multiple basic blocks; acyclic vs cyclic CFG.
- Underlying machine: finite vs unbounded resources; unit latency vs multi-cycle latencies; simple vs complex resource-usage patterns.
- One pass vs multiple passes.

Local scheduling - locally optimal; synonymous with local code compaction.

Global acyclic scheduling - the sum of local optima != the global optimum. Need to consider the execution frequency of basic blocks.
- Trace scheduling - goal: reduce the length of those sequences of basic blocks frequently executed by the program; requires off-trace code motion. Variant: superblock scheduling.

Cyclic scheduling - software pipelining.

Register allocation - before or after scheduling?

Available ILP
- Limit studies and their shortcomings: they neglect the effect of compiler optimizations on ILP, which leads to pessimistic estimates of the maximum ILP available.

Central question: is there enough ILP in programs? If yes, what must the compiler and hardware do to exploit it?

"Computer architecture is a contract between the class of programs that are written for that architecture and the set of processor implementations of that architecture."

Easier read: VLIW Processors and Trace Scheduling - http://www.cs.utah.edu/~mbinu/coursework/686_vliw/
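The local (single basic block) scheduling idea above can be sketched as a greedy list scheduler: given a dependence DAG, operation latencies, and a machine issue width, issue as many ready operations as possible each cycle. The DAG, the latencies, and the two-wide machine are made-up illustrations; real schedulers also model complex resource-usage patterns, as the notes mention.

```python
def list_schedule(deps, latency, issue_width=2):
    """deps: op -> list of predecessor ops. Returns op -> issue cycle."""
    remaining = {op: len(preds) for op, preds in deps.items()}
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)
    ready_at = {op: 0 for op in deps}  # earliest cycle each op may issue
    schedule, cycle = {}, 0
    while len(schedule) < len(deps):
        # Ops whose predecessors have all issued and whose results are available.
        ready = sorted(op for op in deps
                       if op not in schedule
                       and remaining[op] == 0
                       and ready_at[op] <= cycle)
        for op in ready[:issue_width]:  # issue up to issue_width ops this cycle
            schedule[op] = cycle
            for s in succs[op]:
                remaining[s] -= 1
                ready_at[s] = max(ready_at[s], cycle + latency[op])
        cycle += 1
    return schedule

# Two loads (latency 2) feed an add (latency 1), which feeds a store.
deps = {"a": [], "b": [], "add": ["a", "b"], "st": ["add"]}
latency = {"a": 2, "b": 2, "add": 1, "st": 1}
sched = list_schedule(deps, latency)
```

On a two-wide machine both loads issue in cycle 0, the add must wait for the load latency and issues in cycle 2, and the store follows in cycle 3 - a 4-cycle schedule for the block. This also illustrates why a single basic block offers limited parallelism: cycle 1 issues nothing, which is exactly the gap that global techniques like trace scheduling try to fill with operations from other blocks.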