(2.3.3) Compiler Technology for Future Microprocessors

W. W. Hwu, R. E. Hank, D. M. Gallagher, S. A. Mahlke, D. M. Lavery, G. E. Haab, J. C. Gyllenhaal, and D. I. August. Compiler Technology for Future Microprocessors. Proceedings of the IEEE 83(12), December 1995. IEEE Xplore link



Paper talks about : 1995
Critical components of tech which deals with difficult problems that are encountered when compiling programs for a high degree of ILP.  [ In particular : predicated compile and dependence info maintainance ]

Compilers improve program exec speed by
     eliminating unnec instr
     putting stuff used most in registers. (reduce memory accesses). 
     Make use of hardware parallelism requried.

To parallelise a program (ILP compilation)
     Dependency extraction
     ILP optimizations
     Core Scheduling (reorder the instructions) 

Definitions (and why)
basic blocks : control in control out block
Control flow graph : control dependence
Register data dependence :  flow dependence, anti-dependence (WAW), loop-independence (within the same iterations in a loop), loop-carried (over the loop boundary) == register output dependence.
Memory data dependence :  loop independent memory anti-dependence : inside loop unnecessary wait due to overwrite fear.
loop carried memory antidependence. memory input dependence : 2 instr read from the same memory location (can remove redundant loads)

Go beyond basic blocks : Scheduling

Superblock : formed from the trace by dupilcating blocks to remove merge points and their associated complications from the trace.
reorder instr with the superblock : global acyclic (in the control flow graph) scheduling.
Speculative code motion  : move instructions.

Techniques : Register renaming/Loop unrolling / software pipelining (cyclic scheduling). 
for (i=1) to bignumber A(i); B(i); C(i) end >> A(i) A(i+1) A(i+2) B(i)... in software. 

Predicated Execution
Transformations of conditional branches into predicates : control depencences > to data dependences. 
>>> Predicated exec allows the compiler to trade instr fetch efficiency for the capability to expose ILP to the hardware along multiple execution paths. 

Hyperblock : connected basic blocks where control may only enter in the first block, and exit anywhere. 
(tradeoff : branch performance with resource usage, hazard conditions and critical path lengths).

Branch combining replaces a group of exit branches with a corresponding group of predicate define instr. now OR-type sematics can be used, once you hit the exit code, find out why you got to the exit code, jump to that exit destination. 

Loop peeling : compiler peels away first several iterations of a loop and combined with the outer loop. 

Control height reduction : dependence chain length to compute execution contidtions of instr