(3.5.1) Thread Level Speculation

J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. A Scalable Approach to Thread-Level Speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.

Parallelizing non-numeric and irregular numeric code is difficult (complex control flow and memory access patterns)

TLS : automatically parallelize code
Epochs : each is time-stamped with an epoch number (gives the ordering)
Track data dependences between epochs
Homefree token : when an epoch is guaranteed no violations, it holds the token and can commit.
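
A minimal sketch of the epoch ordering and homefree token described above, assuming a purely sequential software model. The names `Epoch` and `commit_in_order` are hypothetical, and re-execution of a violated epoch is only stood in for by clearing a flag.

```python
from dataclasses import dataclass


@dataclass
class Epoch:
    number: int              # epoch number: lower means logically earlier
    violated: bool = False   # set if a dependence violation was detected
    committed: bool = False


def commit_in_order(epochs):
    """Pass the homefree token from the earliest epoch to the latest.

    Returns the epoch numbers in the order in which they commit.
    """
    order = []
    for epoch in sorted(epochs, key=lambda e: e.number):
        # This epoch now holds the token: every earlier epoch has committed,
        # so nothing can violate it any more.
        if epoch.violated:
            # Hypothetical stand-in for re-executing the epoch.
            epoch.violated = False
        epoch.committed = True
        order.append(epoch.number)
    return order
```

Commit order follows epoch numbers regardless of the order in which the epochs were created or finished.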

Objectives
     large-scale parallel machine (built from single-chip multiprocessors or SMT processors) should seamlessly perform TLS across the entire machine > communication diff
     no recompilation 

Detect data dependence violations at run time > leverage invalidation-based cache coherence.
     if an invalidation arrives from a logically-earlier epoch for a line that we have speculatively loaded > dependence violation, speculation fails.
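
The violation check above can be sketched as follows. This is an illustrative software model, not the hardware design: `SpeculativeCache` and its methods are hypothetical names, and a cache line is represented only by its address.

```python
class SpeculativeCache:
    """Toy model of one epoch's cache for violation detection."""

    def __init__(self, epoch_number):
        self.epoch_number = epoch_number
        self.spec_loaded = set()   # addresses of speculatively loaded lines
        self.violated = False

    def speculative_load(self, line_addr):
        # Remember that this epoch read the line speculatively.
        self.spec_loaded.add(line_addr)

    def on_invalidation(self, line_addr, sender_epoch):
        """Handle an invalidation tagged with the sender's epoch number.

        Returns True once this epoch has suffered a violation.
        """
        from_earlier_epoch = sender_epoch < self.epoch_number
        if from_earlier_epoch and line_addr in self.spec_loaded:
            # A logically-earlier epoch wrote a line we already read.
            self.violated = True
        return self.violated
```

An invalidation for a line that was never speculatively loaded, or one sent by a logically-later epoch, does not mark a violation in this model.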

Speculation level : the cache level at which speculation occurs

What's required
     (i) a notion of whether a cache line has been speculatively loaded and/or modified
     (ii) a guarantee that a speculative cache line will not be propagated to regular memory
          speculation fails if such a cache line is replaced!
     (iii) an ordering of all speculative memory references (epoch numbers and homefree token)

Hardware 
Cache states
     Apart from Dirty, Shared, Exclusive and Invalid: speculatively loaded and speculatively modified.
     A speculative line is not evicted until its epoch becomes homefree (if it must be evicted, speculation fails).
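
A small sketch of the eviction rule for the speculative states, assuming the two speculative states are tracked as independent bits alongside the usual coherence state. `SpecState` and `eviction_fails_speculation` are hypothetical names.

```python
from enum import Flag


class SpecState(Flag):
    NONE = 0            # line holds no speculative state
    SPEC_LOADED = 1     # speculatively loaded
    SPEC_MODIFIED = 2   # speculatively modified


def eviction_fails_speculation(state, epoch_is_homefree):
    """Return True if evicting a line in `state` makes speculation fail."""
    speculative_bits = SpecState.SPEC_LOADED | SpecState.SPEC_MODIFIED
    is_speculative = bool(state & speculative_bits)
    # Once the epoch is homefree its lines are no longer speculative in effect.
    return is_speculative and not epoch_is_homefree
```
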
Messages 
     read-exclusive-speculative, invalidation-speculative, upgrade-request-speculative
     + epoch number of the requester
     > these are only hints; the receiver is not obliged to act on them.

When speculation succeeds
     Avoid scanning all lines to change speculative states back to normal.
     Ownership required buffer (ORB) > gets an entry when a line becomes both speculatively modified and shared.
     When the homefree token arrives, generate an upgrade request for each entry in the ORB.
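
The ORB bookkeeping above might be sketched like this. It is a software illustration only: `OwnershipRequiredBuffer`, `note_line`, and `on_homefree` are hypothetical names, and an upgrade request is modelled as a plain tuple.

```python
class OwnershipRequiredBuffer:
    """Toy ORB: remembers lines that still need exclusive ownership."""

    def __init__(self):
        self.entries = []   # line addresses, in insertion order

    def note_line(self, line_addr, spec_modified, shared):
        # Only lines that are both speculatively modified and shared
        # need an upgrade before their data can be committed.
        if spec_modified and shared and line_addr not in self.entries:
            self.entries.append(line_addr)

    def on_homefree(self):
        """Generate one upgrade request per entry, then empty the buffer."""
        requests = [("upgrade-request", addr) for addr in self.entries]
        self.entries.clear()
        return requests
```

Commit work is proportional to the number of ORB entries rather than to the number of cache lines, which is the point of not scanning the whole cache.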

Optimizations
     Forwarding data between epochs : use wait-signal synchronization
     Dirty and speculatively loaded state : a speculatively dirty line is never evicted anyway, so if you load and then modify, just store it in the DSpL state
     Suspend epochs that have violations (resume when they get the homefree token)
     Support for multiple writers : combine results (ah!) using fine-grained SM bits (per byte/word in the cache line)
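
A rough sketch of the multiple-writers idea: each epoch keeps its own version of a line plus one speculatively-modified bit per word, and versions are combined in epoch order. The function name `merge_line` and the tuple layout are assumptions made for illustration.

```python
def merge_line(base, versions):
    """Combine speculative versions of one cache line.

    base:     list of words holding the committed contents of the line
    versions: list of (epoch_number, words, sm_bits) tuples, where
              sm_bits[i] is True if that epoch modified word i

    Versions are applied from the earliest epoch to the latest, so when
    two epochs wrote the same word the logically-later value wins.
    """
    merged = list(base)
    for _, words, sm_bits in sorted(versions, key=lambda v: v[0]):
        for i, modified in enumerate(sm_bits):
            if modified:
                merged[i] = words[i]
    return merged
```

Words that no epoch marked as modified keep their committed value, so stale data in an epoch's private copy never overwrites another epoch's store.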

support in SMT machines
     (i) two epochs may not modify the same line
     (ii) a logically-earlier epoch (t-) cannot see a logically-later epoch's (t+) modification
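
The two SMT constraints can be expressed as checks on a cache shared by several epochs. This is only a sketch under the assumption that each line records the epoch number of its single speculative writer; `SharedSpecCache` is a hypothetical name, and what the hardware does when a check fails is left to the caller.

```python
class SharedSpecCache:
    """Toy model of one cache shared by several epochs on an SMT core."""

    def __init__(self):
        self.writer = {}   # line address -> epoch number of its speculative writer

    def try_store(self, line_addr, epoch):
        """Constraint (i): at most one epoch may modify a given line."""
        owner = self.writer.get(line_addr)
        if owner is not None and owner != epoch:
            return False   # another epoch already modified this line
        self.writer[line_addr] = epoch
        return True

    def may_hit(self, line_addr, epoch):
        """Constraint (ii): an epoch must not see a later epoch's store."""
        owner = self.writer.get(line_addr)
        return owner is None or owner <= epoch
```
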
      

Conclusions
     8-75% speedups! 
     ORB overhead low (used to commit speculative modifications faster)