J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. A Scalable Approach to Thread-Level Speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
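TLS (thread-level speculation): automatically parallelize code
Epochs: each is time-stamped with an epoch number, which defines the logical ordering
Track data dependences between epochs
Homefree token: once an epoch is guaranteed to have no violations, it holds the token and can commit.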
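A minimal sketch of the ordering idea above, in Python. Everything here is an assumption for illustration and not the paper's hardware: the `Epoch` class, its `finished` and `violated` flags, and the `commit_in_order` helper are hypothetical names. It only shows that the token moves through epochs in epoch-number order and stalls at the first epoch that cannot commit.

```python
from dataclasses import dataclass


@dataclass
class Epoch:
    number: int             # epoch number: lower means logically earlier
    finished: bool = False  # hypothetical flag: the epoch's work is done
    violated: bool = False  # hypothetical flag: a dependence violation was seen


def commit_in_order(epochs):
    """Pass a (hypothetical) homefree token through epochs in logical order.

    Returns the epoch numbers that commit. The token stops at the first
    epoch that is unfinished or has suffered a violation.
    """
    committed = []
    for epoch in sorted(epochs, key=lambda e: e.number):
        if not epoch.finished or epoch.violated:
            break  # the token waits here; later epochs cannot commit yet
        committed.append(epoch.number)
    return committed


epochs = [
    Epoch(2, finished=True),
    Epoch(0, finished=True),
    Epoch(1, finished=True, violated=True),
]
order = commit_in_order(epochs)
```

In this example only epoch 0 commits: epoch 1 was violated, so epoch 2 has to wait even though it finished.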
Objectives
Large-scale parallel machines (built from single-chip multiprocessors or SMT processors): seamlessly perform TLS across the entire machine, despite differing communication latencies
no recompilation
Detect data dependence violations at run time by leveraging invalidation-based cache coherence.
If an invalidation arrives from a logically-earlier epoch for a line that we have speculatively loaded, speculation has failed (a dependence violation).
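The violation rule above can be written as a single predicate. This is a sketch under assumptions: epoch numbers are modeled as plain integers where smaller means logically earlier, and `is_violation` is a hypothetical helper name, not something from the paper.

```python
def is_violation(inval_epoch, my_epoch, speculatively_loaded):
    """Return True if an incoming invalidation signals a dependence violation.

    Assumes integer epoch numbers, with a smaller number meaning a
    logically-earlier epoch.
    """
    from_earlier_epoch = inval_epoch < my_epoch
    return speculatively_loaded and from_earlier_epoch


# Epoch 5 speculatively loaded a line; epoch 3 then writes it.
hit = is_violation(inval_epoch=3, my_epoch=5, speculatively_loaded=True)
# An invalidation from a logically-later epoch is not a violation.
miss = is_violation(inval_epoch=7, my_epoch=5, speculatively_loaded=True)
```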
Speculation level: the cache level at which speculation occurs
What's required
(i) a notion of whether a cache line has been speculatively loaded and/or modified
(ii) a guarantee that a speculative cache line will not be propagated to regular memory
Speculation fails if a speculative cache line is replaced!
(iii) an ordering of all speculative memory references (epoch numbers and homefree token)
Hardware
Cache states
Apart from Dirty, Shared, Exclusive, and Invalid: speculatively loaded and speculatively modified.
A speculative line cannot be evicted until its epoch becomes homefree (if eviction is unavoidable, speculation fails).
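One way to picture the extra cache state and the eviction rule, as a Python sketch. The names `BaseState`, `CacheLine`, and `try_evict` are hypothetical, and modeling the speculative information as two booleans on top of a base state is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class BaseState(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"
    DIRTY = "D"


@dataclass
class CacheLine:
    state: BaseState
    spec_loaded: bool = False    # line was speculatively loaded
    spec_modified: bool = False  # line was speculatively modified

    @property
    def speculative(self):
        return self.spec_loaded or self.spec_modified


def try_evict(line, epoch_is_homefree):
    """Apply the eviction rule: speculative lines must stay until homefree."""
    if line.speculative and not epoch_is_homefree:
        return "speculation_failed"  # forced replacement of a speculative line
    return "evicted"


spec_line = CacheLine(BaseState.SHARED, spec_loaded=True)
plain_line = CacheLine(BaseState.DIRTY)
```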
Messages
read-exclusive-speculative, invalidation-speculative, upgrade-request-speculative
+ epoch number of the requester
These are only hints; the receiver is not obliged to act on them.
When speculation succeeds
Avoid scanning all lines and changing speculative states back to normal
Ownership required buffer (ORB): a line is added when it becomes both speculatively modified and shared.
When the homefree token arrives, generate an upgrade request for each entry in the ORB.
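The ORB behavior described above can be sketched as a small buffer of line addresses. This is an illustration under assumptions: the class and method names are hypothetical, and upgrade requests are modeled as simple `("upgrade", address)` tuples.

```python
class OwnershipRequiredBuffer:
    """Hypothetical model of an ORB: remembers lines that still need ownership."""

    def __init__(self):
        self._lines = []  # addresses of speculatively modified, shared lines

    def note_line(self, address, spec_modified, shared):
        """Record a line once it is both speculatively modified and shared."""
        if spec_modified and shared and address not in self._lines:
            self._lines.append(address)

    def on_homefree(self):
        """On receiving the homefree token, emit one upgrade request per entry."""
        requests = [("upgrade", address) for address in self._lines]
        self._lines.clear()
        return requests


orb = OwnershipRequiredBuffer()
orb.note_line(0x40, spec_modified=True, shared=True)
orb.note_line(0x80, spec_modified=True, shared=False)  # already owned: skipped
orb.note_line(0x40, spec_modified=True, shared=True)   # duplicate: ignored
requests = orb.on_homefree()
```

The point of the buffer is that commit only touches the lines listed in it, instead of walking the whole cache.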
Optimizations
Forwarding data between epochs: use wait/signal synchronization
Dirty and speculatively loaded state: a speculative dirty line is never evicted anyway, so if you load and then modify, just keep the line in the DSpL state
Suspend epochs that have violations (resume when they get the homefree token)
Support for multiple writers: combine results using fine-grained SM bits (per byte/word in a cache line)
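A sketch of how results from multiple writers might be combined with per-word SM bits. The function name `merge_line`, the tuple layout `(epoch_number, words, sm_bits)`, and the rule that a logically-later epoch wins on overlapping words are assumptions for illustration.

```python
def merge_line(committed, writers):
    """Combine speculative versions of one cache line, word by word.

    committed: list of words currently in memory.
    writers:   list of (epoch_number, words, sm_bits), where sm_bits[i] is
               True if that epoch speculatively modified word i.
    Versions are applied in epoch order, so a logically-later epoch's
    write to the same word overrides an earlier one (an assumption here).
    """
    merged = list(committed)
    for _, words, sm_bits in sorted(writers, key=lambda w: w[0]):
        for i, modified in enumerate(sm_bits):
            if modified:
                merged[i] = words[i]
    return merged


committed = [0, 0, 0, 0]
writers = [
    (2, [9, 9, 0, 0], [True, True, False, False]),
    (1, [0, 5, 7, 0], [False, True, True, False]),
]
merged = merge_line(committed, writers)
```

Here epoch 1 contributes word 2, epoch 2 contributes words 0 and 1, and word 3 keeps its committed value.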
Support in SMT machines
(i) two epochs may not modify the same line
(ii) a logically-earlier epoch (t-) cannot see a modification made by a logically-later epoch (t+)
Conclusions
8-75% speedups!
ORB overhead low (used to commit speculative modifications faster)