Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, Andreas Moshovos: Mechanisms for store-wait-free multiprocessors. ISCA 2007: 266-277.
2007
This is a cool paper
Eliminates capacity stalls > Scalable Store Buffer (SSB); eliminates ordering stalls > Atomic Sequence Ordering (ASO)
Store stalls occur in hardware that supports pretty much all consistency models
Why?
Strongly-ordered
Retirement must stall at the first load following a store miss.
A store buffer in front of the L1D helps little, as few stores occur between consecutive loads.
Store prefetching and speculative load execution improve performance,
but store stalls still account for one-third of execution time.
Models that relax load-to-store order (e.g., TSO)
buffer all retired stores
loads bypass these buffered stores.
needs a CAM to search the buffer associatively
store coalescing > pretty useless
RMWs and memory fences > drain the store buffer; retirement stalls until it is empty
models that relax all order
(weak ordering, release consistency, SPARC relaxed memory order (RMO), Intel IA-64, HP Alpha, IBM PowerPC)
no store buffer pressure. freely coalesced stores
memory sync > store buffer has to drain.
thread sync > ordering delay (unnecessary unless a data race actually occurs)
RMW
Scalable Store Buffering <<
closes gap between TSO and RMO
Conventional store buffer
forwards values to matching loads
maintain order among stores for memory consistency
>> CAMs are used
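A minimal sketch of the conventional store buffer described above, to make the capacity stall and the associative search concrete. This is a toy model, not the paper's design: the class name, the `capacity` parameter, and the dict standing in for memory are all assumptions for illustration.

```python
from collections import deque


class ConventionalStoreBuffer:
    """Toy model: FIFO of (addr, value) with an associative lookup for loads."""

    def __init__(self, capacity):
        self.capacity = capacity  # hypothetical size; real buffers are small
        self.entries = deque()    # oldest store on the left

    def retire_store(self, addr, value):
        # A full buffer models a capacity stall: the store cannot retire.
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((addr, value))
        return True

    def load(self, addr, memory):
        # CAM-style search: scan youngest to oldest for a matching address.
        for a, v in reversed(self.entries):
            if a == addr:
                return v  # forward the buffered value to the load
        return memory.get(addr, 0)

    def drain_one(self, memory):
        # Stores leave in program order, which keeps store-to-store order.
        if self.entries:
            addr, value = self.entries.popleft()
            memory[addr] = value


memory = {0x10: 1}
sb = ConventionalStoreBuffer(capacity=2)
sb.retire_store(0x10, 5)
sb.retire_store(0x10, 7)
forwarded = sb.load(0x10, memory)        # youngest matching store wins
stalled = not sb.retire_store(0x20, 9)   # buffer is full, store cannot retire
sb.drain_one(memory)                     # oldest store reaches memory first
```

The search in `load` is the part that does not scale in hardware: every load has to compare against every buffered store, which is why buffer capacity stays small.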
Total Store Order Buffer (TSOB)
no value forwarding
>> FIFO
If two processors update the same cache block, the L1 copy may be partially invalidated.
CPU-private writes (uncommitted stores still in the TSOB) must be preserved:
merge uncommitted stores with any remote updates
by reloading the affected line from the memory system and replaying the uncommitted stores.
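A rough sketch of the merge step above: store values go straight into the L1 copy, the TSOB is a plain FIFO that is never searched, and a remote update triggers a reload plus replay. Everything here is a simplifying assumption (the class and method names, word-granularity lines, and a dict as the memory system); coherence permissions, valid bits, and the victim buffer are left out.

```python
from collections import deque


class ScalableStoreBufferModel:
    """Toy model: L1 holds speculative values, TSOB holds store order."""

    def __init__(self, memory):
        self.memory = memory  # line address -> list of words (shared state)
        self.l1 = {}          # line address -> private copy of the line
        self.tsob = deque()   # FIFO of (line, word, value); never searched by loads

    def _fill(self, line):
        self.l1[line] = list(self.memory[line])

    def store(self, line, word, value):
        if line not in self.l1:
            self._fill(line)
        self.l1[line][word] = value            # value goes straight into L1
        self.tsob.append((line, word, value))  # order is recorded in the FIFO

    def load(self, line, word):
        if line not in self.l1:
            self._fill(line)
        return self.l1[line][word]             # no store-buffer lookup needed

    def remote_update(self, line, word, value):
        # Another processor wrote the line, so the private copy is stale.
        self.memory[line][word] = value
        if line in self.l1:
            self._fill(line)                   # reload from the memory system
            for l, w, v in self.tsob:          # replay uncommitted stores in order
                if l == line:
                    self.l1[line][w] = v

    def commit_oldest(self):
        line, word, value = self.tsob.popleft()
        self.memory[line][word] = value


memory = {0: [0, 0, 0, 0]}
ssb = ScalableStoreBufferModel(memory)
ssb.store(0, 1, 11)                  # private, uncommitted store
ssb.remote_update(0, 2, 22)          # remote write to the same line
line_after_merge = list(ssb.l1[0])   # holds both the local and the remote update
ssb.commit_oldest()
```

Because loads read the L1 copy directly, the FIFO only needs to be walked on the rare invalidation, not on every load.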
Atomic Sequence Ordering (ASO)
dynamically groups accesses into atomic sequences
coarse grain enforcement of ordering constraints
relaxes ordering among individual accesses.
Algorithm: when an ordering constraint would otherwise stall the processor:
(a) take a checkpoint and speculatively retire all stores, RMWs, and memory fences > request write permission for the entire set of written blocks
(b) hardware creates a new checkpoint and starts a new atomic sequence
(c) when all write permissions arrive, the first atomic sequence transitions to commit (write permissions are held until the commit completes)
If another processor writes a block accessed by the sequence (data race):
roll back to the start of the sequence.
Must commit at least one store before initiating another sequence (forward progress).
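The checkpoint / commit / rollback flow above can be sketched as a small state model. This is only an illustration under stated assumptions: `AsoModel`, `AtomicSequence`, and every method name are hypothetical, registers are a dict, write permissions are modelled as a set of pending addresses, and the forward-progress rule is not modelled.

```python
class AtomicSequence:
    def __init__(self, checkpoint):
        self.checkpoint = checkpoint      # register state to restore on rollback
        self.write_set = {}               # addr -> speculative value
        self.read_set = set()             # addrs read inside the sequence
        self.pending_permissions = set()  # addrs still waiting for write permission


class AsoModel:
    def __init__(self, memory):
        self.memory = memory
        self.regs = {}
        self.sequences = []  # oldest first

    def begin_sequence(self):
        # Step (a)/(b): take a checkpoint and open a new atomic sequence.
        self.sequences.append(AtomicSequence(dict(self.regs)))

    def spec_store(self, addr, value):
        seq = self.sequences[-1]
        seq.write_set[addr] = value
        seq.pending_permissions.add(addr)

    def spec_load(self, addr):
        self.sequences[-1].read_set.add(addr)
        for seq in reversed(self.sequences):  # see our own speculative writes
            if addr in seq.write_set:
                return seq.write_set[addr]
        return self.memory.get(addr, 0)

    def permission_arrived(self, addr):
        for seq in self.sequences:
            seq.pending_permissions.discard(addr)
        # Step (c): the oldest sequence commits once all permissions are held.
        while self.sequences and not self.sequences[0].pending_permissions:
            self.memory.update(self.sequences.pop(0).write_set)

    def remote_write(self, addr, value):
        self.memory[addr] = value
        for i, seq in enumerate(self.sequences):
            if addr in seq.read_set or addr in seq.write_set:
                self.regs = dict(seq.checkpoint)  # roll back to sequence start
                del self.sequences[i:]            # discard speculative writes
                return True
        return False


memory = {"flag": 0, "data": 0}
cpu = AsoModel(memory)
cpu.regs["r1"] = 1
cpu.begin_sequence()
cpu.spec_store("data", 42)
cpu.regs["r1"] = 2
cpu.permission_arrived("data")   # first sequence commits atomically
committed_value = memory["data"]

cpu.begin_sequence()             # checkpoint holds r1 == 2
cpu.spec_load("flag")
cpu.spec_store("data", 43)
cpu.regs["r1"] = 3
rolled_back = cpu.remote_write("flag", 1)  # race on a block the sequence read
```

In the usage above the first sequence commits because its only write permission arrives, while the second is rolled back: its speculative store to `data` is discarded and `r1` returns to the checkpointed value.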
Implementation
(1) must detect atomicity violations (a block read in an atomic sequence is written by another processor)
(2) when a violation occurs, recover from the checkpoint
(3) buffer all of the sequence's writes and atomically commit or discard them (uses the SSB)
Hardware Implementation
SSB
TSOB > address, value and sizes of all stores in RAM structure
L1 cache with sub-blocked valid bits.
16 entry victim buffer for store conflicts
Invalidations > TSOB entries are replayed to update the reloaded line
ASO
Atomicity violations: detected with speculatively-read bits