(3.5.4) Store-wait-free Multiprocessors

Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, Andreas Moshovos: Mechanisms for store-wait-free multiprocessors. ISCA 2007: 266-277.

This is a cool paper.
Eliminates capacity stalls via a scalable store buffer (SSB) and ordering stalls via atomic sequence ordering (ASO).

Store stalls occur in hardware under essentially every consistency model.
Why?
Strongly-ordered (sequential consistency)
     retirement must stall at the first load following a store miss
     a store buffer in front of the L1D helps little, since few stores occur between consecutive loads
     store prefetching and speculative load execution improve performance
     even so, store stalls account for roughly a third of execution time
Models that relax load-to-store order (TSO, processor consistency)
     buffer all retired stores
     loads bypass the buffered stores
     a CAM is required to associatively search the buffer for forwarding (see the sketch below)
     store coalescing is largely ineffective, since order among stores must be preserved
     RMWs and memory fences must drain the store buffer, stalling until it empties
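
A minimal software sketch of why TSO needs associative search (toy types and names, mine, not the paper's hardware): stores drain FIFO, but a load must forward from the youngest older store to the same address.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Toy model of a TSO store buffer. Retired stores queue in program order;
// a load searches youngest-to-oldest for a matching address -- the search
// a hardware CAM performs in parallel, which is what limits its capacity.
struct StoreEntry {
    uint64_t addr;
    uint64_t value;
};

struct StoreBuffer {
    std::deque<StoreEntry> entries;  // oldest at front, youngest at back

    void retireStore(uint64_t addr, uint64_t value) {
        entries.push_back({addr, value});  // drains to memory in FIFO order
    }

    // Load bypass: forward from the youngest matching store, if any.
    std::optional<uint64_t> forward(uint64_t addr) const {
        for (auto it = entries.rbegin(); it != entries.rend(); ++it)
            if (it->addr == addr) return it->value;
        return std::nullopt;  // no match: the load reads the cache instead
    }
};

int main() {
    StoreBuffer sb;
    sb.retireStore(0x10, 1);
    sb.retireStore(0x10, 2);  // younger store to the same address
    // A load from 0x10 must see 2, the youngest matching store:
    return sb.forward(0x10).value_or(0) == 2 ? 0 : 1;
}
```
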
Models that relax all order
     (weak ordering, release consistency, SPARC relaxed memory order (RMO), Intel IA-64, HP Alpha, IBM PowerPC)
     no store buffer capacity pressure; stores can be freely coalesced
     memory synchronization still forces the store buffer to drain
     thread synchronization incurs ordering delays, which are unnecessary unless a data race actually occurs (see the litmus test below)
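
To make the store-to-load relaxation concrete, here is the classic store-buffer litmus test in C++ (my example, not from the paper). Sequential consistency forbids the outcome r1 == r2 == 0; with relaxed ordering, as on hardware where each store may still sit in a store buffer when the other thread's load executes, it is allowed.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Store-buffer litmus test (Dekker core). Run it in a loop to observe
// r1 = r2 = 0 under relaxed ordering; promoting the operations to
// memory_order_seq_cst rules that outcome out.
std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);  // may bypass the store to x
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);  // may bypass the store to y
    });
    t1.join();
    t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2);  // "r1=0 r2=0" possible here
    return 0;
}
```
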

RMW
     atomic read-modify-writes stall under every model: the read and write must appear atomic, so the operation waits for write permission before it can complete
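
As a concrete case (my example, not the paper's code), a spinlock acquire is exactly such an RMW: the core must obtain exclusive write permission before even the read half can complete, so it stalls on a coherence miss no matter how relaxed the model is.

```cpp
#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock() {
    // Each iteration is a full atomic RMW: it needs exclusive ownership of
    // the flag's cache line before it can return the old value.
    while (lock_flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
}

void spin_unlock() {
    lock_flag.clear(std::memory_order_release);
}

int main() {
    spin_lock();
    spin_unlock();
    return 0;
}
```
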
Scalable Store Buffering (SSB)
     closes the performance gap between TSO and RMO
     Conventional store buffer
          forwards values to matching loads
          maintains order among stores for memory consistency
          >> CAMs are used, which limits capacity
     Total Store Order Buffer (TSOB)
          no value forwarding (loads are served from the L1 instead)
          >> a simple FIFO in RAM, hence scalable
          if two processors update the same cache block, the local L1 line may be partially invalidated
               CPU-private writes (uncommitted stores still in the TSOB) must be preserved
               uncommitted stores are merged with any remote updates
               by reloading the affected line from the memory system and replaying the uncommitted stores (sketched below)
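
A toy model of that replay path (all names mine, and a software stand-in for what is hardware in the paper): stores enqueue in a plain FIFO and write the L1 directly; an invalidation reloads the line and replays the uncommitted stores over it.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// TSOB entries hold address, value, and size per store; no CAM anywhere.
struct TsobEntry { uint64_t addr; uint64_t value; uint8_t size; };

struct Ssb {
    std::deque<TsobEntry> tsob;                 // stores in program order
    std::unordered_map<uint64_t, uint64_t> l1;  // stand-in for L1 data

    void retireStore(uint64_t addr, uint64_t value, uint8_t size) {
        tsob.push_back({addr, value, size});    // enqueue; no search needed
        l1[addr] = value;                       // stores write the L1 directly
    }

    // Invalidation handler: take the reloaded (remotely updated) line, then
    // replay uncommitted local stores over it, oldest first, to merge.
    void onInvalidate(uint64_t lineAddr, uint64_t lineBytes,
                      const std::unordered_map<uint64_t, uint64_t>& freshLine) {
        for (const auto& [a, v] : freshLine) l1[a] = v;  // remote updates
        for (const auto& e : tsob)
            if (e.addr >= lineAddr && e.addr < lineAddr + lineBytes)
                l1[e.addr] = e.value;  // uncommitted local write wins
    }
};

int main() { return 0; }  // structure sketch only; see onInvalidate above
```
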

Atomic Sequence Ordering (ASO)
     dynamically groups accesses into atomic sequences
     coarse grain enforcement of ordering constraints
     relaxes ordering among individual accesses. 
          algo: if ordering is required by the processor
               (a) checkpoint, then speculatively retire subsequent stores, RMWs, and memory fences > request write permission for the entire set
               (b) hardware creates a new checkpoint and opens a new atomic sequence
               (c) when all write permissions arrive, the first atomic sequence transitions to commit (permissions are held until the commit completes)
          if another processor writes a location the sequence accessed (data race)
               roll back to the start of the sequence
               to guarantee forward progress, complete at least one store before initiating another sequence
     Implementation
          (1) must detect when a value read inside an atomic sequence is overwritten by another processor (atomicity violation)
          (2) when this occurs, recover via the checkpoint
          (3) buffer all of the sequence's writes and atomically commit or discard them (the SSB provides this; see the sketch below)
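
A rough sketch of the resulting sequence life cycle (states and member names are my own, not the paper's): accumulate speculatively-retired stores while write permissions are gathered, commit atomically once all permissions arrive, and squash back to the checkpoint on a violation.

```cpp
#include <vector>

enum class SeqState { Accumulate, AwaitPermissions, Commit };

struct AtomicSequence {
    SeqState state = SeqState::Accumulate;
    std::vector<long> speculativeStores;  // buffered in the SSB in hardware
    int pendingWritePermissions = 0;

    void retireStore(long addr) {          // speculative retirement
        speculativeStores.push_back(addr);
        ++pendingWritePermissions;         // write permission requested
    }
    void close() {                         // ordering point: seal the sequence
        state = pendingWritePermissions ? SeqState::AwaitPermissions
                                        : SeqState::Commit;
    }
    void permissionArrived() {
        if (--pendingWritePermissions == 0 && state == SeqState::AwaitPermissions)
            state = SeqState::Commit;      // hold permissions until commit ends
    }
    void atomicityViolation() {            // remote write to a read location
        speculativeStores.clear();         // discard buffered writes
        state = SeqState::Accumulate;      // re-execute from the checkpoint
    }
};

int main() {
    AtomicSequence seq;
    seq.retireStore(0xA0);
    seq.retireStore(0xB0);
    seq.close();              // wait for both write permissions
    seq.permissionArrived();
    seq.permissionArrived();  // now the sequence commits
    return seq.state == SeqState::Commit ? 0 : 1;
}
```
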

Hardware Implementation
     SSB
          TSOB > address, value, and size of every store, kept in a RAM structure
          L1 is sub-blocked, with a valid bit per sub-block
          16-entry victim buffer for store conflicts
          invalidations > uncommitted stores from the TSOB are replayed to rebuild the affected line
     ASO
          atomicity violations are detected via speculatively-read bits (sketch below)
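
A sketch of the per-line metadata this implies (field names are assumptions on my part): sub-block valid bits let a line be partially valid while uncommitted stores cover only some sub-blocks, and the speculatively-read bit turns an incoming invalidation into a detected atomicity violation.

```cpp
#include <bitset>

struct L1LineMeta {
    std::bitset<8> subBlockValid;    // e.g., 8 sub-blocks per 64B line
    bool speculativelyRead = false;  // set by reads inside an atomic sequence
    bool hasUncommittedStores = false;
};

// Invalidation from the coherence protocol:
//   speculativelyRead     -> atomicity violation, roll back the sequence
//   hasUncommittedStores  -> reload the line and replay TSOB stores over it
bool violatesAtomicity(const L1LineMeta& m) { return m.speculativelyRead; }

int main() {
    L1LineMeta m;
    m.speculativelyRead = true;
    return violatesAtomicity(m) ? 0 : 1;
}
```
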