(3.5.4) Store-wait-free Multiprocessors

Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, Andreas Moshovos: Mechanisms for store-wait-free multiprocessors. ISCA 2007: 266-277.

This is a cool paper.
Eliminates capacity stalls via a scalable store buffer (SSB) and ordering stalls via atomic sequence ordering (ASO).

Store stalls occur in hardware under essentially every consistency model.
Why?
Strongly-ordered (sequential consistency)
     retirement must stall at the first load following a store miss
     a store buffer in front of the L1D helps little, since few stores occur between consecutive loads
     store prefetching and speculative load execution improve performance
     even so, store stalls account for roughly a third of execution time
Models that relax load-to-store order (TSO, processor consistency)
     buffer all retired stores
     loads bypass the buffered stores
     a CAM is required to associatively search the buffer for forwarding (see the sketch below)
     store coalescing is largely ineffective, since order among stores must be preserved
     RMWs and memory fences must drain the store buffer, stalling until it empties
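
A minimal software sketch of why TSO needs associative search (toy types and names, mine, not the paper's hardware): stores drain FIFO, but a load must forward from the youngest older store to the same address.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Toy model of a TSO store buffer. Retired stores queue in program order;
// a load searches youngest-to-oldest for a matching address -- the search
// a hardware CAM performs in parallel, which is what limits its capacity.
struct StoreEntry {
    uint64_t addr;
    uint64_t value;
};

struct StoreBuffer {
    std::deque<StoreEntry> entries;  // oldest at front, youngest at back

    void retireStore(uint64_t addr, uint64_t value) {
        entries.push_back({addr, value});  // drains to memory in FIFO order
    }

    // Load bypass: forward from the youngest matching store, if any.
    std::optional<uint64_t> forward(uint64_t addr) const {
        for (auto it = entries.rbegin(); it != entries.rend(); ++it)
            if (it->addr == addr) return it->value;
        return std::nullopt;  // no match: the load reads the cache instead
    }
};

int main() {
    StoreBuffer sb;
    sb.retireStore(0x10, 1);
    sb.retireStore(0x10, 2);  // younger store to the same address
    // A load from 0x10 must see 2, the youngest matching store:
    return sb.forward(0x10).value_or(0) == 2 ? 0 : 1;
}
```
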
Models that relax all order
     (weak ordering, release consistency, SPARC relaxed memory order (RMO), Intel IA-64, HP Alpha, IBM PowerPC)
     no store buffer capacity pressure; stores can be freely coalesced
     memory synchronization still forces the store buffer to drain
     thread synchronization incurs ordering delays, which are unnecessary unless a data race actually occurs (see the litmus test below)
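
To make the store-to-load relaxation concrete, here is the classic store-buffer litmus test in C++ (my example, not from the paper). Sequential consistency forbids the outcome r1 == r2 == 0; with relaxed ordering, as on hardware where each store may still sit in a store buffer when the other thread's load executes, it is allowed.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Store-buffer litmus test (Dekker core). Run it in a loop to observe
// r1 = r2 = 0 under relaxed ordering; promoting the operations to
// memory_order_seq_cst rules that outcome out.
std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);  // may bypass the store to x
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);  // may bypass the store to y
    });
    t1.join();
    t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2);  // "r1=0 r2=0" possible here
    return 0;
}
```
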

RMW
     atomic read-modify-writes stall under every model: the read and write must appear atomic, so the operation waits for write permission before it can complete
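
As a concrete case (my example, not the paper's code), a spinlock acquire is exactly such an RMW: the core must obtain exclusive write permission before even the read half can complete, so it stalls on a coherence miss no matter how relaxed the model is.

```cpp
#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock() {
    // Each iteration is a full atomic RMW: it needs exclusive ownership of
    // the flag's cache line before it can return the old value.
    while (lock_flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
}

void spin_unlock() {
    lock_flag.clear(std::memory_order_release);
}

int main() {
    spin_lock();
    spin_unlock();
    return 0;
}
```
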
Scalable Store Buffering (SSB)
     closes the performance gap between TSO and RMO
     Conventional store buffer
          forwards values to matching loads
          maintains order among stores for memory consistency
          >> CAMs are used, which limits capacity
     Total Store Order Buffer (TSOB)
          no value forwarding (loads are served from the L1 instead)
          >> a simple FIFO in RAM, hence scalable
          if two processors update the same cache block, the local L1 line may be partially invalidated
               CPU-private writes (uncommitted stores still in the TSOB) must be preserved
               uncommitted stores are merged with any remote updates
               by reloading the affected line from the memory system and replaying the uncommitted stores (sketched below)
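
A toy model of that replay path (all names mine, and a software stand-in for what is hardware in the paper): stores enqueue in a plain FIFO and write the L1 directly; an invalidation reloads the line and replays the uncommitted stores over it.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// TSOB entries hold address, value, and size per store; no CAM anywhere.
struct TsobEntry { uint64_t addr; uint64_t value; uint8_t size; };

struct Ssb {
    std::deque<TsobEntry> tsob;                 // stores in program order
    std::unordered_map<uint64_t, uint64_t> l1;  // stand-in for L1 data

    void retireStore(uint64_t addr, uint64_t value, uint8_t size) {
        tsob.push_back({addr, value, size});    // enqueue; no search needed
        l1[addr] = value;                       // stores write the L1 directly
    }

    // Invalidation handler: take the reloaded (remotely updated) line, then
    // replay uncommitted local stores over it, oldest first, to merge.
    void onInvalidate(uint64_t lineAddr, uint64_t lineBytes,
                      const std::unordered_map<uint64_t, uint64_t>& freshLine) {
        for (const auto& [a, v] : freshLine) l1[a] = v;  // remote updates
        for (const auto& e : tsob)
            if (e.addr >= lineAddr && e.addr < lineAddr + lineBytes)
                l1[e.addr] = e.value;  // uncommitted local write wins
    }
};

int main() { return 0; }  // structure sketch only; see onInvalidate above
```
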

Atomic Sequence Ordering (ASO)
     dynamically groups accesses into atomic sequences
     coarse grain enforcement of ordering constraints
     relaxes ordering among individual accesses. 
          algo: if ordering is required by the processor
               (a) checkpoint, then speculatively retire subsequent stores, RMWs, and memory fences > request write permission for the entire set
               (b) hardware creates a new checkpoint and opens a new atomic sequence
               (c) when all write permissions arrive, the first atomic sequence transitions to commit (permissions are held until the commit completes)
          if another processor writes a location the sequence accessed (data race)
               roll back to the start of the sequence
               to guarantee forward progress, complete at least one store before initiating another sequence
     Implementation
          (1) must detect when a value read inside an atomic sequence is overwritten by another processor (atomicity violation)
          (2) when this occurs, recover via the checkpoint
          (3) buffer all of the sequence's writes and atomically commit or discard them (the SSB provides this; see the sketch below)
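
A rough sketch of the resulting sequence life cycle (states and member names are my own, not the paper's): accumulate speculatively-retired stores while write permissions are gathered, commit atomically once all permissions arrive, and squash back to the checkpoint on a violation.

```cpp
#include <vector>

enum class SeqState { Accumulate, AwaitPermissions, Commit };

struct AtomicSequence {
    SeqState state = SeqState::Accumulate;
    std::vector<long> speculativeStores;  // buffered in the SSB in hardware
    int pendingWritePermissions = 0;

    void retireStore(long addr) {          // speculative retirement
        speculativeStores.push_back(addr);
        ++pendingWritePermissions;         // write permission requested
    }
    void close() {                         // ordering point: seal the sequence
        state = pendingWritePermissions ? SeqState::AwaitPermissions
                                        : SeqState::Commit;
    }
    void permissionArrived() {
        if (--pendingWritePermissions == 0 && state == SeqState::AwaitPermissions)
            state = SeqState::Commit;      // hold permissions until commit ends
    }
    void atomicityViolation() {            // remote write to a read location
        speculativeStores.clear();         // discard buffered writes
        state = SeqState::Accumulate;      // re-execute from the checkpoint
    }
};

int main() {
    AtomicSequence seq;
    seq.retireStore(0xA0);
    seq.retireStore(0xB0);
    seq.close();              // wait for both write permissions
    seq.permissionArrived();
    seq.permissionArrived();  // now the sequence commits
    return seq.state == SeqState::Commit ? 0 : 1;
}
```
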

Hardware Implementation
     SSB
          TSOB > address, value, and size of every store, kept in a RAM structure
          L1 is sub-blocked, with a valid bit per sub-block
          16-entry victim buffer for store conflicts
          invalidations > uncommitted stores from the TSOB are replayed to rebuild the affected line
     ASO
          atomicity violations are detected via speculatively-read bits (sketch below)
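
A sketch of the per-line metadata this implies (field names are assumptions on my part): sub-block valid bits let a line be partially valid while uncommitted stores cover only some sub-blocks, and the speculatively-read bit turns an incoming invalidation into a detected atomicity violation.

```cpp
#include <bitset>

struct L1LineMeta {
    std::bitset<8> subBlockValid;    // e.g., 8 sub-blocks per 64B line
    bool speculativelyRead = false;  // set by reads inside an atomic sequence
    bool hasUncommittedStores = false;
};

// Invalidation from the coherence protocol:
//   speculativelyRead     -> atomicity violation, roll back the sequence
//   hasUncommittedStores  -> reload the line and replay TSOB stores over it
bool violatesAtomicity(const L1LineMeta& m) { return m.speculativelyRead; }

int main() {
    L1LineMeta m;
    m.speculativelyRead = true;
    return violatesAtomicity(m) ? 0 : 1;
}
```
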