--------------------------------------------------------------------
CS 757 Parallel Computer Architecture Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Coherence 4 -- Directories, cont.
------------

OUTLINE
* Directory races
* PutS
* Case Studies: SGI Origin, AMD, Intel QPI
* Liveness
* Advanced Topics
  * Victim Replication
  * Virtual Hierarchy

Chapter 8, cont.
---------

Directory Races
* (Non-stalling dir protocol) OMIT
* Interconnects w/o point-to-point ordering, e.g., Fig. 8.12 & 8.13 -- handwave

Silent S replacement or PutS
* Silent: saves PutS+ack, but only a win if no other core later does a GetM
  (as then an inv+ack is sent anyway)
* Explicit: allows dir state to be precise (E state later, fewer recalls);
  fewer races

SGI Origin 2000
* Bit vector for small systems; coarse vector for larger
* No point-to-point ordering, so lots of races
* Non-ownership E state (so in E both dir and owner may send data)

AMD Coherence HyperTransport
* First version: Dir[0]B
  + req to home node "dir", bcast to all (not totally-ordered), all respond
    to req, req unblocks dir (also prefetch from memory while waiting for
    others; home can send memory data to the requestor that may be
    overridden by the owner; in this case req usually cancels the memory data)
  + simple, non-scalable, best/worst of snooping/directories?
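The Dir[0]B flow above can be sketched as a toy model (a minimal sketch: the function and message accounting are my own illustration, not AMD's; it assumes one probe and one response per cache, plus the home's speculative memory response):

```python
# Toy sketch of the first-version coherent HyperTransport flow described
# above: req -> home "dir" -> broadcast to all caches; every cache responds
# to the requestor; home speculatively sends memory data, which is
# overridden if an owner supplies (dirty) data; req then unblocks the home.
# All names and counts here are illustrative assumptions.

def resolve_getm(mem_data, caches):
    """caches: dict core_id -> (state, data). Returns (data_used, msg_count).

    Messages counted: req->home, home's broadcast probe (one per cache),
    each cache's response to the requestor, home's speculative memory
    response, and the requestor's unblock back to the home node.
    """
    msgs = 1                      # request to home node "dir"
    msgs += len(caches)           # home broadcasts probe to all caches
    owner_data = None
    for state, data in caches.values():
        msgs += 1                 # every cache responds to the requestor
        if state == "M":          # an owner supplies its dirty data
            owner_data = data
    msgs += 1                     # home's speculative memory response
    msgs += 1                     # requestor unblocks the home node
    # Owner data (if any) overrides the memory data the home sent early.
    return (owner_data if owner_data is not None else mem_data), msgs
```

Message count grows linearly with cache count on every miss, which is the "simple, non-scalable" point: snooping-style broadcast traffic on top of a directory-style home-node indirection.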
* Then add Probe Filter

Intel QPI
* Adds F (Forward) state
  + clean & silently evicted, like S
  + single owner, like O
  + provides data from "closer" cache
* Home Snoop Mode
  + actually a directory protocol at the home node
* Source Snoop Mode
  + req bcasts (not totally-ordered); all respond to home (also owner may
    send data to req), home unblocks requestor
  + racing reqs ordered by who got to the previous owner first (not by dir)

Chapter 9
---------

9.3 MAINTAINING LIVENESS

Deadlock
* Want 2 or more resources
* Won't give up held resources
* Cycle in graph of the order in which anyone obtains resources
* Protocol deadlock -- waiting for a message that is being blocked from sending
* Resource deadlock -- e.g., must allocate a buffer to do something, none
  are available, and none will be freed until the blocked work is done
* Example solutions: requestor always allocates a buffer for the data
  response before putting out the request; a core can respond to OTHERS'
  requests without allocating a resource (e.g., another's GetM when we're
  in state M)

Interconnect Deadlocks
* E.g., routing -- use virtual channels
* E.g., protocol request/response -- two virtual networks

Livelock
* Dueling GetMs might never execute a store
* Holding multiple blocks to do an instruction might deadlock
* Solution: for the oldest instruction only, do at least one instruction
  when the block arrives

Starvation
* Arbitration
* NACKs

OLD NOTES {{

Deadlock, etc.

Prevent:

Deadlock:
* all system activity ceases
* cycle of resource dependences

Livelock:
* no processor makes forward progress
* constant on-going transactions at hardware level
* e.g., simultaneous writes in invalidation-based protocol

Starvation:
* some processors make no forward progress
* e.g., interleaved memory system with NACK on bank busy

Examples

Request-reply protocols can lead to deadlock
* When issuing requests, must service incoming transactions
* e.g.,
cache awaiting bus grant must snoop & flush blocks
* else may not respond to the request that will release the bus: deadlock

How to avoid:
* Responses are never delayed by requests waiting for a response
* Responses are guaranteed to be sunk
* Requests will eventually be serviced, since the number of responses is
  bounded by outstanding requests

Livelock:
* window of vulnerability problem [Kubi et al., MIT]
* handling invalidations between obtaining ownership & write
* solution: don't let exclusive ownership be stolen before the write

Starvation:
* solve by using fair arbitration on bus and FIFO buffers

}} END OLD

Instruction Caches
* If code is read-only, who needs coherence?
* Actually, code is at least written by DMA (easy to just do coherence)
* x86 allows arbitrary self-modifying code (requires some pipeline compares)

TLBs cache PTs
* If a PT is written, are the TLB entries now incoherent?
* Yes; can have HW support (e.g., IBM Power tlbie)
* TLB shootdown more common: stop all cores (via interrupts), synch, zap
  (potential) TLB entries, synch, restart all cores

Virtual caches
* Defer address translation until after L1
* Use Wang et al. solution
* Power may make this more attractive

Write-thru
* Simple but lots of BW
* Niagara 1-2, AMD Bulldozer
* TSO helps
* Going away, between private L1 and private WB L2

Coherence
DMA
Multi-level
Migratory Sharing
False Sharing

----------------------------------
Zhang & Asanovic ISCA05; see Victim Replication slides
----------------------------------
Marty & Hill Micro08; see Virtual Hierarchy slides

Reviews
* Andy: Whither "tiled"?
* Syed: VM migration
* Tony: Why might two VMs have the same data?
* Aditya: Level 2 in more detail
* Andrew E.: Disable subsections to save power
* Daniel: VHnull
* David: more on implementation
* Asim: HW VM support?
* Brian: More on tokens for VHb
* Eric: HW complexity of VH?
* Guolang: Found a bug?
* Marc: Whither consistency models?
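Looking back at the Livelock section: the "dueling GetMs" fix (perform at least one store when the block arrives in M) can be illustrated with a toy two-core model. This is a sketch under assumed alternating ownership stealing, not a real protocol implementation:

```python
# Toy model of "dueling GetMs": two cores each need one store to the same
# block and keep stealing M ownership from each other. Without the rule
# "do at least one store when the block arrives", the invalidation from
# the rival's GetM is honored before the store retires, the block
# ping-pongs, and no store ever completes. Names/scheduling are assumptions.

def run(store_before_yield, max_rounds=20):
    pending = {0: True, 1: True}   # each core has one store left to do
    holder = 0                     # core currently holding the block in M
    for _ in range(max_rounds):
        if not any(pending.values()):
            break                  # both stores performed: forward progress
        if store_before_yield and pending[holder]:
            pending[holder] = False  # store retires before ownership is lost
        # rival's GetM steals the block (invalidation processed); without
        # the rule, this happens before the holder's store can retire
        holder = 1 - holder
    return pending  # which cores still have an unperformed store
```

With the rule enabled both stores complete within two rounds; with it disabled the simulation exhausts its round budget with both stores still pending, i.e., livelock.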