--------------------------------------------------------------------
CS 757 Parallel Computer Architecture Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Coherence 4 -- Directories, cont.
------------

OUTLINE
* Directory races
* PutS
* Case Studies: SGI Origin, AMD, Intel QPI
* Liveness
* Advanced Topics
  * Victim Replication
  * Virtual Hierarchy

Chapter 8, cont.
---------

Directory Races
* (Non-stalling dir protocol) OMIT
* Interconnects w/o point-to-point ordering, e.g., Fig. 8.12 & 8.13 -- handwave

Silent S replacement or PutS
* Silent: saves PutS+ack, but only a win if no other core later does a GetM
  (as then an inv+ack is sent anyway)
* Explicit: allows dir state to be precise (E state later, fewer recalls);
  fewer races

SGI Origin 2000
* Bit vector for small systems; coarse vector for larger
* No point-to-point ordering, so lots of races
* Non-ownership E state (so in E both dir and owner may send data)

AMD Coherence HyperTransport
* First version: Dir[0]B
  + req to home node "dir", bcast to all (not totally-ordered), all respond
    to req, req unblocks dir (also prefetch from memory while waiting for
    others; home can send memory data to the requestor that may be
    overridden by the owner; in this case req usually cancels the memory data)
  + simple, non-scalable, best/worst of snooping/directories?
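The Dir[0]B flow above can be sketched as a toy model (a minimal sketch: the function and message accounting are my own illustration, not AMD's; it assumes one probe and one response per cache, plus the home's speculative memory response):

```python
# Toy sketch of the first-version coherent HyperTransport flow described
# above: req -> home "dir" -> broadcast to all caches; every cache responds
# to the requestor; home speculatively sends memory data, which is
# overridden if an owner supplies (dirty) data; req then unblocks the home.
# All names and counts here are illustrative assumptions.

def resolve_getm(mem_data, caches):
    """caches: dict core_id -> (state, data). Returns (data_used, msg_count).

    Messages counted: req->home, home's broadcast probe (one per cache),
    each cache's response to the requestor, home's speculative memory
    response, and the requestor's unblock back to the home node.
    """
    msgs = 1                      # request to home node "dir"
    msgs += len(caches)           # home broadcasts probe to all caches
    owner_data = None
    for state, data in caches.values():
        msgs += 1                 # every cache responds to the requestor
        if state == "M":          # an owner supplies its dirty data
            owner_data = data
    msgs += 1                     # home's speculative memory response
    msgs += 1                     # requestor unblocks the home node
    # Owner data (if any) overrides the memory data the home sent early.
    return (owner_data if owner_data is not None else mem_data), msgs
```

Message count grows linearly with cache count on every miss, which is the "simple, non-scalable" point: snooping-style broadcast traffic on top of a directory-style home-node indirection.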
* Then add Probe Filter

Intel QPI
* Adds F (Forward) state
  + clean & silently evicted, like S
  + single owner, like O
  + provides data from "closer" cache
* Home Snoop Mode
  + actually a directory protocol at the home node
* Source Snoop Mode
  + req bcasts (not totally-ordered); all respond to home (also owner may
    send data to req), home unblocks requestor
  + racing reqs ordered by who got to the previous owner first (not by dir)

Chapter 9
---------

9.3 MAINTAINING LIVENESS

Deadlock
* Want 2 or more resources
* Won't give up held resources
* Cycle in graph of the order in which anyone obtains resources
* Protocol deadlock -- waiting for a message that is being blocked from sending
* Resource deadlock -- e.g., must allocate a buffer to do something, none
  are available, and none will be freed until the blocked work is done
* Example solutions: requestor always allocates a buffer for the data
  response before putting out the request; a core can respond to OTHERS'
  requests without allocating a resource (e.g., another's GetM when we're
  in state M)

Interconnect Deadlocks
* E.g., routing -- use virtual channels
* E.g., protocol request/response -- two virtual networks

Livelock
* Dueling GetMs might never execute a store
* Holding multiple blocks to do an instruction might deadlock
* Solution: for the oldest instruction only, do at least one instruction
  when the block arrives

Starvation
* Arbitration
* NACKs

OLD NOTES {{

Deadlock, etc.

Prevent:

Deadlock:
* all system activity ceases
* cycle of resource dependences

Livelock:
* no processor makes forward progress
* constant on-going transactions at hardware level
* e.g., simultaneous writes in invalidation-based protocol

Starvation:
* some processors make no forward progress
* e.g., interleaved memory system with NACK on bank busy

Examples

Request-reply protocols can lead to deadlock
* When issuing requests, must service incoming transactions
* e.g.,
cache awaiting bus grant must snoop & flush blocks
* else may not respond to the request that will release the bus: deadlock

How to avoid:
* Responses are never delayed by requests waiting for a response
* Responses are guaranteed to be sunk
* Requests will eventually be serviced, since the number of responses is
  bounded by outstanding requests

Livelock:
* window of vulnerability problem [Kubi et al., MIT]
* handling invalidations between obtaining ownership & write
* solution: don't let exclusive ownership be stolen before the write

Starvation:
* solve by using fair arbitration on bus and FIFO buffers

}} END OLD

Instruction Caches
* If code is read-only, who needs coherence?
* Actually, code is at least written by DMA (easy to just do coherence)
* x86 allows arbitrary self-modifying code (requires some pipeline compares)

TLBs cache PTs
* If a PT is written, are the TLB entries now incoherent?
* Yes; can have HW support (e.g., IBM Power tlbie)
* TLB shootdown more common: stop all cores (via interrupts), synch, zap
  (potential) TLB entries, synch, restart all cores

Virtual caches
* Defer address translation until after L1
* Use Wang et al. solution
* Power may make this more attractive

Write-thru
* Simple but lots of BW
* Niagara 1-2, AMD Bulldozer
* TSO helps
* Going away, between private L1 and private WB L2

Coherence
DMA
Multi-level
Migratory Sharing
False Sharing

----------------------------------
Zhang & Asanovic ISCA05; see Victim Replication slides
----------------------------------
Marty & Hill Micro08; see Virtual Hierarchy slides

Reviews
* Andy: Whither "tiled"?
* Syed: VM migration
* Tony: Why might two VMs have the same data?
* Aditya: Level 2 in more detail
* Andrew E.: Disable subsections to save power
* Daniel: VHnull
* David: more on implementation
* Asim: HW VM support?
* Brian: More on tokens for VHb
* Eric: HW complexity of VH?
* Guolang: Found a bug?
* Marc: Whither consistency models?
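Looking back at the Livelock section: the "dueling GetMs" fix (perform at least one store when the block arrives in M) can be illustrated with a toy two-core model. This is a sketch under assumed alternating ownership stealing, not a real protocol implementation:

```python
# Toy model of "dueling GetMs": two cores each need one store to the same
# block and keep stealing M ownership from each other. Without the rule
# "do at least one store when the block arrives", the invalidation from
# the rival's GetM is honored before the store retires, the block
# ping-pongs, and no store ever completes. Names/scheduling are assumptions.

def run(store_before_yield, max_rounds=20):
    pending = {0: True, 1: True}   # each core has one store left to do
    holder = 0                     # core currently holding the block in M
    for _ in range(max_rounds):
        if not any(pending.values()):
            break                  # both stores performed: forward progress
        if store_before_yield and pending[holder]:
            pending[holder] = False  # store retires before ownership is lost
        # rival's GetM steals the block (invalidation processed); without
        # the rule, this happens before the holder's store can retire
        holder = 1 - holder
    return pending  # which cores still have an unperformed store
```

With the rule enabled both stores complete within two rounds; with it disabled the simulation exhausts its round budget with both stores still pending, i.e., livelock.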