--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Coherence 1
------------

Outline
  Review System Model & Coherence Invariants
  Specifying Coherence Protocols
  MOESI States
  Simple Snooping
  Non-Atomic Requests
  Exclusive (E) and Owned (O) States (now or next time)


Review System Model & Coherence Invariants
------------------------------

System model -- Figure 2.1
  cores w/ private caches w/ controllers
  icn
  LLC w/ memory w/ controller
  Offchip DRAM


Coherence Invariants
------------------------------

1. Single-Writer, Multiple-Read (SWMR) Invariant. For any memory
   location A, at any given (logical) time, there exists only a single
   core that may write to A (and can also read it) or some number of
   cores that may only read A.

2. Data-Value Invariant. The value of the memory location at the start
   of an epoch is the same as the value of the memory location at the
   end of its last read–write epoch.

Show Figure 2.3 timeline
  read-write at 1
  read-only at 2, 3
  read-write at 2
  ..

Maintain invariants
* Use blocks (64B)
* FSM at caches & LLC
* communication with message/bus

Goal:
* Make caches invisible, as in uniprocessors
* Once invisible, what does memory do? (can't refer to caches)


Specifying Coherence Protocols
------------------------------

FSMs communicating via messages.
Use Table:
  Rows for stable states and transient states
  Columns for events -- core requests and incoming messages

VI:
  I-->V  Own-Get_DataResp
  V-->I  Own-Put or Other-Get

Go over Table 6.2
  Transient state IV[D]

Note:
  P1's cache has a virtual FSM per block
  Pi's cache has the same FSM (but may be in a different state)
  Memory (LLC) also has a virtual FSM per block
    (different from cache FSM) -- See Table 6.3


MOESI
-----

               Validity  Dirtiness  Exclusivity  Owned
  Modified        X          X           X         X
  (Owned)         X                                X
  (Exclusive)     X                      X
  Shared          X
  Invalid

Stable states stored in cache, e.g., ceiling(log2(5)) = 3 bits
Transient states in MSHRs

Common Transactions:
  GetS
  GetM
  Upgrade
  (PutS)
  (PutE)
  PutO
  PutM

Common Requests:
  load, store, RMW, i-fetch, RO-prefetch, RW-prefetch, replace

Protocol Taxonomy (simplified)
* Snooping: totally-ordered broadcast (Chapter 7)
* Directory: point-to-point messages with a level of indirection (Chapter 8)

Write Invalidate vs. Write Update
* assumed write invalidate & this is more common
* write update -- hard to implement memory consistency models
  and too much traffic


=====================
Snooping (Chapter 7)
=====================

Simple Snooping
--------------------

* Atomic Requests (request ordered same cycle it is issued)
* Atomic Transactions (no other request to SAME block until transaction done)

Show
* $ FSM Figure 7.1
* mem FSM Figure 7.1
* system

Go over FSMs in Figures 7.5 and 7.6
  shaded -- not possible
  blank -- no action
  Mem: IorS[D] -- memory waiting for writeback as part of a core's
    M-to-S transition
  store in S is GetM/SM[D] -- sends data redundantly -- could have Upgrade

Figure 7.8-7.9 -- Non-atomic requests (e.g., queue to get on bus),
Atomic Transactions

Store in I,
  send GetM ==> IM[AD],
  see own GetM (ordering point), could "do" store ==> IM[D],
  gets data ==> M, finishes store

Consider "window of vulnerability"

Store in S,
  send GetM/Upgrade ==> SM[AD],
  see OTHER GetM so invalidate ==> IM[AD],
  own GetM ==> IM[D],
  gets data ==> M, finishes store
Makes upgrade transaction trickier

Normal Writeback
  writeback in M,
  send PutM ==> MI[A],
  see own PutM, send data ==> I

Writeback racing other GetM
  writeback in M,
  send PutM ==> MI[A],
  see other GetM, send data ==> II[A],
  see own PutM ==> I


Exclusive (E) State
--------------------

* Idea: on GetS, if no other sharers, go to E instead of S
* If subsequent store, silently go to M
* If other GetS, silently go to S
* If replace in E, treat like S (silent replacement)

Important for (mostly) private data
* Otherwise read miss (GetS) then write miss (GetM)
* With E, read miss (GetS) and then silent upgrade to M

How to implement "if no other sharers"
* Add state to LLC recording whether there is (was) at least one sharer
* Before LLCs, often had logical "wired" OR of sharers -- shared line
  [Before LLCs, memory often found out whether there was an M block
  with OR as well -- owned line]

Look at FSMs 7.4 and 7.5


Owned (O) State
--------------------

Advantages
  Otherwise in M & see other GetS, must send data to BOTH requestor and LLC
  O state eliminates extra data message and LLC updates

  Historically, O also allowed subsequent GetS to be sourced from a cache
  -- which could be faster than memory (old days) but probably not LLC

See FSM in Figure 7.6 and 7.7
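

Sketch: replaying the non-atomic-request sequences
--------------------

The four event sequences in the non-atomic-request part of the snooping
section (store in I, store in S racing another GetM, normal writeback,
writeback racing another GetM) can be replayed with a small table-driven
FSM for one block in one cache. This is only a minimal sketch, not the
textbook's tables: the state spellings (IM_AD for IM[AD], etc.), the event
names, and the run() helper are hypothetical, and the SM_AD/SM_D rows for
the no-race case are an assumption added for completeness.

```python
# Hypothetical per-block cache-controller FSM (non-atomic requests,
# atomic transactions). Names are illustrative, not from the textbook.
# Each entry maps (state, event) -> (next_state, action or None).
TRANSITIONS = {
    # Store miss in I
    ("I", "store"): ("IM_AD", "issue GetM"),
    ("IM_AD", "own_GetM"): ("IM_D", None),          # ordering point
    ("IM_D", "data"): ("M", "perform store"),
    # Store in S (upgrade path), including the window of vulnerability
    ("S", "store"): ("SM_AD", "issue GetM"),
    ("SM_AD", "other_GetM"): ("IM_AD", "invalidate copy"),
    ("SM_AD", "own_GetM"): ("SM_D", None),          # assumed no-race case
    ("SM_D", "data"): ("M", "perform store"),       # assumed no-race case
    # Writeback from M
    ("M", "replace"): ("MI_A", "issue PutM"),
    ("MI_A", "own_PutM"): ("I", "send data to memory"),
    ("MI_A", "other_GetM"): ("II_A", "send data to requestor"),
    ("II_A", "own_PutM"): ("I", None),
}


def run(state, events):
    """Apply events in order and return the list of states visited."""
    trace = [state]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition for {event!r} in state {state!r}")
        state, _action = TRANSITIONS[key]
        trace.append(state)
    return trace


if __name__ == "__main__":
    print(run("I", ["store", "own_GetM", "data"]))
    print(run("S", ["store", "other_GetM", "own_GetM", "data"]))
    print(run("M", ["replace", "own_PutM"]))
    print(run("M", ["replace", "other_GetM", "own_PutM"]))
```

A (state, event) pair missing from the table raises an error, which plays
the role of the "shaded -- not possible" cells; a real controller would
also need the load, GetS, and stall cases omitted here.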