--------------------------------------------------------------------
CS 757 Parallel Computer Architecture     Spring 2012     Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Coherence 2
------------

Outline (two lectures)
  Exclusive (E) and Owned (O) States (now or last time)
  Non-Atomic Bus
  Qualitative Sharing Patterns (Starfire)
  Hammond: TLS (Hill: Use Simple Mem Models)

Review System Model & Coherence Invariants
------------------------------------------

System model -- Figure 2.1
  * cores w/ private caches w/ controllers
  * interconnection network (icn)
  * LLC w/ memory controller
  * off-chip DRAM

Coherence Invariants
--------------------

1. Single-Writer, Multiple-Reader (SWMR) Invariant. For any memory location A,
   at any given (logical) time, there exists either a single core that may
   write to A (and may also read it) or some number of cores that may only
   read A.

2. Data-Value Invariant. The value of a memory location at the start of an
   epoch is the same as its value at the end of its last read-write epoch.

Show Figure 2.3 timeline
  read-write at core 1
  read-only at cores 2, 3
  read-write at core 2
  ...

Maintain the invariants
  * use blocks (e.g., 64B)
  * FSM at caches & LLC
  * communicate with messages / a bus

Goal:
  * make caches invisible, as in uniprocessors
  * once invisible, what does memory do? (can't refer to caches)

Specifying Coherence Protocols
------------------------------

FSMs communicating via messages.

Use a table:
  rows for stable and transient states
  columns for events -- core requests and incoming messages

VI protocol:
  I --> V on Own-Get data response
  V --> I on Own-Put or Other-Get

Go over Table 6.2
  Transient state IV[D]

Note: P1's cache has a virtual FSM per block.
      Pi's cache has the same FSM (but it may be in a different state).
      Memory (LLC) also has a virtual FSM per block (different from the
      cache FSM) -- see Table 6.3.

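To make the per-block virtual FSM concrete, here is a minimal C++ sketch of the
cache-side VI controller. It is a sketch under the assumptions above (one virtual
FSM per block, atomic transactions); the event names, the stall in IV[D], and the
main() walkthrough are illustrative choices, not the Primer's exact Table 6.2.

  // Minimal sketch of the cache-side VI controller, one virtual FSM per block.
  #include <cassert>

  enum class State { I, IV_D, V };          // stable: I, V;  transient: IV[D]
  enum class Event { CoreLoadOrStore,       // core issues a load or store
                     OwnDataResp,           // data response to our own Get
                     OtherGet,              // another core's Get observed
                     OwnPut };              // we evict / put the block

  State next_state(State s, Event e) {
    switch (s) {
      case State::I:
        if (e == Event::CoreLoadOrStore) return State::IV_D;  // issue Get, await data
        return s;
      case State::IV_D:                                       // transient: waiting for data
        if (e == Event::OwnDataResp) return State::V;         // I --> V on Own-Get data response
        return s;                                             // (further core requests stall)
      case State::V:
        if (e == Event::OtherGet || e == Event::OwnPut) return State::I;  // V --> I
        return s;                                             // loads/stores hit in V
    }
    return s;
  }

  int main() {                              // walk one block through its life
    State s = State::I;
    s = next_state(s, Event::CoreLoadOrStore); assert(s == State::IV_D);
    s = next_state(s, Event::OwnDataResp);     assert(s == State::V);
    s = next_state(s, Event::OtherGet);        assert(s == State::I);
  }
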
MOESI
-----

                 Validity   Dirtiness   Exclusivity   Owned
  Modified          X           X            X          X
  (Owned)           X           X                       X
  (Exclusive)       X                        X
  Shared            X
  Invalid

Stable states are stored in the cache, e.g., ceiling(log2(5)) = 3 bits
Transient states are held in MSHRs

Common transactions: GetS, GetM, Upgrade, (PutS), (PutE), PutO, PutM

Common core requests: load, store, RMW, i-fetch, RO-prefetch, RW-prefetch,
replace

Protocol Taxonomy (simplified)
  * Snooping: totally-ordered broadcast (Chapter 7)
  * Directory: point-to-point messages with a level of indirection (Chapter 8)

Write Invalidate vs. Write Update
  * we assume write invalidate, and it is more common
  * write update -- hard to implement memory consistency models, and too much traffic

=====================
Snooping (Chapter 7)
=====================

Simple
------

* Atomic Requests (a request is ordered the same cycle it is issued)
* Atomic Transactions (no other request to the SAME block until the transaction is done)

Show
  * cache ($) FSM -- Figure 7.1
  * memory FSM    -- Figure 7.1
  * system

Go over the FSMs in Figures 7.5 and 7.6
  shaded -- not possible
  blank  -- no action

Mem: IorS[D] -- memory waiting for a writeback as part of a core doing an
M-to-S transition

Store in S: issue GetM ==> SM[D] -- data is sent redundantly (the cache
already has it) -- could have an Upgrade transaction instead

Figures 7.8-7.9 -- Non-Atomic Requests (e.g., queue to get on the bus),
Atomic Transactions

Store in I: send GetM ==> IM[AD], see own GetM (the ordering point), could
"do" the store ==> IM[D], get data ==> M, finish the store

Consider the "window of vulnerability"

Store in S: send GetM/Upgrade ==> SM[AD], see OTHER GetM so invalidate
==> IM[AD], own GetM ==> IM[D], get data ==> M, finish the store
  Makes an Upgrade transaction trickier

Normal writeback
  writeback in M: send PutM ==> MI[A], see own PutM, send data ==> I

Writeback racing another GetM
  writeback in M: send PutM ==> MI[A], see other GetM, send data ==> II[A],
  see own PutM ==> I

Exclusive (E) State
-------------------

* Idea: on GetS, if there are no other sharers, go to E instead of S
* If a subsequent store, silently go to M
* If another core's GetS, silently go to S
* If replaced while in E, treat like S (silent replacement)

Important for (mostly) private data
  * otherwise a read miss (GetS) and then a write miss (GetM)
  * with E, a read miss (GetS) and then a silent upgrade to M

How to implement "if no other sharers"?
  * add state to the LLC recording whether there is (was) at least one sharer
  * before LLCs, often a logical wired-OR of sharers -- the "shared" line
    [before LLCs, memory also often learned whether there was an M copy via a
    wired-OR -- the "owned" line]

Look at the FSMs in Figures 7.4 and 7.5

Owned (O) State
---------------

Advantages
  * Otherwise, a cache in M that sees another core's GetS must send data to
    BOTH the requestor and the LLC
  * The O state eliminates the extra data message and LLC update
  * Historically, O also allowed a subsequent GetS to be sourced from a cache
    -- which could be faster than memory (in the old days), but probably not
    faster than an LLC

See the FSMs in Figures 7.6 and 7.7

Non-Atomic Bus -- pipelined or out of order
-------------------------------------------

See Figure 7.11

NASTY RACES!

Store in I: send GetM ==> IM[AD], see OTHER GetM (do nothing -- that core is
"before" you) ==> IM[AD], own GetM ==> IM[D], OTHER GetM (must promise to
forward data you don't have yet) ==> IM[D]I, then get data (see footnote),
do the store, send data to the other core ==> I

FOOTNOTE:
  * If you send the data without performing your instruction, you could livelock
  * If you always hold the data until you can perform an instruction, you could deadlock
  * Perform one instruction if and only if it is the oldest

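The same race can be read as a transition function. Below is a C++ sketch of the
requester-side states for a store miss on a non-atomic bus, following the
walkthrough above; the state and event names mirror the notes (IM[AD], IM[D],
IM[D]I), while queuing, stalling, and the actual data forwarding are assumed away.

  #include <cassert>

  enum class State { I, IM_AD, IM_D, IM_D_I, M };   // IM_D_I: IM[D] plus a deferred
                                                    // obligation to forward data and go to I
  enum class Event { CoreStore, OwnGetM, OtherGetM, DataResp };

  State next_state(State s, Event e) {
    switch (s) {
      case State::I:
        if (e == Event::CoreStore) return State::IM_AD;   // issue GetM; window of vulnerability opens
        return s;
      case State::IM_AD:                                  // waiting for own GetM AND data
        if (e == Event::OwnGetM)   return State::IM_D;    // own GetM on the bus is the ordering point
        if (e == Event::OtherGetM) return s;              // that core is ordered "before" us: ignore
        return s;
      case State::IM_D:                                   // ordered, still waiting for data
        if (e == Event::OtherGetM) return State::IM_D_I;  // promise to forward data we don't yet have
        if (e == Event::DataResp)  return State::M;       // data arrives: perform the store, done
        return s;
      case State::IM_D_I:
        if (e == Event::DataResp)  return State::I;       // perform our one (oldest) store, forward
        return s;                                         // the data, then fall to I (see footnote)
      case State::M:
        return s;
    }
    return s;
  }

  int main() {                                            // the race from the notes
    State s = State::I;
    s = next_state(s, Event::CoreStore); assert(s == State::IM_AD);
    s = next_state(s, Event::OwnGetM);   assert(s == State::IM_D);
    s = next_state(s, Event::OtherGetM); assert(s == State::IM_D_I);
    s = next_state(s, Event::DataResp);  assert(s == State::I);
  }
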
Qualitative Sharing Patterns [Weber & Gupta, ASPLOS-III]
--------------------------------------------------------

* Read-Only
* Migratory Objects
  - manipulated by one processor at a time
  - often protected by a lock
  - usually a write causes only a single invalidation
* Synchronization Objects
  - often more processors imply more invalidations
* Mostly Read
  - more processors imply more invalidations, but writes are rare
* Frequently Read/Written
  - more processors imply more invalidations

--------------------
Starfire: Extending the SMP Envelope
Alan Charlesworth, Sun Microsystems
IEEE Micro, Jan/Feb 1998, pp. 39-49.
Notes by Mark D. Hill, 19 Feb 2001

Starfire == Sun Ultra Enterprise 10000

Apt title -- SMP but no physical bus

Same coherence protocol as the E6000, etc.: write-invalidate MOESI

Up to 64 processors, with 4 processors, I/O, and memory per board

4 address buses, interleaved on low-order block bits and implemented w/ ASICs

Data network is a 144-bit (16+2 byte) 16x16 crossbar

Many active parts on the centerplane (Figure 5) -- 130 ASICs in the interconnect!

Read miss takes 38 cycles (468 ns) (vs. 18 clocks / 216 ns for the E6000)

FOLLOW THE MISS OF TABLE 4 THROUGH FIGURE 4.

Domains are very important: multiple logical machines in one physical machine
(like mainframe partitions)
  In the global interconnect, each of the 16 boards has a 16-bit mask saying
  which boards are in the same partition (& whose snoops should continue)
  Lots of hardware redundancy

Why domains?
  * software migration
  * separating a test SMP
  * consolidating several SMPs
  * hard partitioning for cost-accounting, etc.

OMIT {
  The Sun Fireplane Interconnect
  Alan Charlesworth
  IEEE Micro, Jan-Feb 2002, pp. 36-45

  Level   Comment
  -----   -------
    0     1-2 CPUs: board w/ two USIII processors/memory (or two PCIs)
    1     up to 8 CPUs (workgroup servers)
    2     up to 24 CPUs
    3     up to 106 CPUs, via a directory protocol between snoop domains
}

------------------------
Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mike Chen, and
Kunle Olukotun, "The Stanford Hydra CMP", IEEE Micro, March-April 2000,
pp. 71-84.

With wire delay, more transistors, growing complexity, and limited ILP,
let's do a CMP instead?

Why not before?
  * not enough transistors -- Moore's Law provides
  * not enough parallel applications -- so thread-level speculation (TLS)

Base CMP
--------

  * 4 cores
  * write-through L1 caches
  * shared write-back L2 cache
  * read bus -- cache-block wide
  * write buses -- word wide
  * simple write-invalidate protocol

Buses use multiple segments with repeaters -- logical but not physical buses
(Probably not sequentially consistent -- processor consistent?)

Thread-Level Speculation (TLS)
------------------------------

Grew out of the Wisconsin Multiscalar project.

Think: core i executes loop iterations k, k+4, ...
  Also subroutines?  Fine-grain loops are a problem.
  Can do without TLS, but then worst-case synchronization.

Must maintain memory RAW, WAW, and WAR ordering (hazards).
Assume no dependences and have hardware detect violations.

Must:
  1. Forward data
  2. Handle RAW
  3. Discard speculative state
  4. Handle WAW
  5. Handle WAR

Show the above via example (with time vertical)

  Time    Iteration i    Iteration i+1
   |
   V      (1) st A       (1) ld A
          (2) ld B       (2) st B
                         (3) discard
          (4) st C       (4) st C
          (5) st D       (5) ld D

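For concreteness, here is a hypothetical C++ loop body (mine, not from the Hydra
paper) whose iterations i and i+1, run speculatively on two cores, exercise the
same cross-iteration accesses to A, B, C, and D as the timeline above. The helper
functions f, g, h are placeholders.

  #include <cstdio>

  static int A, B, C, D;                         // shared memory locations
  static int f(int i)        { return i; }       // placeholder computations
  static int g(int i, int b) { return i + b; }
  static int h(int i)        { return 2 * i; }

  void iteration(int i) {
    A = f(i);        // st A: iteration i+1 loads A, so this value must be forwarded;
                     //       if i+1 loaded A too early, that is a RAW violation and
                     //       i+1's speculative state is discarded so it can restart
    int b = B;       // ld B: iteration i+1 also stores B (WAR); the later iteration's
                     //       write stays buffered, so this earlier load never sees it
    C = g(i, b);     // st C: both iterations store C (WAW); committing write buffers
                     //       to the L2 in iteration order keeps the last write last
    D = h(i);        // st D: iteration i+1 loads D -- forwarded if available in time,
  }                  //       otherwise another hardware-detected RAW violation

  int main() {                                   // sequential reference execution
    iteration(0);
    iteration(1);
    std::printf("%d %d %d %d\n", A, B, C, D);
  }
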
Hill", TITLE = "Multiprocessors Should Support Simple Memory Consistency Models", JOURNAL = COMPUTER, YEAR = 1998, VOLUME = 31, NUMBER = "8", PAGES = "28-34", MONTH = "August", WHERE = "bound"} My current thinking [Pai, et al., An Evaluation of Memory Consistency Models ... ILP Processors" ASPLOS96] (in reader) [Hill, Multiprocessors Should Support Simple Memory Consistency Models, Computer, Aug 98] (my web page) [Gniady, Falsafi, & Vijaykumer, Is SC + ILP = RC?, ISCA99] (from Purdue) Most microprocessr will do most of the following Coherence caches Non-binding prefetching Multithreading Speculation The combination of these can make the RC/PC/SC performance differences smaller Go through example Results: RC/PC/SC can do all use the same techniques, but: SC may commit later and exhaust implementation resources SC can have more mis-speculations Quantitatively Using Sarita's ASPLOS numbers, SCimpl to PCimpl reduce execution time 10% SCimpl to RCimpl reduce execution time 16% But These are scientific programs. Database? OSs? We have only begun to fight large caches, active windows, etc. new techniques, see "SC + ILP = RC?" Thus Have hardware implement SC Speculative execution closes performance gap Get this complexity off SW/HW interface so middleware authors can concentrate on their other jobs HW designers get a simple, formal correctness criterion. Reviews * Asum: common? x86, TSO, powerPC (alpha, RMO), (SC for SGI/HP) * David: 20% large? * Brian: Multicore and simpler cores? * Daniel: Gap today? multithreaded writebuffers * Shijin: Why does spec exec reduce SC/RC gap? * Marc: Advocates RC * Andy: W-to-R order models? * Tony & Guoliang: 10-year predictions