--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Relaxed Consistency
------------

Outline
  * Motivation
  * Example Relaxed Model (XC)
  * Implementing XC
  * SC for DRF
  * Other Relaxed Concepts
  * High-Level Languages
  * (R10000)
  * (Hill 1998)


Relaxed Memory Models Motivation

  /* initially all 0 */
  P1                P2
  A = 1;            while (flag == 0); /* spin */
  B = 1;            r1 = A;
  flag = 1;         r2 = B;

  * Recall SC has
      Each processor generates a total order of its reads and writes
        L-->L, L-->S, S-->S, S-->L
      That are interleaved into a global total order

  * TSO removes S-->L

But many (most) orders are NOT necessary!

Why not just enforce the necessary orders?
  * coalescing write buffer
  * simpler core speculation (e.g., less power)
  * weaken coherence -- e.g., use block before getting all directory acks

Concepts also important for HLL


Example Relaxed Model (XC)

Key idea is FENCE:

A FENCE on Pi does not allow Pi's memory references to be reordered across it:
  L-->F   S-->F   F-->F   F-->L   F-->S

  /* initially all 0 */
  P1                P2
  A = 1;            while (flag == 0); /* spin */
  B = 1;            FENCE
  FENCE             r1 = A;
  flag = 1;         r2 = B;

Also, XC orders accesses to the SAME address A, like TSO:
  L(A)-->L(A)   L(A)-->S(A)   S(A)-->S(A)

Ensures:
  A = 1;            while (flag == 0); /* spin */
  A = 2;            FENCE
  FENCE             r1 = A; /* r1==2, not 1 */
  flag = 1;

In summary, XC from the book:

                          Operation 2
  Op 1        Load   Store   RMW   FENCE   F+RMW+F
  Load         A      A       A      X        X
  Store        B      A       A      X        X
  RMW          A      A       A      X        X
  FENCE        X      X       X      X        X
  F+RMW+F      X      X       X      X        X

  X == Op1

Op1

CAUSALITY

Want 4Ps:
  * Programmability: EZ to program -- DRF yes; TO close
  * Performance: could be better than TSO
  * Portable -- DRF yes; otherwise hard
  * Precision -- formal definition obtuse

Show Venn diagrams -- Figure 4.5
  * executions
  * implementations
  * SC, TSO, then PPC and ARM


HLL -- Figure 5.6

        C++             Java
  (a)   ---             ----
        c/r             compil/runtime
        binary
  (b)   ---------------------------
        HW system

We have been talking about (b)
Most programmers care about (a) and down

Java Memory Model - Manson, Pugh, & Adve

Show picture of HLL memory model

Just as a HW implementation can reorder, so can the compiler

Example 1
  if (ready==0) {}
  r1 = data           ;; reorder?

Example 2
  data = r2           ;; reorder
  ready = 1

  if (ready==0) {}
  r1 = data

Register allocation, constant propagation, hoisting from loop, ...

Idea:
  (1) Make programmers identify synchronization.
  (2) Tell programmers to make their programs data-race-free (DRF).
  (3) If DRF,
      (a) Programmer can reason with SC
      (b) Compiler can use many optimizations between synchronization
          operations
      (c) Depending on the HW memory consistency model, compiler can
          insert fences
      All of the above due to Adve & Hill, ISCA 1990.
      The new C++ memory model does this as well.
  (4) What if the program is not DRF, by error or malicious intent?
      Java's security model requires that we say what can happen.
      Want (i) simplicity (ii) most compiler optimizations from 3(b) allowed.
      Got (ii) but not (i).


MIPS R10000 Speculative Cores

  Address Queue
  Coherent memory system

1. Core puts speculative loads and stores in address queue (AQ)
   in program order

2. Speculative loads obtain value from last store to the same address
   in AQ or from the coherent memory system

3. Incoming invalidate requests that seek to zap a block that is
   the target of a pending load in AQ cause a mis-speculation
   to ensure incorrect load values are discarded

   Although the address queue executes load and store instructions
   out of their original order, it maintains sequential memory consistency.
   The external interface could violate this consistency, however, by
   invalidating a cache line after it was used to load a register, but
   before that load instruction graduates. In this case, the queue creates
   a soft exception on the load instruction. This exception flushes the
   pipeline and aborts that load and all later instructions, so the
   processor does not use the stale data.
   --Yeager p35, left column, new paragraph 5:

4. Speculative stores write value into address queue

5. Instruction commit in program order
   * Loads just removed from address queue
   * Stores obtain M copy and then write into cache


--------------------------------NOT USED------------------------------

@ARTICLE{hill:simple,
  AUTHOR = "Mark D. Hill",
  TITLE = "Multiprocessors Should Support Simple Memory Consistency Models",
  JOURNAL = COMPUTER,
  YEAR = 1998,
  VOLUME = 31,
  NUMBER = "8",
  PAGES = "28-34",
  MONTH = "August",
  WHERE = "bound"}

Most microprocessors will do most of the following
  Coherent caches
  Non-binding prefetching
  Multithreading
  Speculation

The combination of these can make the RC/PC/SC performance differences smaller

Go through example

Results:
  RC/PC/SC can all use the same techniques, but:
    SC may commit later and exhaust implementation resources
    SC can have more mis-speculations

Quantitatively
  Using Sarita's ASPLOS 1996 numbers,
    SCimpl to PCimpl reduces execution time 10%
    SCimpl to RCimpl reduces execution time 16%

Thus
  Have hardware implement SC
  Speculative execution closes performance gap
  Get this complexity off SW/HW interface so middleware authors can
    concentrate on their other jobs
  HW designers get a simple, formal correctness criterion.

But
  Whither simple cores?
  Whither power?
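

Sketch (not from the lecture): the A/B/flag example from the Motivation
and XC sections, written with C++11 atomics. Assumptions, all mine: the
function names p1, p2, and run_once are made up for illustration; flag is
made a std::atomic so the program is data-race-free; and the release and
acquire fences stand in for XC's FENCE. XC's FENCE is a full fence, so
std::atomic_thread_fence(std::memory_order_seq_cst) would be the closer
analogue, but the weaker pair should be enough for this particular pattern.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Shared variables, named after the notes' example. A and B are plain data;
// flag is the synchronization variable, so it is atomic.
int A = 0;
int B = 0;
std::atomic<int> flag{0};

// P1: write the data, FENCE, then set the flag.
void p1() {
    A = 1;
    B = 1;
    // Stands in for P1's FENCE: earlier stores may not move below it.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}

// P2: spin on the flag, FENCE, then read the data.
std::pair<int, int> p2() {
    while (flag.load(std::memory_order_relaxed) == 0) {
        /* spin */
    }
    // Stands in for P2's FENCE: later loads may not move above it.
    std::atomic_thread_fence(std::memory_order_acquire);
    int r1 = A;
    int r2 = B;
    return {r1, r2};
}

// Hypothetical driver: reset the variables, run P1 and P2 on two threads,
// and return (r1, r2) as observed by P2.
std::pair<int, int> run_once() {
    A = 0;
    B = 0;
    flag.store(0);
    std::pair<int, int> result{0, 0};
    std::thread t2([&result] { result = p2(); });
    std::thread t1(p1);
    t1.join();
    t2.join();
    return result;
}
```

Calling run_once() from main should return r1 == 1 and r2 == 1 on every
run, because the two fences together with the flag store/load order P1's
data writes before P2's data reads. Dropping either fence makes the
accesses to A and B a data race, which is the case item (4) of the Idea
list worries about.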