--------------------------------------------------------------------
CS 758 Programming Multicore Processors
Fall 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Synchronization, etc.
------------

OUTLINE REDO

* Atomicity & Condition Synchronization
* Safety and Liveness
* Memory Consistency

Michael Scott: Shared-Memory Synchronization Synthesis Lecture
Chapter 1-3


Atomicity & Condition Synchronization

Atomicity

  thread 1:          thread 2:
    ctr++              ctr++

  thread 1:          thread 2:
    1: r := ctr        1: r := ctr
    2: inc r           2: inc r
    3: ctr := r        3: ctr := r

  C(6,3) = 20 interleavings
  Most (18 of 20) don't increment the counter by 2; only the two
  serial orders do.

  lock L

  thread 1:          thread 2:
    L.acquire()        L.acquire()
    ctr++              ctr++
    L.release()        L.release()

What if many counters? (think: hash table or tree)

Coarse-grain locking

  lock L

  thread 1:          thread 2:
    L.acquire()        L.acquire()
    ctr[i]++           ctr[j]++
    L.release()        L.release()

  But L only needed if i==j.

Fine-grain locking

  lock L[n]

  thread 1:          thread 2:
    L[i].acquire()     L[j].acquire()
    ctr[i]++           ctr[j]++
    L[i].release()     L[j].release()

  More parallelism, but risk of deadlock

  move(n, i, j):
    L[i].acquire()
    L[j].acquire()     // (there's a bug here)
    acct[i] -= n
    acct[j] += n
    L[i].release()
    L[j].release()

  thread 1:          thread 2:
    move(100,2,3)      move(50,3,2)

Spinning vs. Blocking

  while (!condition) {}   // spin: do nothing

Condition Synchronization

  Not any order, but some specific order

  Q.remove():                            Q.insert(d):
    atomic                                 atomic
      await !Q.empty()                       await !Q.full()
      // return data from next full slot     // put d in next empty slot

  Point-to-Point : flag
  All-to-All     : barrier


Safety and Liveness

Safety means that bad things never happen. E.g., we never have two
threads in a critical section for the same lock at the same time.

Liveness means that good things eventually happen. E.g., if lock L is
free and at least one thread is waiting for it, some thread eventually
acquires it.
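The bug flagged in move() above is a safety problem: thread 1 can hold L[2] while waiting for L[3], and thread 2 can hold L[3] while waiting for L[2]. One common fix is to always acquire the two locks in index order. The following is a minimal C++ sketch of that idea, not the course's own code; the names kAccounts, move_funds, and run_transfers, the table size, and the starting balances are assumptions made for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

constexpr int kAccounts = 4;                 // hypothetical table size
static std::mutex L[kAccounts];              // one lock per account, as in the notes
static long acct[kAccounts] = {1000, 1000, 1000, 1000};  // hypothetical balances

// Transfer n from account i to account j.
// Locks are always taken in increasing index order, so two threads can
// never each hold the lock the other one is waiting for.
void move_funds(long n, int i, int j) {
    assert(0 <= i && i < kAccounts && 0 <= j && j < kAccounts);
    if (i == j) return;                      // avoids locking the same mutex twice
    int lo = i < j ? i : j;
    int hi = i < j ? j : i;
    std::lock_guard<std::mutex> first(L[lo]);
    std::lock_guard<std::mutex> second(L[hi]);
    acct[i] -= n;
    acct[j] += n;
}

// Run the two conflicting transfers from the notes concurrently, many times.
void run_transfers(int rounds) {
    std::thread t1([rounds] { for (int k = 0; k < rounds; ++k) move_funds(100, 2, 3); });
    std::thread t2([rounds] { for (int k = 0; k < rounds; ++k) move_funds(50, 3, 2); });
    t1.join();
    t2.join();
}
```

Calling run_transfers(1000) should finish without hanging and leave acct[2] + acct[3] unchanged. Swapping the lock order back to "L[i] then L[j]" reintroduces the possibility of deadlock.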
For predicates P and Q on reachable system states S and T
(T a later state in the same execution),

  Safety:   FOR-ALL S [P(S)]
  Liveness: FOR-ALL S [P(S) --> THERE-EXISTS T [Q(T)]]

Safety (3.1)
------------

DEADLOCK FREEDOM

As noted in Section 1.4, deadlock freedom is a safety property.
Deadlock requires four simultaneous conditions:

  exclusive use  -- threads require access to some sort of
                    non-sharable "resources"
  hold and wait  -- threads wait for unavailable resources while
                    continuing to hold resources they have already
                    acquired
  irrevocability -- resources cannot be forcibly taken from threads
                    that hold them
  circularity    -- there exists a circular chain of threads in which
                    each is holding a resource needed by the next

Most common approach -- break the circularity condition by acquiring
locks in a static order

Emerging, e.g., transactional memory -- break the irrevocability
condition

Liveness (3.2)
--------------

Liveness means that good things eventually happen.

A method is said to be wait free (the strongest variant of nonblocking
progress) if it is guaranteed to complete in some bounded number of
its own program steps. (This bound need not be statically known.)

A method M is said to be lock free (a somewhat weaker variant) if SOME
thread is guaranteed to make progress (complete an operation on the
same object) in some bounded number of M's program steps.

A method is said to be obstruction free (the weakest variant of
nonblocking progress) if it is guaranteed to complete in some bounded
number of program steps if no other thread executes any steps during
that same interval.

State-of-the-Art

Much theoretical work on guaranteed liveness, and some gurus apply it
in practice. On the other hand, many systems engineer liveness
solutions that work in practice -- e.g., exponential backoff.


------------------------------
Memory Consistency (too much)
------------------------------

Coherence's Goal:
* Make caches invisible, as in uniprocessors
* Once invisible, what does memory do?
  (can't refer to caches)

Example

  /* initial A = B = flag = 0 */

  P1                 P2
  A = 1;             while (flag == 0); /* spin */
  B = 1;             print A;
  flag = 1;          print B;

  Intuition says printed A = B = 1

(OMIT) Coherence doesn't say anything, why?

(OMIT) Consider coalescing write buffer

  /* initial A = B = 0 */

  P1                 P2
  A = 1;             B = 1;
  r1 = B;            r2 = A;

  Three outcomes, but not the fourth (r1 == r2 == 0).

(OMIT) How to screw up?
* write buffer
* out-of-order (OoO) loads
* directory protocol that doesn't wait for acks

Define Memory Consistency
* What SW can expect of HW
* What HW must provide to SW

Want 4Ps: (DID NOT COVER)
* Programmability: EZ to program
* Performance: facilitates good performance (or low cost)
* Portability
* Precision

Sequential Consistency

A multiprocessor is sequentially consistent if the result of any
execution is the same as if the operations of all the processors were
executed in some sequential order, and the operations of each
individual processor appear in this sequence in the order specified by
its program.

Show two processors' program order (po) and global memory order (mo)
Show "railroad switch" picture
Same as multi-threaded uniprocessor

Total Store Ordering (TSO) -- x86 and SPARC
-------------------------------------------

  /* initial A = B = 0 */

  P1                 P2
  A = 1;             B = 1;
  r1 = B;            r2 = A;

  Allows r1 == r2 == 0? Yes. (Why? Write buffers.)

Relaxed (Weak) Ordering -- ARM, IBM Power
-----------------------------------------

  /* initially all 0 */

  P1                 P2
  A = 1;             while (flag == 0); /* spin */
  B = 1;             r1 = A;
  flag = 1;          r2 = B;

  But many (most) orders are NOT necessary!
  E.g., "A = 1" and "B = 1"

  Why not just enforce the necessary orders?

Relaxed models
* Unordered by default
* Use FENCEs to get the necessary order

  /* initially all 0 */

  P1                 P2
  A = 1;             while (flag == 0); /* spin */
                     FENCE
  B = 1;             r1 = A;
  FENCE
  flag = 1;          r2 = B;

---- STOPPED HERE

Sequential Consistency for Data-Race-Free (SC for DRF) Programs

Have your cake and eat it too.
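As one concrete reading of this idea, C++11 and later promise sequentially consistent behavior to programs that contain no data races. The sketch below revisits the earlier flag example in C++ (a choice made here; the notes use pseudocode). Declaring flag as std::atomic makes it the synchronization variable, so the plain accesses to A and B no longer race and the reader is guaranteed to see A = B = 1. The function name run_flag_example is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Run the flag example once and return (r1, r2) as seen by the reader.
std::pair<int, int> run_flag_example() {
    int A = 0, B = 0;                    // ordinary (non-atomic) data
    std::atomic<int> flag{0};            // synchronization variable
    int r1 = -1, r2 = -1;

    std::thread p1([&] {
        A = 1;
        B = 1;
        flag.store(1);                   // default ordering is seq_cst
    });
    std::thread p2([&] {
        while (flag.load() == 0) {       // spin until the writer signals
            std::this_thread::yield();
        }
        r1 = A;                          // ordered after the writes by the flag
        r2 = B;
    });
    p1.join();
    p2.join();
    assert(flag.load() == 1);            // both threads finished, so flag is set
    return {r1, r2};
}
```

The programmer writes no explicit FENCE; the compiler and hardware are free to reorder "A = 1" and "B = 1", and the program cannot tell.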
  P1                       P2
  FENCE lock(L) FENCE
  A = 1                    r2 = B
  B = 1                    r1 = A
  FENCE unlock(L) FENCE

  All four outcomes possible
  (r1,r2) = (0,0), (0,1), (1,0), (1,1)

But if both use locks then two outcomes
(r1,r2) = (0,0), (1,1)

  P1                       P2
                           FENCE lock(L) FENCE
                           r2 = B
                           r1 = A
                           FENCE unlock(L) FENCE
  FENCE lock(L) FENCE
  A = 1
  B = 1
  FENCE unlock(L) FENCE

Or

  P1                       P2
  FENCE lock(L) FENCE
  A = 1
  B = 1
  FENCE unlock(L) FENCE
                           FENCE lock(L) FENCE
                           r2 = B
                           r1 = A
                           FENCE unlock(L) FENCE

But can't "see" intra-critical-section reordering
* Philosophy: If a tree falls in the woods, does it make a sound?
  Y or N but probably Y
* SC for DRF: If references reorder w/i a C.S., does anyone see?
  N for DRF

SC for DRF Implications
* Most programmers can reason with SC
* HW implementor can implement XC (a relaxed model)
* (Compiler/runtime can also reorder some)


Hardware Synchronization Primitives

(cover only test-and-set and compare-and-swap)

test and set
  Boolean TAS(Boolean *a):
    atomic { t := *a; *a := true; return t }

swap
  word Swap(word *a, word w):
    atomic { t := *a; *a := w; return t }

fetch and increment
  int FAI(int *a):
    atomic { t := *a; *a := t + 1; return t }

fetch and add
  int FAA(int *a, int n):
    atomic { t := *a; *a := t + n; return t }

compare and swap
  Boolean CAS(word *a, word old, word new):
    atomic { t := (*a == old); if (t) *a := new; return t }

load linked / store conditional
  word LL(word *a):
    atomic { remember a; return *a }
  Boolean SC(word *a, word w):
    atomic { t := (a is remembered, and has not been evicted since LL);
             if (t) *a := w; return t }

ABA Problem -- super subtle!!
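The two primitives the lecture covers can be tried out with C++ std::atomic, sketched below under some assumptions: exchange(true) stands in for TAS, and compare_exchange_weak stands in for CAS (it may fail spuriously, hence the retry loop). The names TasLock, fai_with_cas, Totals, and count_with_threads are hypothetical and exist only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Spin lock built on test-and-set: exchange(true) returns the old value,
// so the loop exits only for the thread that saw "false" (lock was free).
class TasLock {
    std::atomic<bool> held{false};
public:
    void acquire() {
        while (held.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void release() { held.store(false, std::memory_order_release); }
};

// Fetch-and-increment built from compare-and-swap.
// On failure, compare_exchange_weak reloads t with the current value.
int fai_with_cas(std::atomic<int>& a) {
    int t = a.load();
    while (!a.compare_exchange_weak(t, t + 1)) {
        // retry with the refreshed t
    }
    return t;                            // value before the increment
}

struct Totals { int locked; int cas; };

// Increment one counter under the TAS lock and another via CAS.
Totals count_with_threads(int nthreads, int iters) {
    assert(nthreads > 0 && iters >= 0);
    TasLock lock;
    int locked_ctr = 0;                  // protected by lock
    std::atomic<int> cas_ctr{0};
    std::vector<std::thread> ts;
    for (int i = 0; i < nthreads; ++i) {
        ts.emplace_back([&] {
            for (int k = 0; k < iters; ++k) {
                lock.acquire();
                ++locked_ctr;
                lock.release();
                fai_with_cas(cas_ctr);
            }
        });
    }
    for (auto& t : ts) t.join();
    return {locked_ctr, cas_ctr.load()};
}
```

Both counters should end at nthreads * iters. A plain integer counter is indifferent to ABA, since only the value matters; the problem bites when the compared word is a pointer whose target may have been freed and reused in between.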