--------------------------------------------------------------------
CS 758 Programming Multicore Processors
Fall 2013   Section 1
Instructor David A. Wood
Derived from Mark D. Hill's notes
--------------------------------------------------------------------

------------ Synchronization, cont., and TBB ------------

OUTLINE

* Whither HW1?
* Hardware Atomics
* ABA -- MESSED UP
* Implementing Locks
* Implementing Barriers

See Scott Synthesis Lecture 2012.

------------------------------

HW1?

------------------------------

Hardware Synchronization Primitives

Atomicity
  All or nothing

Loads and stores
  What atomicity guarantees does hardware provide?
  Bit? Byte? 16-bit? 32-bit? 64-bit? Cache line?
  What if bytes cross cache line boundaries?  Page boundaries?

test and set
  Boolean TAS(Boolean *a):
    atomic { t := *a; *a := true; return t }

swap
  word Swap(word *a, word w):
    atomic { t := *a; *a := w; return t }

fetch and increment
  int FAI(int *a):
    atomic { t := *a; *a := t + 1; return t }

fetch and add
  int FAA(int *a, int n):
    atomic { t := *a; *a := t + n; return t }

compare and swap
  Boolean CAS(word *a, word old, word new):
    atomic { t := (*a == old); if (t) *a := new; return t }

load linked / store conditional
  word LL(word *a):
    atomic { remember a; return *a }
  Boolean SC(word *a, word w):
    atomic { t := (a is remembered, and has not been evicted since LL)
             if (t) *a := w; return t }

ERROR IN LECTURE: In order to prevent the ABA problem, we need to
ensure that NOTHING read between (inclusively) the LL and the SC has
changed. This includes evictions as well as invalidations.

------------------------------

ABA Problem -- super subtle!!
Example: Linked list push() and pop()

  void push(node** top, node* new):
    node* old
    repeat
      old := *top
      new->next := old
    until CAS(top, old, new)

  node* pop(node** top):
    node* old, new
    repeat
      old := *top
      if old = null return null
      new := old->next
    until CAS(top, old, new)
    return old

Initial State:   Top --> A --> C --|

T0 calls pop(), reads top --> A and A.next --> C, then hangs
T1 calls pop(), removing A, then push(B) and push(A)

Intermediate state:   Top --> A --> B --> C --|

T0: performs CAS(A) on Top, which succeeds

Final (broken) state:

  Top --------> C --|
         B ----^

Called the ABA problem, because in this example Top points to A, then
B, then A again. The final state is broken as a result: T0 installs
its stale saved pointer (C), silently losing B.

How does LL/SC fix this?

How can we solve this using only CAS?
  Add a sequence number to the pointer (so-called counted pointer):
    ptr = <address, count>
  So T0 will be comparing for <A, n>, but find <A, n+3> (T1's pop and
  two pushes each bumped the count), so its CAS fails and it retries.

------------------------------

Implementing Locks (Chapter 4)

test-and-set()                      Figure 4.4
test-and-test-and-set()             Figure 4.5
test-and-test-and-set() + backoff

Ticket Lock -- handwave -- note FIFO   cf. Figure 4.7

Queued locks
  Hardware ==> QOLB
  Software ==> Anderson, Graunke and Thakkar
    Spin on local counter, not shared counter
    Exploit cache coherence
    Uses array, so false sharing

MCS Lock -- handwave -- cf. Figure 4.9
  Eliminates false sharing
  More overhead

Reactive locks
  Use a simple lock first to reduce overhead
  Then fall back on a more heavy-weight lock
  Use CAS or SWAP to control decision pointer

Nested Lock   Section 4.4
  (Double-checked locking? Text and Figure 4.15)

Asymmetric locking idea -- bus parking

------------------------------

Implementing Barriers (Chapter 5)

Centralized barrier   Section 5.1
Sense-reversing centralized barrier -- Figure 5.1
Combining tree barrier, Fetch_and_phi reductions v.
  parallel prefix
  hardware
  software trees (Yew)

Dissemination barrier -- Figure 5.3
Static tree barrier -- Figure 5.7

Which barrier?
* Centralized if small or skewed (few laggards)
* With broadcast coherence, static tree with global wakeup flag
* Dissemination barrier for best asymptote (with sufficient bandwidth)
* Experiment!

Fuzzy barrier idea

C++ Fences
-----------

Here's a different example that I think demonstrates correct usage better.

  #include <atomic>
  #include <iostream>
  #include <thread>

  std::atomic<bool> flag(false);
  int a;

  void func1()
  {
      a = 100;
      atomic_thread_fence(std::memory_order_release);
      flag.store(true, std::memory_order_relaxed);
  }

  void func2()
  {
      while (!flag.load(std::memory_order_relaxed))
          ;
      atomic_thread_fence(std::memory_order_acquire);
      std::cout << a << '\n';  // guaranteed to print 100
  }

  int main()
  {
      std::thread t1(func1);
      std::thread t2(func2);
      t1.join();
      t2.join();
  }

The load and store on the atomic flag do not synchronize with each
other, because they both use the relaxed memory ordering. Without the
fences this code would be a data race: we're performing conflicting
operations on a non-atomic object in different threads, and without
the fences and the synchronization they provide there would be no
happens-before relationship between the conflicting operations on a.
However, with the fences we do get synchronization, because we've
guaranteed that thread 2 will read the flag value written by thread 1
(we loop until we see that value), and since the atomic write is
sequenced after the release fence and the atomic read is sequenced
before the acquire fence, the fences synchronize. (See § 29.8/2 for
the specific requirements.) This synchronization means that anything
that happens-before the release fence happens-before anything that
happens-after the acquire fence. Therefore the non-atomic write to a
happens-before the non-atomic read of a.