--------------------------------------------------------------------
CS 758 Programming Multicore Processors
Fall 2013 Section 1
Instructor David A. Wood
Derived from Mark D. Hill's notes
--------------------------------------------------------------------

------------
Synchronization, cont, and TBB
------------

OUTLINE
* Whither HW1?
* Hardware Atomics
* ABA -- MESSED UP
* Implementing Locks
* Implementing Barriers

See Scott Synthesis Lecture 2012.

------------------------------
HW1?
------------------------------
Hardware Synchronization Primitives

Atomicity
    All or nothing
    Loads and stores
    What atomicity guarantees does hardware provide?
        Bit? Byte? 16-bit? 32-bit? 64-bit? Cache line?
        What if bytes cross cache line boundaries? Page boundaries?

test and set
    Boolean TAS(Boolean *a):
        atomic { t := *a; *a := true; return t }

swap
    word Swap(word *a, word w):
        atomic { t := *a; *a := w; return t }

fetch and increment
    int FAI(int *a):
        atomic { t := *a; *a := t + 1; return t }

fetch and add
    int FAA(int *a, int n):
        atomic { t := *a; *a := t + n; return t }

compare and swap
    Boolean CAS(word *a, word old, word new):
        atomic { t := (*a == old); if (t) *a := new; return t }

load linked / store conditional
    word LL(word *a):
        atomic { remember a; return *a }
    Boolean SC(word *a, word w):
        atomic { t := (a is remembered, and has not been evicted since LL)
                 if (t) *a := w; return t }

ERROR IN LECTURE: To prevent the ABA problem, we need to ensure that
NOTHING read between the LL and the SC (inclusive) has changed. This
covers evictions as well as invalidations.

------------------------------
ABA Problem -- super subtle!!
Example: Linked list push() and pop()

void push(node** top, node* new):
    node* old
    repeat
        old := *top
        new->next := old
    until CAS(top, old, new)

node* pop(node** top):
    node* old, new
    repeat
        old := *top
        if old = null return null
        new := old->next
    until CAS(top, old, new)
    return old

Initial State:
    Top --> A --> C --|

T0 calls pop(), reads top --> A and A.next --> C, then hangs
T1 calls pop(), removing A, then push(B) and push(A)

Intermediate state:
    Top --> A --> B --> C --|

T0: performs CAS(A) on Top, which succeeds

Final (broken) state:
    Top --------> C --|
            B ----^

Called ABA problem, because in this example Top points to A, then B, then A again.
Final state is broken as a result.

How does LL/SC fix this?

How can we solve this using only CAS?
    Add a sequence number to the pointer (so called counted pointer)
    ptr =
So T0 will be comparing for , but find
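The counted-pointer idea can be sketched in C++ roughly as follows. This is a minimal illustration, not the lecture's code: it assumes a fixed pool of nodes addressed by 32-bit index, so that an index and a 32-bit sequence count pack into one 64-bit word that a single CAS can cover. The names CountedStack, kNil, kCapacity, pack, index_of, count_of, and raw_top are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical fixed pool: node indices stand in for pointers.
constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // plays the role of null
constexpr std::uint32_t kCapacity = 16;

struct Node {
    int value;
    std::atomic<std::uint32_t> next;  // index of the next node, or kNil
};

class CountedStack {
public:
    Node nodes[kCapacity];

    // Pack <index, count> into one word so a single CAS checks both.
    static std::uint64_t pack(std::uint32_t idx, std::uint32_t count) {
        return (static_cast<std::uint64_t>(count) << 32) | idx;
    }
    static std::uint32_t index_of(std::uint64_t p) {
        return static_cast<std::uint32_t>(p);
    }
    static std::uint32_t count_of(std::uint64_t p) {
        return static_cast<std::uint32_t>(p >> 32);
    }

    // Push the pool node with index idx.
    void push(std::uint32_t idx) {
        assert(idx < kCapacity);
        std::uint64_t old = top_.load();
        std::uint64_t desired;
        do {
            nodes[idx].next.store(index_of(old), std::memory_order_relaxed);
            desired = pack(idx, count_of(old) + 1);  // bump the count
        } while (!top_.compare_exchange_weak(old, desired));
    }

    // Pop the top node; returns its index, or kNil if the stack is empty.
    std::uint32_t pop() {
        std::uint64_t old = top_.load();
        std::uint64_t desired;
        do {
            if (index_of(old) == kNil) return kNil;
            std::uint32_t next =
                nodes[index_of(old)].next.load(std::memory_order_relaxed);
            desired = pack(next, count_of(old) + 1);  // bump the count
        } while (!top_.compare_exchange_weak(old, desired));
        return index_of(old);
    }

    // Snapshot of the packed <index, count> word, for inspection.
    std::uint64_t raw_top() const { return top_.load(); }

private:
    std::atomic<std::uint64_t> top_{pack(kNil, 0)};
};
```

Because every successful push or pop increments the count, a stale snapshot taken before the pop(A), push(B), push(A) sequence no longer matches Top even though the index is A again, so the delayed CAS fails and retries. A 32-bit count can wrap around, so this makes ABA very unlikely rather than impossible.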
------------------------------
Implementing Locks (Chapter 4)
test-and-set() Figure 4.4
test-and-test-and-set() Figure 4.5
test-and-test-and-set() + backoff
Ticket Lock -- handwave -- note FIFO c.f., Figure 4.7
Queued locks --
Hardware ==> QOLB
Software ==> Anderson, Graunke and Thakkar
Spin on local counter, not shared counter
Exploit cache coherence
Uses array, so false sharing
MCS Lock -- handwave -- c.f., Figure 4.9
Eliminates false sharing
More overhead
Reactive locks --
Use a simple lock first to reduce overhead
Then fall back on a more heavy-weight lock
Use CAS or SWAP to control decision pointer
Nested Lock Section 4.4
(Double-checked locking? Text and Figure 4.15)
Asymmetric locking idea -- bus parking
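Two of the simpler locks above can be sketched with std::atomic. This is an illustrative sketch under stated assumptions, not the textbook figures: the class names TTASBackoffLock and TicketLock, the kMaxDelay cap, and the use of std::this_thread::yield() as the backoff delay are all choices made here for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Test-and-test-and-set lock with capped exponential backoff.
class TTASBackoffLock {
public:
    void lock() {
        unsigned delay = 1;
        for (;;) {
            // "Test": spin on a plain load, which can hit in the local cache.
            while (held_.load(std::memory_order_relaxed))
                std::this_thread::yield();
            // "Test-and-set": one atomic exchange once the lock looks free.
            if (!held_.exchange(true, std::memory_order_acquire)) return;
            // Lost the race: back off, doubling the delay up to a cap.
            for (unsigned i = 0; i < delay; ++i) std::this_thread::yield();
            if (delay < kMaxDelay) delay *= 2;
        }
    }
    bool try_lock() {
        return !held_.load(std::memory_order_relaxed) &&
               !held_.exchange(true, std::memory_order_acquire);
    }
    void unlock() {
        assert(held_.load(std::memory_order_relaxed));  // must be held
        held_.store(false, std::memory_order_release);
    }

private:
    static constexpr unsigned kMaxDelay = 1024;  // assumed cap
    std::atomic<bool> held_{false};
};

// Ticket lock: fetch-and-increment hands out tickets, so service is FIFO.
class TicketLock {
public:
    void lock() {
        unsigned my =
            next_ticket_.fetch_add(1, std::memory_order_relaxed);
        while (now_serving_.load(std::memory_order_acquire) != my)
            std::this_thread::yield();
    }
    void unlock() {
        // Only the holder writes now_serving_, so load-then-store is safe.
        unsigned next = now_serving_.load(std::memory_order_relaxed) + 1;
        now_serving_.store(next, std::memory_order_release);
    }

private:
    std::atomic<unsigned> next_ticket_{0};
    std::atomic<unsigned> now_serving_{0};
};
```

Both classes provide lock() and unlock(), so they can be used with std::lock_guard. All ticket-lock waiters still spin on the single shared now_serving_ word, which is the traffic that the queued locks avoid by giving each waiter its own location to spin on.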
------------------------------
Implementing Barriers (Chapter 5)
Centralized barrier Section 5.1
Sense-reversing centralized barrier -- Figure 5.1
Combining trees
barrier, Fetch_and_phi
reductions v. parallel prefix
hardware
software trees (Yew)
Dissemination barrier -- Figure 5.3
Static tree barrier -- Figure 5.7
Which barrier?
* Centralized if small or skewed (few laggards)
* With broadcast coherence, static tree with global wakeup flag
* Dissemination barrier for great asymptote (with sufficient bandwidth)
* Experiment!
Fuzzy barrier idea
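A sense-reversing centralized barrier can be sketched as follows. This is a minimal illustration rather than a copy of Figure 5.1: the class name SenseBarrier, the await() signature, and the convention that each thread keeps its own local_sense flag (initially false) are assumptions made here.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SenseBarrier {
public:
    explicit SenseBarrier(int n) : n_(n), count_(n) { assert(n > 0); }

    // Each thread passes its own flag, initialized to false by the caller.
    void await(bool& local_sense) {
        local_sense = !local_sense;  // sense for this barrier episode
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last arriver: reset the count first, then release everyone
            // by flipping the shared sense.
            count_.store(n_, std::memory_order_relaxed);
            sense_.store(local_sense, std::memory_order_release);
        } else {
            // Everyone else spins until the shared sense matches theirs.
            while (sense_.load(std::memory_order_acquire) != local_sense)
                std::this_thread::yield();
        }
    }

private:
    const int n_;                     // number of participating threads
    std::atomic<int> count_;          // threads still to arrive
    std::atomic<bool> sense_{false};  // shared sense, flipped per episode
};
```

Reversing the sense each episode lets the same barrier object be reused immediately: a fast thread entering the next episode waits for the opposite value, so it cannot be confused by the flag left over from the previous one.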
C++ Fences
-----------
Here's a different example that I think demonstrates correct usage better.
#include