--------------------------------------------------------------------
CS 758 Programming Multicore Processors
Fall 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Synchronization, cont., and TBB
------------

OUTLINE
* Memory Consistency, cont.
* Synch primitives
* ABA, etc. -- MESSED UP
* TBB -- DID TBB FIRST

------------------------------

Memory Consistency, cont.
-------------------

Review Sequential Consistency

    A multiprocessor is sequentially consistent if the result of any
    execution is the same as if the operations of all the processors
    were executed in some sequential order, and the operations of each
    individual processor appear in this sequence in the order specified
    by its program.

    Show two processors: program order (po) and global memory order (mo)
    Show "railroad switch" picture
    Same as multi-threaded uniprocessor

Sequential Consistency for Data-Race-Free (SC for DRF) Programs

    Have your cake and eat it too.

    Only the writer takes the lock:

        P1                          P2
        FENCE lock(L) FENCE
        A = 1                       r2 = B
        B = 1                       r1 = A
        FENCE unlock(L) FENCE

    All four outcomes possible:
        (r1,r2) = (0,0), (0,1), (1,0), (1,1)

    But if both use locks, then only two outcomes:
        (r1,r2) = (0,0), (1,1)

    Either the reader's critical section runs first:

        FENCE lock(L) FENCE
        r2 = B
        r1 = A
        FENCE unlock(L) FENCE
        FENCE lock(L) FENCE
        A = 1
        B = 1
        FENCE unlock(L) FENCE

    Or the writer's critical section runs first:

        FENCE lock(L) FENCE
        A = 1
        B = 1
        FENCE unlock(L) FENCE
        FENCE lock(L) FENCE
        r2 = B
        r1 = A
        FENCE unlock(L) FENCE

    But can't "see" intra-critical-section reordering
    * Philosophy: If a tree falls in the woods, does it make a sound?
      Y or N but probably Y
    * SC for DRF: If references reorder w/i a C.S., does anyone see?
      N for DRF

SC for DRF Implications
* Most programmers can reason with SC
* HW implementor can implement XC (a relaxed model)
* (Compiler/runtime can also reorder some)

Hardware Synchronization Primitives
-------------------
(cover only test-and-set and compare-and-swap)

test and set
    Boolean TAS(Boolean *a):
        atomic { t := *a; *a := true; return t }

swap
    word Swap(word *a, word w):
        atomic { t := *a; *a := w; return t }

fetch and increment
    int FAI(int *a):
        atomic { t := *a; *a := t + 1; return t }

fetch and add
    int FAA(int *a, int n):
        atomic { t := *a; *a := t + n; return t }

compare and swap
    Boolean CAS(word *a, word old, word new):
        atomic { t := (*a == old); if (t) *a := new; return t }

load linked / store conditional
    word LL(word *a):
        atomic { remember a; return *a }
    Boolean SC(word *a, word w):
        atomic { t := (a is remembered, and has not been evicted since LL);
                 if (t) *a := w; return t }

Other
    ABA Problem -- super subtle!!
    SC, Linearizability, Serializability, & Strict Serializability

Intel® Threading Building Blocks Tutorial
-------------------
Document Number 319872-010US
http://pages.cs.wisc.edu/~markhill/restricted/intel12_TBBtutorial.pdf

Background
* C++ library, not language, so often bizarre syntax
* Descended from OpenMP, Std Template Library, STAPL, recently CILK
* "Just Say, 'Yes'" Software. (parallel)_for all X, should we add X? Yes.
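
Why a library forces the odd syntax: with no language support, a parallel loop can only be a function that takes the loop body as a callable object plus a range to apply it to. The sketch below imitates that shape with plain C++ threads. It is only an illustration under stated assumptions: `Range`, `toy_parallel_for`, and `doubled` are hypothetical names, not TBB types or functions, and one thread per chunk is a simplification rather than how TBB schedules work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical half-open index range [begin, end); not a TBB type.
struct Range {
    std::size_t begin_, end_;
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
};

// Hypothetical helper: split [0, n) into at most `chunks` pieces and run
// body(Range) on each piece in its own thread (assumption: one thread per
// chunk is acceptable for illustration).
template <typename Body>
void toy_parallel_for(std::size_t n, std::size_t chunks, const Body& body) {
    assert(chunks > 0 && "need at least one chunk");
    if (n == 0) return;
    chunks = std::min(chunks, n);
    const std::size_t step = (n + chunks - 1) / chunks;  // ceiling division
    std::vector<std::thread> workers;
    for (std::size_t lo = 0; lo < n; lo += step) {
        const std::size_t hi = std::min(lo + step, n);
        workers.emplace_back([&body, lo, hi] { body(Range{lo, hi}); });
    }
    for (auto& t : workers) t.join();  // body outlives all workers
}

// Usage: the loop body is a lambda over a sub-range, and iterations are
// independent, so no synchronization is needed inside the body.
std::vector<float> doubled(std::vector<float> v) {
    float* a = v.data();
    toy_parallel_for(v.size(), 4, [=](const Range& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) a[i] *= 2.0f;
    });
    return v;
}
```

The body sees only a sub-range, never the whole loop, which is why the loop bounds come from the range object rather than from `0` and `n`.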
Chapter 3 "Simple" Loops
========================

serial for p.12
-----------
    void SerialApplyFoo( float a[], size_t n ) {
        for( size_t i=0; i!=n; ++i )
            Foo(a[i]);
    }

parallel for
------------
    #include "tbb/tbb.h"
    using namespace tbb;

    // Body object: applies Foo to every element of one sub-range.
    class ApplyFoo {
        float *const my_a;
    public:
        void operator()( const blocked_range<size_t>& r ) const {
            float *a = my_a;
            for( size_t i=r.begin(); i!=r.end(); ++i )
                Foo(a[i]);
        }
        ApplyFoo( float a[] ) :
            my_a(a)
        {}
    };

    // Invoke the body over [0,n); the library chooses the sub-ranges.
    void ParallelApplyFoo( float a[], size_t n ) {
        parallel_for( blocked_range<size_t>(0,n), ApplyFoo(a) );
    }

parallel for w lambda function p.14
------------------------------
    using namespace tbb;
    #pragma warning( disable: 588)

    void ParallelApplyFoo( float* a, size_t n ) {
        parallel_for( blocked_range<size_t>(0,n),
            [=](const blocked_range<size_t>& r) {
                for(size_t i=r.begin(); i!=r.end(); ++i)
                    Foo(a[i]);
            }
        );
    }

Chunking: Automatic or Controlled p.15

parallel_reduce p.21
------------
Sequential:

    float SerialSumFoo( float a[], size_t n ) {
        float sum = 0;
        for( size_t i=0; i!=n; ++i )
            sum += Foo(a[i]);
        return sum;
    }

For independent iterations:

    float ParallelSumFoo( float a[], size_t n ) {
        SumFoo sf(a);
        parallel_reduce( blocked_range<size_t>(0,n), sf );
        return sf.my_sum;
    }

Plus:

    class SumFoo {
        float* my_a;
    public:
        float my_sum;
        // Accumulate one sub-range into this body's running sum.
        void operator()( const blocked_range<size_t>& r ) {
            float *a = my_a;
            float sum = my_sum;
            size_t end = r.end();
            for( size_t i=r.begin(); i!=end; ++i )
                sum += Foo(a[i]);
            my_sum = sum;
        }
        // Splitting constructor: a fresh body starts from a zero sum.
        SumFoo( SumFoo& x, split ) : my_a(x.my_a), my_sum(0) {}
        // Merge the partial sum of another body into this one.
        void join( const SumFoo& y ) {my_sum+=y.my_sum;}
        SumFoo( float a[] ) :
            my_a(a), my_sum(0)
        {}
    };

Chapter 6 Containers p.41
    concurrent_hash_map
    concurrent_vector
    concurrent queues

Chapter 7 Mutual Exclusion p.50
    Locks etc.

Chapter 8 Atomic operations p.56

Chapter 11 The Task Scheduler p.65
    Actually CILK -- discussed later
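
Library-level atomic operations play the role of the hardware primitives from the synchronization part of these notes. As a rough sketch, here is a test-and-set spin lock and a fetch-and-increment built from a compare-and-swap retry loop. It assumes standard C++ `std::atomic` rather than TBB's own atomic type, and `TasLock`, `fai_via_cas`, `Counts`, and `run_counters` are hypothetical names chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-set spin lock: exchange(true) plays the role of TAS.
class TasLock {
    std::atomic<bool> held{false};
public:
    void lock() {
        // Spin until TAS returns false, i.e. this thread flipped false -> true.
        while (held.exchange(true, std::memory_order_acquire)) { }
    }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Fetch-and-increment built from compare-and-swap: retry until no other
// thread changed the value between our read and our write.
int fai_via_cas(std::atomic<int>& a) {
    int old = a.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `old` with the current value.
    while (!a.compare_exchange_weak(old, old + 1)) { }
    return old;
}

// Hypothetical driver: each worker increments a plain int under the lock
// and an atomic int through the CAS loop.
struct Counts { int locked; int cas; };

Counts run_counters(int threads, int iters) {
    assert(threads >= 0 && iters >= 0);
    TasLock lock;
    int plain = 0;                    // protected by `lock`
    std::atomic<int> atomic_count{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++plain;
                lock.unlock();
                fai_via_cas(atomic_count);
            }
        });
    }
    for (auto& w : workers) w.join();
    return Counts{plain, atomic_count.load()};
}
```

Both counters end at threads * iters because every increment is either inside the critical section or retried until its CAS succeeds. The CAS loop here compares a plain integer value, so it does not run into the ABA problem that pointer-based structures face.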