--------------------------------------------------------------------
CS 758 Programming Multicore Processors 
Fall 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Synchronization, cont, and TBB
------------

OUTLINE
* Memory Consistency, cont.
* Synch primitives
* ABA, etc. -- MESSED UP
* TBB -- DID TBB FIRST

------------------------------

Memory Consistency, cont.
-------------------

Review Sequential Consistency

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

Show two processors' program orders (po) and the global memory order (mo)

Show "railroad switch" picture

Same as multi-threaded uniprocessor


Sequential Consistency for Data-Race-Free (SC for DRF) Programs

Cake and Eat it too.


	FENCE
	lock(L)
	FENCE
	A = 1		r2 = B
	B = 1		r1 = A
	FENCE
	unlock(L)
	FENCE

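Only one thread takes the lock here, so the program has a data race; with relaxed HW, all four outcomes are possible: (r1,r2) = (0,0), (0,1), (1,0), (1,1)

But if both use locks, only two outcomes remain: (r1,r2) = (0,0), (1,1)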


			FENCE
			lock(L)
			FENCE
			r2 = B
			r1 = A
			FENCE
			unlock(L)
			FENCE
	FENCE
	lock(L)
	FENCE
	A = 1
	B = 1
	FENCE
	unlock(L)
	FENCE

Or
	FENCE
	lock(L)
	FENCE
	A = 1
	B = 1
	FENCE
	unlock(L)
	FENCE

			FENCE
			lock(L)
			FENCE
			r2 = B
			r1 = A
			FENCE
			unlock(L)
			FENCE


But can't "see" intra-critical-section reordering
* Philosophy:  If a tree falls in the woods, does it make a sound? Y or N but probably Y
* SC for DRF:  If references reorder w/i a C.S., does anyone see? N for DRF

SC for DRF Implications
* Most programmers can reason with SC
* HW implementor can implement XC (a relaxed consistency model)
* (Compiler/runtime can also reorder some)
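
The both-threads-locked case above can be sketched in standard C++, with std::mutex playing the role of lock(L)/unlock(L) and their FENCEs (acquire on lock, release on unlock). This is a minimal sketch, not code from the lecture; the function name run_locked_once is made up for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>

// Runs the two-thread A/B example once with BOTH threads taking lock L.
// Returns (r1, r2). Names are illustrative, not from the lecture.
std::pair<int, int> run_locked_once() {
    int A = 0, B = 0;          // plain shared data, protected only by L
    int r1 = -1, r2 = -1;
    std::mutex L;

    std::thread writer([&] {
        std::lock_guard<std::mutex> guard(L);   // lock(L) ... unlock(L)
        A = 1;
        B = 1;                 // may reorder with A = 1 inside the C.S.
    });
    std::thread reader([&] {
        std::lock_guard<std::mutex> guard(L);
        r2 = B;
        r1 = A;
    });
    writer.join();
    reader.join();

    assert(r1 == r2);          // data-race-free: only (0,0) or (1,1)
    return {r1, r2};
}
```

Whichever critical section runs first, the reader sees either both writes or neither, so any reordering inside a critical section stays invisible.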


Hardware Synchronization Primitives

cover only test-and-set and compare-and-swap

test and set
Boolean TAS(Boolean *a): atomic { t := *a; *a := true; return t }
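
A minimal sketch of a spin lock built on TAS, assuming standard C++11 atomics. Here exchange(true) on a std::atomic<bool> does the "read old value, write true" step (std::atomic_flag::test_and_set is the standard library's direct counterpart). TasLock and count_with_tas_lock are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Spin lock: keep doing TAS until the old value was false (lock was free).
class TasLock {
    std::atomic<bool> held{false};
public:
    void lock() {
        while (held.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Hypothetical driver: each thread increments a plain counter under the lock.
long count_with_tas_lock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    TasLock lock;
    long counter = 0;          // not atomic; protected by the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

With the lock in place, no increment is lost, so count_with_tas_lock(4, 10000) returns 40000.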

swap
word Swap(word *a, word w): atomic { t := *a; *a := w; return t }

fetch and increment
int FAI(int *a): atomic { t := *a; *a := t + 1; return t }

fetch and add
int FAA(int *a, int n): atomic { t := *a; *a := t + n; return t }


compare and swap
Boolean CAS(word *a, word old, word new):
atomic { t := (*a == old); if (t) *a := new; return t }
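
As a sketch of how CAS is typically used, the retry loop below builds fetch-and-increment out of compare-and-swap with C++11 std::atomic. One difference from the Boolean CAS above: on failure, compare_exchange_weak also writes the current value back into its first argument. The function names are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Fetch-and-increment built from CAS: read, try to install old+1, retry on failure.
int fai_via_cas(std::atomic<int>& a) {
    int old = a.load();
    // On failure `old` is refreshed with the current value, so we just retry.
    while (!a.compare_exchange_weak(old, old + 1)) { }
    return old;
}

// Hypothetical driver: several threads increment one shared counter.
int count_with_cas(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) fai_via_cas(counter);
        });
    for (auto& th : pool) th.join();
    return counter.load();
}
```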

load linked / store conditional
word LL(word *a): atomic { remember a; return *a }
Boolean SC(word *a, word w):
    atomic { t := (a is remembered, and has not been evicted since LL);
             if (t) *a := w; return t }


Other

ABA Problem -- super subtle!!
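
A minimal, single-threaded simulation of the ABA problem, assuming C++11 atomics: a value-only CAS cannot tell that a word went A -> B -> A between the read and the CAS. The second function shows one common mitigation (a version counter packed next to the value). Both functions and the pack helper are illustrative; the interleaving is written out by hand rather than produced by real threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Value-only CAS: succeeds even though the word was modified in between.
bool aba_fools_plain_cas() {
    std::atomic<int> word{'A'};
    int seen = word.load();          // "thread 1" reads A, then is delayed
    word.store('B');                 // "thread 2" changes A -> B ...
    word.store('A');                 // ... and back to A
    return word.compare_exchange_strong(seen, 'C');   // true: ABA goes unnoticed
}

// Pack a 32-bit value with a 32-bit version that is bumped on every write.
static std::uint64_t pack(std::uint32_t value, std::uint32_t version) {
    return (static_cast<std::uint64_t>(version) << 32) | value;
}

bool versioned_cas_detects_aba() {
    std::atomic<std::uint64_t> word{pack('A', 0)};
    std::uint64_t seen = word.load();                 // reads (A, v0)
    word.store(pack('B', 1));
    word.store(pack('A', 2));                         // same value, new version
    return word.compare_exchange_strong(seen, pack('C', 3));  // false: version differs
}
```

In a real lock-free stack or queue the "value" is usually a pointer, and the danger is that the node at address A was freed and reused in between.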

SC, Linearizability, Serializability, & Strict Serializability


Intel® Threading Building Blocks Tutorial
Document Number 319872-010US
http://pages.cs.wisc.edu/~markhill/restricted/intel12_TBBtutorial.pdf

Background

* C++ Library, not language, so often bizarre syntax
* Descended from OpenMP, Std Template Library, STAPL, recently CILK
* "Just Say, 'Yes'" Software. (parallel)_for all X, should we add X? Yes.


Chapter 3 "Simple" Loops
========================

serial for p.12
-----------
void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i!=n; ++i )
        Foo(a[i]);
}


parallel for
------------
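#include "tbb/tbb.h"
using namespace tbb;

// Body object: TBB calls operator() on sub-ranges (chunks) of [0,n).
class ApplyFoo {
    float *const my_a;
public:
    void operator()( const blocked_range<size_t>& r ) const {
        float *a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i )
            Foo(a[i]);
    }
    ApplyFoo( float a[] ) :
        my_a(a)
    {}
};

// Invocation: hand the whole range and a body object to parallel_for.
void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for( blocked_range<size_t>(0,n), ApplyFoo(a) );
}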

parallel for w/ lambda function p.14
------------------------------
#include "tbb/tbb.h"
using namespace tbb;
#pragma warning( disable: 588)
void ParallelApplyFoo( float* a, size_t n ) {
    parallel_for( blocked_range<size_t>(0,n),
        [=](const blocked_range<size_t>& r) {
            for(size_t i=r.begin(); i!=r.end(); ++i)
                Foo(a[i]);
        }
    );
}


Chunking Automatic or Controlled p.15

parallel_reduce p.21

Sequential
float SerialSumFoo( float a[], size_t n ) {
    float sum = 0;
    for( size_t i=0; i!=n; ++i )
        sum += Foo(a[i]);
    return sum;
}


For independent iterations:

float ParallelSumFoo( const float a[], size_t n ) {
    SumFoo sf(a);
    parallel_reduce( blocked_range<size_t>(0,n), sf );
    return sf.my_sum;
}

Plus:

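class SumFoo {
    const float* my_a;      // const, so it can be built from a const float[]
public:
    float my_sum;
    // Accumulate into my_sum: one body may be applied to several chunks.
    void operator()( const blocked_range<size_t>& r ) {
        const float *a = my_a;
        float sum = my_sum;
        size_t end = r.end();
        for( size_t i=r.begin(); i!=end; ++i )
            sum += Foo(a[i]);
        my_sum = sum;
    }
    // Splitting constructor: a fresh body with an empty partial sum.
    SumFoo( SumFoo& x, split ) : my_a(x.my_a), my_sum(0) {}
    // Merge the partial sum of a split-off body.
    void join( const SumFoo& y ) {my_sum+=y.my_sum;}
    SumFoo( const float a[] ) :
        my_a(a), my_sum(0)
    {}
};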
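
To show what the splitting constructor and join() in SumFoo are for, here is a rough emulation of the split/join pattern in plain C++ with std::thread. It is not how TBB's scheduler works (TBB uses tasks and work stealing, and splits only when it chooses to); split_tag, SumFooSketch, reduce_sketch, and the placeholder Foo are all invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

struct split_tag {};                        // hypothetical stand-in for tbb::split

float Foo(float x) { return x * 2.0f; }     // placeholder; the notes leave Foo undefined

class SumFooSketch {
    const float* my_a;
public:
    float my_sum;
    explicit SumFooSketch(const float* a) : my_a(a), my_sum(0) {}
    SumFooSketch(SumFooSketch& x, split_tag) : my_a(x.my_a), my_sum(0) {}
    // Accumulates, because one body may be applied to several chunks.
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i != end; ++i) my_sum += Foo(my_a[i]);
    }
    void join(const SumFooSketch& y) { my_sum += y.my_sum; }
};

// Halve [begin,end) until chunks are <= grain; the right half runs in a new
// thread on a split copy of the body, then its partial sum is joined back.
void reduce_sketch(SumFooSketch& body, std::size_t begin, std::size_t end,
                   std::size_t grain) {
    assert(grain >= 1);
    if (end - begin <= grain) { body(begin, end); return; }
    std::size_t mid = begin + (end - begin) / 2;
    SumFooSketch right(body, split_tag{});
    std::thread t([&] { reduce_sketch(right, mid, end, grain); });
    reduce_sketch(body, begin, mid, grain);
    t.join();
    body.join(right);
}
```

Each split copy owns its own my_sum, so the partial sums need no locking; the only synchronization is the join at the end of each split.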


Chapter 6 Containers p.41

concurrent_hash_map
concurrent_vector
concurrent queues

Chapter 7 Mutual Exclusion p. 50
Locks etc.

Chapter 8 Atomic operations p.56

Chapter 11 The Task Scheduler p.65

Actually CILK -- discussed later