--------------------------------------------------------------------
CS 758 Programming Multicore Processors 
Fall 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Cilk
------------
* Fib Example
* Cilk Math
* Cilk compilation, THE, and runtime data

For code to run a long time, it must repeat work, via either
 iteration
 recursion

Recursive -- Not data parallel

Fib Example

Write out fib from Figure 1

Serial Elision -- show Fib tree from Book Figure 27 p. 775
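
A minimal sketch of the serial elision, written in Python rather than Cilk
(an assumption, made only to keep one language for the examples in these notes).
Comments mark where the Cilk keywords would sit; removing the keywords leaves
ordinary serial code like this.

```python
def fib(n):
    """Serial elision of a Cilk-style fib: the parallel keywords are removed."""
    if n < 2:
        return n
    x = fib(n - 1)   # Cilk: x = spawn fib(n-1);  parent may continue past the call
    y = fib(n - 2)   # Cilk: y = spawn fib(n-2);
    # Cilk: sync;    wait for both children before x and y are used
    return x + y


print(fib(10))  # 55
```

The two recursive calls are independent of each other, which is what makes
the recursion tree a source of parallelism.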


cilk -- procedure
spawn --  call & continue
	==> Does not "spawn" new work. 
	    Like a regular procedure call, leaves a call frame on the "stack".
sync -- wait -- local "barrier" like a memory fence


Inlet allows x += spawn fib(n-1)
abort
explicit locking

Work Stealing with per-processor double-ended queue.
Local processor enqueues and dequeues to/from tail (similar to a regular call stack)
Other processors steal from head


1		2		3		4
fib(5)

push fib(4)
		steal 1:fib(4)
push/pop fib(3)	push fib(3)
push fib (2)	push/pop fib(2)	steal 2:fib(3)
push/pop fib(1) push fib(1)	push fib(2)
...
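
The deque discipline can be sketched as follows. This is an illustration only:
WorkerDeque and the frame labels are hypothetical names, and no synchronization
between owner and thief is modeled.

```python
from collections import deque


class WorkerDeque:
    """One worker's double-ended queue of ready frames (hypothetical helper)."""

    def __init__(self):
        self.frames = deque()

    def push(self, frame):
        # Owner pushes a spawned frame at the tail.
        self.frames.append(frame)

    def pop(self):
        # Owner resumes the most recently pushed frame (tail), like a call stack.
        return self.frames.pop() if self.frames else None

    def steal(self):
        # A thief takes the oldest frame (head).
        return self.frames.popleft() if self.frames else None


worker1 = WorkerDeque()
worker1.push("fib(4)")
worker1.push("fib(3)")
stolen = worker1.steal()   # thief gets the oldest frame, "fib(4)"
local = worker1.pop()      # owner gets the newest frame, "fib(3)"
print(stolen, local)       # fib(4) fib(3)
```

Owner and thief work at opposite ends, so they only contend when the deque
is nearly empty.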


Cilk Math
=========

32 computations that feed 16, then 8, then 4, then 2, then 1

 32 C's
 C C C C C C C C C C C C C C C C
   C   C   C   C   C   C   C   C
       C       C       C       C
               C               C
                               C


T_inf = 1+1+1+1+1+1 = 6
T_1 = 32+16+8+4+2+1 = 63

For P=4
T_P = T_4 = 8 + 4 + 2 + 1 + 1 + 1 = 17
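
The three numbers can be reproduced with a short sketch. It assumes unit-time
computations and a schedule that finishes each level before starting the next;
level_by_level_time is a hypothetical helper name.

```python
import math

levels = [32, 16, 8, 4, 2, 1]   # unit-time computations per level of the tree


def level_by_level_time(levels, p):
    """Time on p processors when each level finishes before the next starts."""
    return sum(math.ceil(width / p) for width in levels)


t_1 = sum(levels)                       # work: 63
t_inf = len(levels)                     # span: 6, one step per level
t_4 = level_by_level_time(levels, 4)    # 8 + 4 + 2 + 1 + 1 + 1 = 17
print(t_1, t_inf, t_4)                  # 63 6 17
```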

Work Bound
----------
T_P >= T_1 / P (Work Law)
T_4 = 17 >= 63/4 = 15.75

Critical Path (span) Bound
-------------------
T_P >= T_inf (Span Law)
T_4 = 17 >= 6

Average Parallelism
--------------------
Pbar = T_1 / T_inf
Pbar = 63/6 = 10.5

Parallel Slackness
------------------
Pbar / P
10.5/4 = 2.625
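
A quick check of the two laws and the two ratios for the 63-computation
example, as a sketch (variable names are illustrative):

```python
t_1, t_inf, t_p, p = 63, 6, 17, 4   # values from the 63-computation example

work_bound = t_1 / p                # Work Law: T_P >= T_1 / P
span_bound = t_inf                  # Span Law: T_P >= T_inf
parallelism = t_1 / t_inf           # Pbar = T_1 / T_inf
slackness = parallelism / p         # Pbar / P

assert t_p >= work_bound and t_p >= span_bound
print(work_bound, span_bound, parallelism, slackness)  # 15.75 6 10.5 2.625
```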


Want
T_P = Work + Critical Path (aka Span)
T_P = T_1/P + O(T_inf)

T_P <= T_1/P + c_inf*T_inf
where c_inf, critical path overhead, smallest constant where true


ASSUMPTION OF PARALLEL SLACKNESS
Pbar/P >> c_inf

Substitute c_inf << Pbar/P = (T_1/T_inf)/P:
T_P <= T_1/P + <<((T_1/T_inf)/P)*T_inf
T_P <= T_1/P + <<(T_1/P) ~= T_1/P

WITH PARALLEL SLACK ==> "Ignore" critical path overhead
Since T_P >= T_1/P (Work Law)

T_P ~= T_1/P
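
A numeric sketch of why slack lets the critical-path term be ignored. The
values T_1 = 1e6, T_inf = 100, and c_inf = 2 are made up for illustration, and
both function names are hypothetical.

```python
def upper_bound(t_1, t_inf, p, c_inf):
    """Right-hand side of T_P <= T_1/P + c_inf*T_inf."""
    return t_1 / p + c_inf * t_inf


def overhead_fraction(t_1, t_inf, p, c_inf):
    """Critical-path term relative to the work term: c_inf / (Pbar/P)."""
    return (c_inf * t_inf) / (t_1 / p)


# Hypothetical numbers: Pbar = 1e6 / 100 = 10000.
for p in (4, 64, 1024):
    print(p, upper_bound(1_000_000, 100, p, 2), overhead_fraction(1_000_000, 100, p, 2))
```

With P = 4 the slackness is 2500 and the critical-path term is a tiny fraction
of T_1/P; the fraction grows as P approaches Pbar.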

Work overhead?
T_S = one processor w/o overhead
c_1 = T_1/T_S
c_1*T_S = T_1

T_P ~= (c_1*T_S)/P

Want to minimize c_1!!!!

Minimize overhead that applies to ALL work items.
Worry less about overhead that applies only SOMETIMES -- minimize how often --
minimize work stealing.


Cilk Compilation
----------------
fast clone -- same processor for it AND ALL
slow clone -- parallelism but overhead -- convert when stolen

Because of deque -- descendants of fast clone are always fast clones


THE Protocol
------------
Resolve a possible but rare conflict between the worker and one or more thieves.
Minimize worker work when there is no thief -- usually need not get lock L.
	Similar to asymmetric or biased locks
Push just non-atomically increments T (tail), e.g., tmp=T;tmp=tmp+1;T=tmp
Pop tries to non-atomically decrement T
E variable missing in simplified code
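
A sketch of the simplified protocol (no E variable), in Python for
illustration. TheDeque and its frames array are hypothetical names. A real
implementation depends on memory-ordering guarantees between the worker's
write of T and its read of H, which this sketch does not model.

```python
import threading


class TheDeque:
    """Simplified THE deque: H (head) and T (tail) indices plus a lock L."""

    def __init__(self, capacity=64):
        self.frames = [None] * capacity
        self.H = 0                      # index of the next frame a thief would take
        self.T = 0                      # index of the next free slot at the tail
        self.L = threading.Lock()

    def push(self, frame):
        # Worker only: plain store and increment, no lock.
        self.frames[self.T] = frame
        self.T += 1

    def pop(self):
        # Worker only: optimistic decrement; lock only on a possible conflict.
        self.T -= 1
        if self.H > self.T:             # a thief may have taken the last frame
            self.T += 1
            with self.L:
                self.T -= 1
                if self.H > self.T:     # the deque really is empty
                    self.T += 1
                    return None
        return self.frames[self.T]

    def steal(self):
        # Thief: always takes the lock, so thieves serialize among themselves.
        with self.L:
            self.H += 1
            if self.H > self.T:         # nothing left to steal
                self.H -= 1
                return None
            return self.frames[self.H - 1]


d = TheDeque()
d.push("A")
d.push("B")
d.push("C")
print(d.steal(), d.pop(), d.pop(), d.pop(), d.steal())  # A C B None None
```

The common case (pop with no thief nearby) touches only T and H and never
acquires L; the lock is paid for only when head and tail meet.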


Runtime Data
----
Show data in Figure 6

Cormen Chapter
==============

Strands: parallel analog to basic blocks.
Basic blocks are delimited by sequential control divergence/convergence points
Strands are delimited by parallel control divergence/convergence points

Figure 27.2
Put up figure
	Discuss spawn, continuation, and return edges
	Discuss strands
	Compute Work= T_1 and Span=T_inf
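
Work and span can be computed mechanically from a DAG of strands. The sketch
below uses a small hypothetical DAG (not the one in Figure 27.2) and assumes
every strand costs one time unit.

```python
from functools import lru_cache

# Hypothetical DAG of unit-cost strands: strand -> strands that depend on it.
dag = {
    "a": ["b", "c"],   # a spawns b (spawn edge) and continues with c
    "b": ["d"],        # return edge from the child into the sync strand d
    "c": ["d"],
    "d": [],
}


def work(dag):
    """T_1: total number of strands, each costing one time unit."""
    return len(dag)


def span(dag):
    """T_inf: number of strands on the longest path through the DAG."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        return 1 + max((longest_from(nxt) for nxt in dag[node]), default=0)

    return max(longest_from(node) for node in dag)


print(work(dag), span(dag))  # 4 3
```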

Discuss "A chess example"
	Alg #1:
		T_1 = 2048
		T_inf = 1
		T_32 ~= 2048/32 + 1 = 64 + 1 = 65
		T_512 ~= 2048/512 + 1 = 4 + 1 = 5

	Alg #2
		T_1 = 1024
		T_inf = 8
		T_32 ~= 1024/32 + 8 = 32 + 8 = 40	Alg. Speedup = 1.625
		T_512 ~= 1024/512 + 8 = 2 + 8 = 10	Alg. Speedup = 0.5!!!

	What went wrong?
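		Pbar = Parallelism = T_1 / T_inf
		Alg #1 Pbar = 2048/1 = 2048
		Alg #2 Pbar = 1024/8 = 128

		Parallel Slackness = Pbar / P = T_1 / (P*T_inf)
		Alg #1 	Slackness (P=32) = 2048 / (1 * 32) = 64
			Slackness (P=512) = 2048 / (1 * 512) = 4
		Alg #2 	Slackness (P=32) = 1024 / (8 * 32) = 4
			Slackness (P=512) = 1024 / (8 * 512) = 0.25
		Slackness < 1 for Alg #2 at P=512: more processors than
		parallelism, so the span term dominates T_P.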
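
The chess numbers can be recomputed with a short sketch. It uses the same
estimate as the notes, T_P ~= T_1/P + T_inf; the function name estimate is
hypothetical.

```python
def estimate(t_1, t_inf, p):
    """Running-time estimate T_P ~= T_1/P + T_inf used in the notes."""
    return t_1 / p + t_inf


algs = {"Alg #1": (2048, 1), "Alg #2": (1024, 8)}   # (T_1, T_inf)

for p in (32, 512):
    t_a = estimate(*algs["Alg #1"], p)
    t_b = estimate(*algs["Alg #2"], p)
    print(f"P={p}: T_P = {t_a}, {t_b}; speedup of #2 over #1 = {t_a / t_b}")
    for name, (t_1, t_inf) in algs.items():
        print(f"  {name}: parallelism = {t_1 / t_inf}, slackness = {t_1 / (p * t_inf)}")
```

Halving the work helps at P=32, but the longer span makes Alg #2 slower
at P=512.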