--------------------------------------------------------------------
CS 758 Programming Multicore Processors
Fall 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Cilk
------------

* Fib Example
* Cilk Math
* Cilk compilation, THE, and runtime data

For code to run a long time, it must repeat: iteration or recursion
Recursive -- Not data parallel

Fib Example
-----------

Write out fib from Figure 1
Serial Elision -- show
Fib tree from Book Figure 27 p. 775

cilk  -- procedure
spawn -- call & continue
         ==> Does not "spawn" new work. Like a regular procedure call,
             leaves a call frame on the "stack".
sync  -- wait -- local "barrier", like a memory fence
inlet -- allows x += spawn fib(n-1)
abort
explicit locking

Work Stealing with per-processor double-ended queue (deque).
  Local processor enqueues and dequeues to/from the tail
    (similar to a regular call stack)
  Other processors steal from the head

1 2 3 4
fib(5) push fib(4) steal 1:fib(4) push/pop fib(3) push fib(3) push fib(2) push/pop fib(2) steal 2:fib(3) push/pop fib(1) push fib(1) push fib(2) ...
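
Sketch of the deque discipline described above (owner works at the tail,
thieves take from the head). This is a single-threaded illustration only:
the class name WorkerDeque and its methods are hypothetical, not Cilk runtime
names, and no synchronization between owner and thief is modeled here.

```python
from collections import deque


class WorkerDeque:
    """Hypothetical per-processor deque: owner at the tail, thieves at the head."""

    def __init__(self):
        self._items = deque()

    def push(self, task):
        # Owner enqueues at the tail, like pushing a call frame.
        self._items.append(task)

    def pop(self):
        # Owner dequeues from the tail (most recently pushed work).
        return self._items.pop() if self._items else None

    def steal(self):
        # Thief takes from the head (the oldest work on this deque).
        return self._items.popleft() if self._items else None


d = WorkerDeque()
for task in ["fib(4)", "fib(3)", "fib(2)"]:
    d.push(task)

stolen = d.steal()  # thief gets the oldest entry
local = d.pop()     # owner gets the newest entry
print("stolen:", stolen, "local:", local)
```

The thief ends up with the oldest entry, which in a recursive computation like
fib tends to be the largest remaining piece of work, so steals should be rare.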

Cilk Math
=========

32 computations that feed 16, then 8, then 4, then 2, then 1

32 C's
C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C

T_inf = 1+1+1+1+1+1 = 6
T_1 = 32+16+8+4+2+1 = 63

For P=4
T_P = T_4 = 8 + 4 + 2 + 1 + 1 + 1 = 17

Work Bound
----------
T_P >= T_1 / P (Work Law)
T_4 = 17 >= 63/4 = 15.75

Critical Path (span) Bound
--------------------------
T_P >= T_inf (Span Law)
T_4 = 17 >= 6

Average Parallelism
-------------------
Pbar = T_1 / T_inf
Pbar = 63/6 = 10.5

Parallel Slackness
------------------
Pbar / P
10.5/4 = 2.625

Want T_P = Work term + Critical Path (aka Span) term
T_P = T_1/P + O(T_inf)
T_P <= T_1/P + c_inf*T_inf
  where c_inf, the critical path overhead, is the smallest constant
  for which this holds

ASSUMPTION OF PARALLEL SLACKNESS
Pbar/P >> c_inf

c_inf*T_inf << (Pbar/P)*T_inf = ((T_1/T_inf)/P)*T_inf = T_1/P
T_P <= T_1/P + c_inf*T_inf, with c_inf*T_inf << T_1/P

WITH PARALLEL SLACK ==> "Ignore" critical path overhead
Since T_P >= T_1/P (Work Law)
T_P ~= T_1/P

Work overhead?
T_S = one processor w/o overhead
c_1 = T_1/T_S
c_1*T_S = T_1
T_P ~= (c_1*T_S)/P

Want to minimize c_1!!!!
Minimize overhead that applies to ALL work items.
Worry less about overhead that SOMETIMES applies
  -- minimize how often -- minimize work stealing.

Cilk Compilation
----------------
fast clone -- same processor for it AND ALL
slow clone -- parallelism but overhead -- convert when stolen
Because of deque -- descendants of a fast clone are always fast clones

THE Protocol
------------
Resolves the possible, but rare, conflict between the worker and one or more thieves.
Minimize worker work when there is no thief -- usually need not get lock L.
Similar to asymmetric or biased locks
Push just non-atomically increments T (tail)
  e.g., tmp=T;tmp=tmp+1;T=tmp
Pop tries to non-atomically decrement T
E variable missing in simplified code

Runtime Data
------------
Show data in Figure 6

Cormen Chapter
===============

Strands: parallel analog to basic blocks.
Basic blocks are delimited by sequential control divergence/convergence points
Strands are delimited by parallel control divergence/convergence points

Figure 27.2
  Put up figure
  Discuss spawn, continuation, and return edges
  Discuss strands
  Compute Work = T_1 and Span = T_inf

Discuss "A chess example"

Alg #1: T_1 = 2048  T_inf = 1
  T_32  ~= 2048/32 + 1  = 64 + 1 = 65
  T_512 ~= 2048/512 + 1 = 4 + 1  = 5

Alg #2: T_1 = 1024  T_inf = 8
  T_32  ~= 1024/32 + 8  = 32 + 8 = 40    Alg. Speedup = 65/40 = 1.625
  T_512 ~= 1024/512 + 8 = 2 + 8  = 10    Alg. Speedup = 5/10 = 0.5!!!

What went wrong?

Pbar = Parallelism = T_1 / T_inf
  Alg #1  Pbar = 2048/1 = 2048
  Alg #2  Pbar = 1024/8 = 128

Parallel Slackness = Pbar / P = T_1 / (P*T_inf)   (P = number of processors)
  Alg #1  Slackness(32)  = 2048 / (1 * 32)  = 64
          Slackness(512) = 2048 / (1 * 512) = 4
  Alg #2  Slackness(32)  = 1024 / (8 * 32)  = 4
          Slackness(512) = 1024 / (8 * 512) = 0.25

Slackness < 1 for Alg #2 at P = 512 ==> the span term dominates,
so T_P ~= T_1/P no longer holds.
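
The chess-example arithmetic above can be rechecked with a short script. This
is a sketch that assumes the same estimate used in the notes,
T_P ~= T_1/P + T_inf; the function names estimate_tp and slackness are
made up for illustration.

```python
def estimate_tp(t1, t_inf, p):
    """Estimated time on p processors: work term plus span term (assumed model)."""
    return t1 / p + t_inf


def slackness(t1, t_inf, p):
    """Parallel slackness: parallelism (T_1 / T_inf) divided by processor count."""
    return (t1 / t_inf) / p


# (T_1, T_inf) for the two algorithms in the chess example.
algs = {"Alg #1": (2048, 1), "Alg #2": (1024, 8)}

results = {}
for p in (32, 512):
    for name, (t1, t_inf) in algs.items():
        results[(name, p)] = (estimate_tp(t1, t_inf, p), slackness(t1, t_inf, p))
        tp, slack = results[(name, p)]
        print(f"{name}  P={p:3d}  T_P ~= {tp:5.1f}  slackness = {slack}")

# Speedup of Alg #2 relative to Alg #1 at each processor count.
speedup_32 = results[("Alg #1", 32)][0] / results[("Alg #2", 32)][0]
speedup_512 = results[("Alg #1", 512)][0] / results[("Alg #2", 512)][0]
print("Alg. speedup at P=32: ", speedup_32)
print("Alg. speedup at P=512:", speedup_512)
```

Halving the work only pays off while the slackness stays well above 1; at
P = 512 the larger span of Alg #2 outweighs its smaller work.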