--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012  Section 1
Instructor: Mark D. Hill
--------------------------------------------------------------------

Outline
* Methods -- Speedup
* SPLASH-2
* (PARSEC)
* Hill & Marty Talk

-------------------
SPEEDUP -- need to beef up

Speedup(P) = Speed(P)/Speed(1) = Time(1)/Time(P)

Actually this is "problem-constrained" speedup, i.e., Strong Scaling.

What if we use the faster machine to solve a bigger problem?
Can't use Time(1)/Time-Big(P) -- the run time is held roughly constant, so that ratio is ~1.
Use Work(P)/Work(1) instead, i.e., Weak Scaling.
What is work?  **MORE HERE**

Aside: Cost-Effective Parallel Computing
Isn't Speedup(C) < C inefficient?  (C = #cores)
Much of a computer's cost is OUTSIDE the processor [Wood & Hill, IEEE Computer 2/1995]
Let Costup(C) = Cost(C)/Cost(1)
Parallel computing is cost-effective when Speedup(C) > Costup(C)
1995 SGI PowerChallenge w/ 500MB: Costup(32) = 8.6
Multicores have even lower Costups!!!

--------------------
NOT USED
Back to Multicore -- Helps Power, (Over)Simply
Performance1 = 1 Core x Frequency
Power1       = 1 Core x Frequency x Voltage^2

Duplicate Cores & Halve Frequency
Performance2 = 2 Cores x Frequency/2 = Performance1
Power2       = 2 Cores x Frequency/2 x Voltage^2 = Power1

Then Scale Voltage
Performance2' = Performance1
Power2'       = 2 Cores x Frequency/2 x (Voltage/k)^2 = Power1/k^2
where k may be 2

Leftovers from Sutter & Larus
How important is determinism? (David Hinkemeyer)
Functional programming & parallelism (Marc deK)
VHDL/Verilog (Aditya Godse)

--------------------
Woo et al., SPLASH-2, ISCA 1995.

Emphasis Points from Reviews:
* False Sharing - Syed Gilani, Tony Gregerson, Marc deK, Asim Kadav
* Working Sets - Andy Nere, Guoliang Jin
* Comp/Comm Ratio - Brian Hickmann
* Know Apps - Shijin Kong
* Increasing memory gap & PRAM - Eric Reiss

The paper presents SPLASH-2, characterizes the programs, & gives some methodological advice.

SPLASH-2 Suite (Section 3)
8 applications & 4 kernels

Concurrency & Load Balance (Section 4)
Figure 1 gives PRAM speedups.
All good, except LU, Cholesky, Radix, and Radiosity.
Figure 2 shows LU, Cholesky, & Radiosity do too much synchronization (for this data size).
Radix's problem is its tree-like parallel-prefix component.

Working Sets (Section 5)
Have "knees" (e.g., Figure 3 shows Barnes has 3).
Very important for scaling whether the appropriate working set fits in or out of the caches.

Communication-to-Computation Ratio (Section 6)
Aside: communication in shared memory is implicit via cache coherence.
If processor R reads data last written by processor W, 4-case cross-product:
* at processor R, the block is either invalid or not present, crossed with
* at processor W, the block is either modified or not present (I simplify)
In all four cases, processor R effectively takes a cache miss.
Adding processors without increasing problem size will usually increase the ratio.
Must pay attention to this in methods.

Spatial Locality & False Sharing (Section 7)
With perfect spatial locality, doubling the block size halves the miss ratio.
Actual results are mixed.
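To make the false-sharing point concrete, here is a minimal C/pthreads sketch (mine, not from the paper): two threads increment logically independent counters, once packed into the same cache block and once padded to separate blocks. The block size LINE = 64 and the iteration count are assumptions for illustration.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL     /* increments per thread (illustrative) */
#define LINE  64              /* assumed cache-block size in bytes */

/* One counter per cache block: alignment keeps neighbors in separate blocks. */
struct padded { _Alignas(LINE) volatile uint64_t count; };

static volatile uint64_t shared_line[2];  /* adjacent words: false sharing */
static struct padded     padded_line[2];  /* one block per counter */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *worker(void *arg)
{
    volatile uint64_t *c = arg;
    for (uint64_t i = 0; i < ITERS; i++)
        (*c)++;               /* each store may invalidate the other thread's copy */
    return NULL;
}

static void run(const char *name, volatile uint64_t *c0, volatile uint64_t *c1)
{
    pthread_t t0, t1;
    double start = now_sec();
    pthread_create(&t0, NULL, worker, (void *)c0);
    pthread_create(&t1, NULL, worker, (void *)c1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%-14s %.2f s\n", name, now_sec() - start);
}

int main(void)
{
    run("false sharing:", &shared_line[0], &shared_line[1]);
    run("padded:",        &padded_line[0].count, &padded_line[1].count);
    return 0;
}

Compile with gcc -O2 -pthread. On most multicores the falsely shared run is noticeably slower even though no data is logically shared: the coherence protocol ping-pongs the block exactly as in the four-case miss analysis above.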
--------------------------------
Programming Models

Definitions
Thread  -- PC
Process -- address space

Basic Programming Models (draw pictures)
Multitasking:    n x (1 thread/process) w/o communication
Message-Passing: n x (1 thread/process) w/ messages
Shared Memory:   n threads in 1 process (how do they communicate?)
Shared Memory':  n x (1 thread/process) w/ "System V" shared memory
Sequential:      1 thread/process with parallelizing software
Data Parallel:   1 thread/process with data-parallel ops (generalization of SIMD)

Simple Problem (NOT USED YET)
  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]

Split the loop -- the first loop's iterations are independent:
  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
  for i = 1 to N
    sum = sum + A[i]

Data flow graph?

Shared Memory
  private int i, my_start, my_end, mynode;
  shared float A[N], B[N], C[N], sum;
  for i = my_start to my_end
    A[i] = (A[i] + B[i]) * C[i]
  GLOBAL_SYNCH;
  if (mynode == 0)
    for i = 1 to N
      sum = sum + A[i]

Message Passing
  int i, my_start, my_end, mynode;
  float A[N/P], B[N/P], C[N/P], sum, tmp;
  for i = 1 to N/P
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]
  if (mynode != 0)
    send(sum, 0);
  if (mynode == 0)
    for i = 1 to P-1
      recv(tmp, i)
      sum = sum + tmp

(A runnable pthreads version of the shared-memory variant follows below.)
----------------
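Below is one way to turn the shared-memory pseudocode into runnable C with pthreads -- a sketch under assumptions, not the course's reference code: GLOBAL_SYNCH becomes a pthread barrier, "shared" variables become globals, "private" ones live on each thread's stack, and N, P, and the input values are made up for illustration.

#include <pthread.h>
#include <stdio.h>

#define N 1024          /* problem size (illustrative) */
#define P 4             /* number of threads (illustrative) */

static float A[N], B[N], C[N], sum;   /* shared */
static pthread_barrier_t barrier;     /* plays the role of GLOBAL_SYNCH */

static void *work(void *arg)
{
    int mynode   = (int)(long)arg;            /* private */
    int my_start = mynode * (N / P);
    int my_end   = (mynode == P - 1) ? N : my_start + N / P;

    for (int i = my_start; i < my_end; i++)   /* independent iterations */
        A[i] = (A[i] + B[i]) * C[i];

    pthread_barrier_wait(&barrier);           /* GLOBAL_SYNCH */

    if (mynode == 0)                          /* serial reduction on node 0 */
        for (int i = 0; i < N; i++)
            sum = sum + A[i];
    return NULL;
}

int main(void)
{
    pthread_t t[P];

    for (int i = 0; i < N; i++) {             /* made-up inputs */
        A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.5f;
    }
    pthread_barrier_init(&barrier, NULL, P);

    for (long p = 0; p < P; p++)
        pthread_create(&t[p], NULL, work, (void *)p);
    for (int p = 0; p < P; p++)
        pthread_join(t[p], NULL);

    printf("sum = %f\n", sum);                /* (1+2)*0.5 per element => 1536 */
    pthread_barrier_destroy(&barrier);
    return 0;
}

As in the pseudocode, the reduction is left serial on node 0; a parallel version would need per-thread partial sums or a lock, which is exactly the point the message-passing variant makes explicit with its send/recv of partial sums.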