--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Introduction
------------

Download CSTB Talk

Write Outline on Board
* CSTB Talk
* Class Structure
* Sutter/Larus
* Maybe Olukotun & Hammond Queue05

Next Lecture
  752 Review
  Niagara

------------------------------

Software and the Concurrency Revolution
Herb Sutter & Jim Larus
ACM Queue 09/2005

Concurrency more disruptive than OO
* sole path to high performance
* concurrent programming hard
  (e.g., can't look at just one context)

Where concurrency?
* Easy to FIND on servers or cloud
  (if small requests mediated by shared store)
  (but still a challenge to exploit)
* Hard to even find on client
(Modern apps split across both)

High-order issues
* granularity of operations
  -- from single instructions to large executions
* degree of coupling
  -- communication and synchronization
  -- small to embarrassingly parallel

Types of parallelism
* Independent    matrix A = matrix A * 2
* Regular        A[i,j] = avg of neighbors
* Unstructured

How coordinate?
* Locks are the default but not composable
  (must peek into implementation of abstraction)
  subtle deadlock
  optional conventions
* Synchronized methods -- too strong and too weak
* Lock-free programming -- too hard
* TM -- not here yet

What from PLs?
* Automatic parallelism would be nice
* Functional programming to the rescue -- probably not, due to side effects
* High-level abstraction (revealed for functional programming)
  promising (e.g., map-reduce)
* Also futures & active objects

------------------------------

DID NOT USE

The Future of Microprocessors
Kunle Olukotun & Lance Hammond
ACM Queue 09/2005

Get Performance
  See Fig. 1 on Intel Performance Over Time
  Factors:
  (1) Faster Clock
  (2) Instruction Level Parallelism
  (3) Multiprocessing/multithreading

(1) Faster Clock
  hitting power -- see Fig. 3 Intel Power

  Aside: Multicore Helps Power
  (Over)simply:
    Performance1 = 1 Core x Frequency
    Power1       = 1 Core x Frequency x Voltage^2
  Duplicate Cores & Halve Frequency
    Performance2 = 2 x Core x Frequency/2 = Performance1
    Power2       = 2 x Core x Frequency/2 x Voltage^2 = Power1
  Then Scale Voltage
    Performance2' = Performance1
    Power2'       = 2 x Core x Frequency/2 x (Voltage/k)^2 = Power1/k^2
    where k may be 2

(2) Mined out (see Fig. 2 Intel ILP)
  Furthermore, memory latency.

(3) Go to Multiprocessing
  EZ for servers
  Clients?
  Parallel programming in graduate course -- guilty

CMP easier than SMP
  Communication with higher BW and lower latency
  (But shared cache)

Can hardly buy a single core today.
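
Worked check of the "Multicore Helps Power" aside (a sketch, not part of the original notes): the snippet below plugs numbers into the same oversimplified model, Performance = Cores x Frequency and Power = Cores x Frequency x Voltage^2. The function names, the normalized units (1.0 for the baseline frequency and voltage), and the choice k = 2 are assumptions made for illustration. Under this model, dividing the voltage by k lowers power by a factor of k^2.

```python
# Oversimplified model from the aside; all quantities are normalized so that
# the single-core baseline has frequency = 1.0 and voltage = 1.0 (assumption).

def performance(cores, frequency):
    """Performance = Cores x Frequency."""
    return cores * frequency

def power(cores, frequency, voltage):
    """Power = Cores x Frequency x Voltage^2."""
    return cores * frequency * voltage ** 2

# Baseline: one core at full frequency and full voltage.
perf1 = performance(1, 1.0)
power1 = power(1, 1.0, 1.0)

# Duplicate cores and halve frequency: same performance, same power.
perf2 = performance(2, 0.5)
power2 = power(2, 0.5, 1.0)

# Then scale voltage down by k (k = 2 is an illustrative choice).
k = 2.0
perf2_scaled = performance(2, 0.5)
power2_scaled = power(2, 0.5, 1.0 / k)

print("Performance1 =", perf1, " Power1 =", power1)
print("Performance2 =", perf2, " Power2 =", power2)
print("Performance2' =", perf2_scaled, " Power2' =", power2_scaled)
print("Power1 / Power2' =", power1 / power2_scaled)  # k^2 = 4.0
```

The model ignores static (leakage) power and assumes the workload parallelizes perfectly across both cores, so the ratio is an upper bound on the benefit rather than a prediction.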