--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Introduction
------------

Download CSTB Talk

Write Outline on Board

* CSTB Talk
* Class Structure
* Sutter/Larus
* Maybe Olukotun&Hammond Queue05

Next Lecture
   752
   Review Niagara

------------------------------

Software and the Concurrency Revolution
Herb Sutter & Jim Larus
ACM Queue 09/2005

Concurrency more disruptive than OO
* sole path to high performance
* concurrent programming hard (e.g., can't look at just one context)

Where concurrency?
* Easy to FIND on servers or cloud (if small requests mediated by shared store)
  (but still a challenge to exploit)
* hard to even find on client
  (Modern apps split across both)

High-order issues
* granularity of operations -- from single instructions to large executions
* degree of coupling -- communication and synchronization -- from tightly coupled to embarrassingly parallel

Types of parallelism
* Independent
  matrix A = matrix A * 2
* Regular
  A[i,j] = avg of neighbors
* Unstructured
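
The first two categories can be sketched in a few lines. This is a serial illustration only, not a parallel implementation; the function names and the 4-neighbor stencil are assumptions chosen for the example (the notes say only "avg of neighbors"). The point is the dependence pattern: the independent case touches one element per output, the regular case reads a fixed neighborhood from the old grid.

```python
# Illustrative only: plain Python lists, no parallel runtime.

def scale_independent(a, factor=2):
    """Independent: each output element depends only on its own input element."""
    return [[x * factor for x in row] for row in a]


def average_neighbors(a):
    """Regular: each interior element becomes the mean of its 4 neighbors.

    Reads come from the old grid and writes go to a new grid, so all elements
    of one sweep could be computed in parallel. Border cells are copied as-is
    (an assumption; the notes do not say how borders are handled).
    """
    rows, cols = len(a), len(a[0])
    out = [row[:] for row in a]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1]) / 4
    return out


doubled = scale_independent([[0, 0, 0], [0, 8, 0], [0, 0, 0]])
smoothed = average_neighbors([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(doubled[1][1], smoothed[1][1])
```

Unstructured parallelism has no such fixed pattern, which is why it is not sketched here.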

How coordinate?
* Locks are the default but
  not composable (must peek into implementation of abstraction)
  subtle deadlock
  optional conventions
* Synchronized methods -- too strong and too weak
* Lock-free programming -- too hard
* TM -- not here yet
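
A minimal sketch of the composability problem with locks, assuming a hypothetical `Account` class invented for this example. Each method is atomic on its own, but an atomic transfer cannot be built from `withdraw` plus `deposit`: the caller has to reach into both accounts' locks, and has to follow an ordering convention (here, lock the lower `ident` first) that nothing enforces.

```python
import threading


class Account:
    """Hypothetical abstraction: each method is individually thread-safe."""

    def __init__(self, ident, balance):
        self.ident = ident
        self.balance = balance
        self.lock = threading.Lock()

    def withdraw(self, amount):
        with self.lock:
            self.balance -= amount

    def deposit(self, amount):
        with self.lock:
            self.balance += amount


def transfer(src, dst, amount):
    """Atomic transfer; must peek at the accounts' internal locks."""
    if src is dst:
        return  # locking the same non-reentrant lock twice would deadlock
    # Convention: always acquire locks in ascending ident order. If one thread
    # locked src first and another locked dst first, they could deadlock.
    first, second = sorted((src, dst), key=lambda acct: acct.ident)
    with first.lock, second.lock:
        # Cannot call src.withdraw() here: it would re-acquire a held lock.
        src.balance -= amount
        dst.balance += amount


a, b = Account(1, 100), Account(2, 100)


def worker(src, dst):
    for _ in range(1000):
        transfer(src, dst, 1)


threads = [
    threading.Thread(target=worker, args=(a, b)),
    threading.Thread(target=worker, args=(b, a)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = a.balance + b.balance
print(total)
```

The ordering rule is exactly the kind of optional convention the list above mentions: the code works only as long as every caller follows it.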

What from PLs?
* Automatic parallelism would be nice
* functional programming to rescue -- probably not due to side-effects
* high-level abstraction (revealed for functional programming) promising (e.g., map-reduce)
* Also futures & active objects
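
A small sketch of the higher-level abstractions mentioned above, using Python's standard `concurrent.futures` module. The word-count task and the sample lines are assumptions made up for illustration. The map step runs a side-effect-free function over independent inputs, the reduce step combines the results, and the future is a handle to a value that is computed asynchronously. No locks appear in user code. (In CPython, threads mostly do not speed up CPU-bound work; the sketch shows the programming model, not a speedup.)

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def word_count(line):
    """Pure function: no shared state, so calls can run in any order."""
    return len(line.split())


lines = [
    "concurrency is hard",
    "locks do not compose",
    "map reduce hides threads",
]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Map: apply word_count to each line, possibly concurrently.
    counts = list(pool.map(word_count, lines))
    # Future: submit work now, ask for the result later.
    future = pool.submit(sum, [1, 2, 3])
    future_value = future.result()

# Reduce: combine the per-line counts into one total.
total_words = reduce(operator.add, counts, 0)
print(counts, total_words, future_value)
```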


------------------------------

DID NOT USE

The Future of Microprocessors
Kunle Olukotun & Lance Hammond
ACM Queue 09/2005

Get Performance 

See Fig. 1 on Intel Performance Over Time

Factors:

(1) Faster Clock
(2) Instruction Level Parallelism 
(3) Multiprocessing/multithreading

(1) Faster Clock hitting power -- see Fig. 3 Intel Power

Aside: Multicore Helps Power

(Over)simplifying:
  Performance1 = 1 Core x Frequency
  Power1 = 1 Core x Frequency x Voltage^2

Duplicate Cores & Halve Frequency
  Performance2 = 2 x Core x Frequency / 2 = Performance1
  Power2 = 2 x Core x Frequency/2 x Voltage^2 = Power1

Then Scale Voltage
  Performance2' = Performance1
  Power2' = 2 x Core x Frequency/2 x (Voltage/k)^2 = Power1/k^2 where k may be 2
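
The arithmetic above can be checked numerically. This is a sketch of the same oversimplified model (performance = cores x frequency, power = cores x frequency x voltage^2); the specific values for frequency, voltage, and k are arbitrary assumptions. Because voltage enters squared, dividing it by k divides power by k^2.

```python
def performance(cores, freq):
    """Oversimplified model: performance scales with cores x frequency."""
    return cores * freq


def power(cores, freq, voltage):
    """Oversimplified model: power scales with cores x frequency x voltage^2."""
    return cores * freq * voltage ** 2


freq, voltage, k = 2.0, 1.0, 2.0  # arbitrary illustrative values

perf1 = performance(1, freq)
power1 = power(1, freq, voltage)

# Duplicate cores and halve frequency: same performance, same power.
perf2 = performance(2, freq / 2)
power2 = power(2, freq / 2, voltage)

# Then scale voltage down by k: same performance, power drops by k^2.
power2_scaled = power(2, freq / 2, voltage / k)

print(perf1, perf2, power1, power2, power2_scaled)
```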


(2) Mined out (see Fig. 2 Intel ILP)

Furthermore, memory latency.

(3) Go to Multiprocessing

EZ for servers
Clients?

Parallel programming in graduate course -- guilty

CMP easier than SMP

   Communication with higher BW and lower latency
   (But shared cache)

Can hardly buy a single core today.