--------------------------------------------------------------------
CS 758 Programming Multicore Processors
Fall 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
Introduction
------------

Write Outline on Board
* PRAM
* Class Structure
* Models
* Sutter/Larus

Next Lecture
  Architecture & 757 Review
  Niagara (new questions)

------------------------------------
PRAM
------------------------------------

Sequential
  Time to sum n numbers?  O(n)
  Time to sort n numbers? O(n log n)
  What model? RAM

Parallel
  Time to sum?  Tree for O(log n)
  Time to sort? Non-trivially O(log n)
  What model? PRAM [Fortune & Wyllie STOC78]
    P processors in lock-step
    One memory (e.g., CREW for concurrent read, exclusive write)

Why not realistic?
  Asynchrony means synchronization needed
  Latencies grow as the system size grows
  Bandwidths are restricted by memory organizations and interconnection networks
  Dealing with reality leads to a division between
    UMA: Uniform Memory Access and
    NUMA: Non-Uniform Memory Access

How build?

------------------------------------
Class Structure
------------------------------------

How build? Show:

  P  P  P     P  P  P     P--P--P
  |  |  |     |  |  |     |  |  |
  M  M  M     M--M--M     M--M--M
  |  |  |     |  |  |     |  |  |
  N--N--N     N--N--N     N--N--N

  Shared      Shared      All Shared
  I/O         Memory      (SIMD?)

Much of this course -- shared memory
Later -- clusters for shared I/O
SIMD

Flynn's Taxonomy
  SISD SIMD MIMD (MISD)

Definitions
  Thread  -- PC
  Process -- address space

Basic Programming Models (draw pictures)
  Multitasking:    n x (1 thread/process) w/o communication
  Message-Passing: n x (1 thread/process) w/ messages
  Shared Memory:   n threads in 1 process (how communicate?)
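The "n threads in 1 process" model above can be sketched in a few lines (a minimal Python sketch, assuming CPython's threading module; the lecture names no particular language or API). It answers "how communicate?" directly: the threads read and write the same variable, and a lock supplies the synchronization that asynchrony forces.

```python
# Sketch: n threads in one process, communicating through shared memory.
# The shared variable and worker function are illustrative, not from the notes.
import threading

counter = 0                 # lives in the single shared address space
lock = threading.Lock()     # threads run asynchronously, so writes need a lock

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # exclusive write (the "EW" in CREW)
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)              # 40000: every thread saw the same memory
```

Contrast with message passing: here nothing is sent anywhere; communication happens implicitly because all threads share one address space.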
  Shared Memory': n x (1 thread/process) w/ "System V" shared memory
  Sequential:     1 thread/process with parallelizing software
  Data Parallel:  1 thread/process with data-parallel ops
                  (generalization of SIMD)
                  or n threads in lock-step w/ shared memory

GPU SIMT (Single Instruction, Multiple Threads)
  to first order, n threads in lock-step w/ shared memory
  best performance: program as data parallel
  but threads can diverge -- dividing performance

------------------------------
Software and the Concurrency Revolution
Herb Sutter & Jim Larus
ACM Queue 09/2005
------------------------------

Concurrency more disruptive than OO
* sole path to high performance
* concurrent programming hard (e.g., can't look at just one context)

Where concurrency?
* Easy to FIND on servers or cloud
  (if small requests mediated by shared store)
  (but still a challenge to exploit)
* Hard to even find on client
  (Modern apps split across both)

High-order issues
* granularity of operations -- from single instructions to large executions
* degree of coupling -- communication and synchronization -- small to
  embarrassingly parallel

Types of parallelism
* Independent   matrix A = matrix A * 2
* Regular       A[i,j] = avg of neighbors
* Unstructured

How coordinate?
* Locks are the default but not composable
  (must peek into implementation of abstraction)
  subtle deadlock
  optional conventions
* Synchronized methods -- too strong and too weak
* Lock-free programming -- too hard
* TM -- not here yet

What from PLs?
* Automatic parallelism would be nice
* Functional programming to the rescue -- probably not, due to side effects
* High-level abstraction (from functional programming) promising
  (e.g., map-reduce)
* Also futures & active objects
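The map-reduce abstraction named above can be sketched as a toy word count (a Python sketch under assumed names -- map_phase, reduce_phase, and the three-line corpus are illustrative, not from Sutter & Larus). The point is the one the article makes: the programmer writes side-effect-free map and reduce functions, which leaves the framework free to run the map phase in parallel.

```python
# Sketch of the map-reduce style: side-effect-free phases, parallel map.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_phase(line):
    """Map: one line -> per-line word counts (touches no shared state)."""
    return Counter(line.split())

def reduce_phase(a, b):
    """Reduce: merge two partial counts (associative, so tree-reducible)."""
    a.update(b)
    return a

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Because map_phase has no side effects, these calls may run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_phase, lines))

totals = reduce(reduce_phase, partials, Counter())
print(totals["the"], totals["fox"])   # prints "3 2"
```

Note that reduce_phase is associative, so the combine step could itself be done as an O(log n) tree -- the same shape as the PRAM sum at the start of the lecture.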