--------------------------------------------------------------------
CS 758  Programming Multicore Processors
Fall 2012  Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Outline

* OpenMP (see 838 notes?)
* Amdahl's Law, etc.
* Dividing Up Work

--------------------------------
OpenMP

Simple use (see documentation for richer use):

1. Write a sequential program (with an eye to parallelization)
2. Add OMP directives to parallelize key loops

Example from addmatrix.c (a complete, runnable version appears at
the end of these notes):

  // does A = B + C (write without pragma first)
  // handle the outer loop in parallel
  #pragma omp parallel for \
          shared(A,B,C,xdim,ydim) \
          private(i,j) \
          schedule(dynamic)
  for (j = 0; j < ydim; j++) {
      for (i = 0; i < xdim; i++) {
          A[i][j] = B[i][j] + C[i][j];
      }
  }

------------------------------------------
Amdahl's Law, etc.

Speedup(C) = Time(1 CPU) / Time(C CPUs)

Amdahl's Law: if fraction F of execution is parallelizable, then
  Speedup(C) = 1 / ((1-F) + F/C)
so no matter how large C gets, Speedup(C) <= 1/(1-F).

Cost-effectiveness [Wood & Hill, "Cost-Effective Parallel
Computing," IEEE Computer, 1995]:
  Costup(C) = Cost(C CPUs) / Cost(1 CPU)
  Parallelism pays off when Speedup(C) > Costup(C)

1995 SGI PowerChallenge w/ 500MB: Costup(32) = 8.6
  ==> on 32 CPUs, any speedup above 8.6 (a parallel efficiency of
      only 8.6/32, about 27%) already pays for the extra CPUs.

Multicores have even lower costups!!!

------------------------------------------
Dividing Up Work

**** DID NOT "WORK" -- TOO ABSTRACT ****

* Can think of dividing up a monolithic execution, or
* do a fine-grain breakup and then re-join things.

Let's try the latter.... Use Ocean as a running example?

Think of a dynamic program execution as a graph:
  nodes -- "chunks" of computational work
  edges -- dependences

Usually:
* some nodes do "different" work on "the same" data
* some nodes do "the same" work on "different" data

Putting nodes together:
  nodes with "different" work on "the same" data
    ==> functional, pipelined parallelism
  nodes with "the same" work on "different" data
    ==> data parallelism
(sketch contrasting the two at the end of these notes)

Usually there is more data parallelism, especially as the data
gets larger.

Also, overheads arise mostly between nodes that are NOT put
together.  Thus, we want to join nodes that would otherwise have a
lot of overhead between them.

Overheads:
  Dependences
  Synchronization
  Communication

IF the graph structure and node computation times are predictable
THEN statically group (a.k.a. statically partition)
ELSE dynamically group (a.k.a. dynamically partition)
(scheduling sketch at the end of these notes)

Consider a work queue for dynamically partitioning work:
* small vs. large elements
* centralized vs. distributed
(work-queue sketch at the end of these notes)
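------------------------------------------
Sketch: addmatrix.c, end to end

A minimal, self-contained version of the addmatrix.c example above,
assuming gcc (any OpenMP-capable compiler works); the array sizes
and the main() driver are made up for illustration.

  /* Build:  gcc -fopenmp addmatrix.c
   * Run:    OMP_NUM_THREADS=4 ./a.out
   */
  #include <stdio.h>

  #define XDIM 1000
  #define YDIM 1000

  static double A[XDIM][YDIM], B[XDIM][YDIM], C[XDIM][YDIM];

  int main(void)
  {
      int i, j;
      int xdim = XDIM, ydim = YDIM;

      /* Step 1: write the sequential program */
      for (j = 0; j < ydim; j++)
          for (i = 0; i < xdim; i++) {
              B[i][j] = i;
              C[i][j] = j;
          }

      /* Step 2: add an OMP directive to the key loop;
         does A = B + C with the outer loop in parallel */
      #pragma omp parallel for \
              shared(A,B,C,xdim,ydim) \
              private(i,j) \
              schedule(dynamic)
      for (j = 0; j < ydim; j++)
          for (i = 0; i < xdim; i++)
              A[i][j] = B[i][j] + C[i][j];

      printf("A[1][2] = %.1f (expect 3.0)\n", A[1][2]);
      return 0;
  }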
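------------------------------------------
Sketch: data vs. functional parallelism

A sketch of the two groupings, again with OpenMP.  stage1() and
stage2() are hypothetical stand-ins for node "work", and the file
name is illustrative.  The sections run different work concurrently
on independent data; a full pipeline would additionally stream each
datum through stage1 and then stage2, overlapping stages across
items.

  /* Build:  gcc -fopenmp parallel_kinds.c */
  #include <stdio.h>

  #define N 1000

  static double a[N], b[N];

  static void stage1(double *x) { *x = *x * 2.0; }  /* hypothetical work */
  static void stage2(double *x) { *x = *x + 1.0; }  /* hypothetical work */

  int main(void)
  {
      int i;

      for (i = 0; i < N; i++)       /* sequential init */
          a[i] = b[i] = i;

      /* Data parallelism: the SAME work on DIFFERENT data.
         Independent iterations split across threads. */
      #pragma omp parallel for private(i)
      for (i = 0; i < N; i++)
          stage1(&a[i]);

      /* Functional parallelism: DIFFERENT work running concurrently.
         The two sections touch disjoint arrays, so they are
         independent nodes of the execution graph. */
      #pragma omp parallel sections
      {
          #pragma omp section
          { int k; for (k = 0; k < N; k++) stage1(&b[k]); }

          #pragma omp section
          { int k; for (k = 0; k < N; k++) stage2(&a[k]); }
      }

      printf("a[1] = %.1f, b[1] = %.1f\n", a[1], b[1]);
      return 0;
  }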
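------------------------------------------
Sketch: static vs. dynamic grouping

The IF/THEN/ELSE rule above maps directly onto OpenMP's schedule
clause.  work() is a hypothetical node whose cost varies
unpredictably per iteration; the file name is illustrative.

  /* Build:  gcc -fopenmp scheduling.c -lm */
  #include <stdio.h>
  #include <math.h>

  #define N 100000

  static double cell[N];

  static double work(int i)         /* per-node cost depends on i */
  {
      double x = 0.0;
      int k;
      for (k = 0; k < i % 997; k++)
          x += sin((double)k);
      return x;
  }

  int main(void)
  {
      int i;

      /* Predictable, uniform nodes ==> group statically.  Each
         thread gets one contiguous chunk up front; near-zero
         scheduling overhead at run time. */
      #pragma omp parallel for schedule(static)
      for (i = 0; i < N; i++)
          cell[i] = (double)i;

      /* Unpredictable node times ==> group dynamically.  Threads
         grab 64-iteration chunks as they finish, trading scheduling
         overhead for load balance. */
      #pragma omp parallel for schedule(dynamic, 64)
      for (i = 0; i < N; i++)
          cell[i] += work(i);

      printf("cell[N-1] = %f\n", cell[N - 1]);
      return 0;
  }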
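------------------------------------------
Sketch: a centralized work queue

A minimal centralized work queue for dynamic partitioning, using
OpenMP's lock API.  Each queue element is a chunk of CHUNK
iterations, which is the "small vs. large elements" knob: large
chunks reduce contention on the shared queue, small chunks balance
load better.  A distributed variant would instead give each thread
its own queue and let idle threads steal from the others.  The file
name and CHUNK value are illustrative.

  /* Build:  gcc -fopenmp workqueue.c */
  #include <stdio.h>
  #include <omp.h>

  #define N     100000
  #define CHUNK 256              /* small-vs-large elements knob  */

  static double cell[N];
  static int next = 0;           /* head of the centralized queue */
  static omp_lock_t qlock;       /* protects the queue            */

  /* Dequeue the next chunk [*lo, *hi); return 0 when the queue is empty. */
  static int dequeue(int *lo, int *hi)
  {
      int got = 0;
      omp_set_lock(&qlock);
      if (next < N) {
          *lo  = next;
          *hi  = (next + CHUNK < N) ? next + CHUNK : N;
          next = *hi;
          got  = 1;
      }
      omp_unset_lock(&qlock);
      return got;
  }

  int main(void)
  {
      omp_init_lock(&qlock);

      #pragma omp parallel
      {
          int lo, hi, i;
          while (dequeue(&lo, &hi))          /* run until queue drains */
              for (i = lo; i < hi; i++)
                  cell[i] = (double)i * 2.0; /* the node's "work" */
      }

      omp_destroy_lock(&qlock);
      printf("cell[N-1] = %f\n", cell[N - 1]);
      return 0;
  }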