--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Outline
* Programming Models
* OpenMP (see 838 notes?)
* Hillis & Steele

--------------------------------
Programming Models

REVIEW {

Definitions
  Thread  -- PC
  Process -- address space

Basic Programming Models (draw pictures)
  Multitasking:    n x (1 thread/process) w/o communication
  Message-Passing: n x (1 thread/process) w/ messages
  Shared Memory:   n threads in 1 process (how communicate?)
  Shared Memory':  n x (1 thread/process) w/ "System V" shared memory
  Sequential:      1 thread/process with parallelizing software
  Data Parallel:   1 thread/process with data-parallel ops
                   (generalization of SIMD)
}

Simple Problem

  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]

Split the loops -- the first then has independent iterations:

  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
  for i = 1 to N
    sum = sum + A[i]

Data flow graph?

In a perfect world, the compiler would parallelize this.
Works for this case, but not robust.

Mention SPMD -- Single Program Multiple Data
* Write one program
* Copy onto many threads/nodes
* Read a key variable (e.g., node ID) to do different work:
  same work on different data, or completely different work

Shared Memory (Pthreads sketch at the end of these notes)

  private int i, my_start, my_end, mynode;
  shared float A[N], B[N], C[N], sum;

  for i = my_start to my_end
    A[i] = (A[i] + B[i]) * C[i]
  GLOBAL_SYNCH;
  if (mynode == 0)
    for i = 1 to N
      sum = sum + A[i]

Message Passing (MPI sketch at the end of these notes)

  int i, my_start, my_end, mynode;
  float A[N/P], B[N/P], C[N/P], sum;

  for i = 1 to N/P
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]
  if (mynode != 0)
    send(sum, 0);
  if (mynode == 0)
    for i = 1 to P-1
      recv(tmp, i)
      sum = sum + tmp

Data Parallel

  shared float A[N], B[N], C[N];
  (optional directives on data layout)

  A = (A + B) * C;
  sum = reduce_add(A);

* Optional directives on data layout
* Often "owner computes"
* Compiler manages communication
* Performance disappointing

--------------------------------
OpenMP

Simple use (see documentation for richer use):
1. Write sequential program (with an eye to parallelization)
2. Add OMP directives to parallelize key loops

Example from addmatrix.c (complete program at the end of these notes):

  // does A = B + C (write without pragma first)
  // handle the outer loop in parallel
  #pragma omp parallel for \
      shared(A,B,C,xdim,ydim) \
      private(i,j) \
      schedule(dynamic)
  for (j = 0; j < ydim; j++)
    for (i = 0; i < xdim; i++)
      A[i][j] = B[i][j] + C[i][j];

--------------------------------
Hillis & Steele, "Data Parallel Algorithms," CACM, Dec. 1986

Sum of x[1..n] in log2(n) steps (tree reduction):

  for j = 1 to log2(n)
    for all k in parallel
      if (k mod 2^j == 0)
        x[k] = x[k - 2^(j-1)] + x[k]

Parallel prefix: all partial sums in the same log2(n) steps.

  for j = 1 to log2(n)
    for all k in parallel
      if (k > 2^(j-1))            *** only change ***
        x[k] = x[k - 2^(j-1)] + x[k]

Works for any associative operation (e.g., MAX). Very powerful.

Say we want to gather the non-zero elements of x[] into y[]
(C sketch at the end of these notes):
* Everyone tests whether their x[i] is non-zero and sets nz[i]
* Do a parallel-prefix add on nz[i] into idx[i]
  /* idx[i] gives the number of non-zero elements up to x[i] */
* if (nz[i] == 1) y[idx[i]] = x[i]

Paper goes on to many others.

Issues from Reviews
* VP abstraction -- Tony
* Flexible communication? -- Daniel
* Count/enumerate in unit time? -- Shijin
* Idle VPs during tree calculation -- Aditya
* Linked list -- David, Marc, Eric (+jump ptrs, -caching, -communication)
* Combinator reduction -- Syed (forget it)
* Future? -- Asim (leave to end; cf. CM-5 & Nvidia CUDA)
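
--------------------------------
Sketches in C

A minimal Pthreads rendering of the shared-memory pseudocode above;
P = 4, N, the block partitioning, and the data values are illustrative
assumptions, and GLOBAL_SYNCH becomes a pthread barrier.  Build with
gcc -pthread.

  /* Pthreads sketch of the shared-memory version (assumed details). */
  #include <pthread.h>
  #include <stdio.h>

  #define N 1024
  #define P 4

  float A[N], B[N], C[N], sum;     /* shared among all threads */
  pthread_barrier_t barrier;

  void *worker(void *arg) {
      int mynode   = (int)(long)arg;       /* private per thread */
      int my_start = mynode * (N / P);     /* assume P divides N */
      int my_end   = my_start + (N / P);

      for (int i = my_start; i < my_end; i++)
          A[i] = (A[i] + B[i]) * C[i];

      pthread_barrier_wait(&barrier);      /* GLOBAL_SYNCH */

      if (mynode == 0)                     /* node 0 reduces serially */
          for (int i = 0; i < N; i++)
              sum = sum + A[i];
      return NULL;
  }

  int main(void) {
      pthread_t tid[P];
      for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }
      pthread_barrier_init(&barrier, NULL, P);
      for (long t = 0; t < P; t++)
          pthread_create(&tid[t], NULL, worker, (void *)t);
      for (int t = 0; t < P; t++)
          pthread_join(tid[t], NULL);
      printf("sum = %f\n", sum);
      return 0;
  }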
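
A minimal MPI rendering of the message-passing pseudocode above; each
rank owns N/P elements and non-zero ranks send their partial sum to
rank 0, matching the send/recv structure in the notes.  N and the
initial values are made up.  (MPI_Reduce would do the combine in one
call; the explicit loop mirrors the pseudocode.)

  /* MPI sketch of the message-passing version (assumed details). */
  #include <mpi.h>
  #include <stdio.h>

  #define N 1024

  int main(int argc, char **argv) {
      int mynode, P;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
      MPI_Comm_size(MPI_COMM_WORLD, &P);

      int n_local = N / P;                 /* assume P divides N */
      float A[n_local], B[n_local], C[n_local], sum = 0.0f;
      for (int i = 0; i < n_local; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }

      for (int i = 0; i < n_local; i++) {  /* local compute phase */
          A[i] = (A[i] + B[i]) * C[i];
          sum = sum + A[i];
      }

      if (mynode != 0) {                   /* everyone sends to node 0 */
          MPI_Send(&sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      } else {                             /* node 0 accumulates */
          float tmp;
          for (int i = 1; i < P; i++) {
              MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              sum = sum + tmp;
          }
          printf("sum = %f\n", sum);
      }
      MPI_Finalize();
      return 0;
  }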
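
A self-contained version of the addmatrix.c fragment from the OpenMP
section; xdim, ydim, and the initialization are illustrative, not from
the original file.  Build with gcc -fopenmp.

  /* Complete addmatrix-style program (assumed sizes and data). */
  #include <stdio.h>

  #define XDIM 512
  #define YDIM 512

  float A[XDIM][YDIM], B[XDIM][YDIM], C[XDIM][YDIM];

  int main(void) {
      int i, j, xdim = XDIM, ydim = YDIM;
      for (i = 0; i < xdim; i++)
          for (j = 0; j < ydim; j++) { B[i][j] = i; C[i][j] = j; }

      /* does A = B + C; outer-loop iterations are independent, so
         OpenMP hands them out to the thread team */
      #pragma omp parallel for \
          shared(A, B, C, xdim, ydim) \
          private(i, j) \
          schedule(dynamic)
      for (j = 0; j < ydim; j++)
          for (i = 0; i < xdim; i++)
              A[i][j] = B[i][j] + C[i][j];

      printf("A[1][2] = %f\n", A[1][2]);
      return 0;
  }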
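
A sequential C sketch of the Hillis & Steele parallel prefix and the
non-zero "pack" built on it.  The double buffer stands in for the
synchronous "for all k in parallel" semantics (every step must read the
old values); arrays here are 0-indexed, so the packed slot is
idx[i] - 1, and all names and values are illustrative.

  /* Parallel prefix + pack, simulated sequentially (assumed details). */
  #include <stdio.h>
  #include <string.h>

  #define N 8   /* power of two, for simplicity */

  /* inclusive prefix sum, Hillis & Steele style: log2(N) steps,
     each step adds the element 2^(j-1) positions to the left */
  void prefix_add(int x[N]) {
      int old[N];
      for (int d = 1; d < N; d *= 2) {     /* d = 2^(j-1) */
          memcpy(old, x, sizeof(old));     /* snapshot = read phase */
          for (int k = d; k < N; k++)      /* "for all k in parallel" */
              x[k] = old[k - d] + old[k];
      }
  }

  int main(void) {
      int x[N] = {3, 0, 7, 0, 0, 5, 0, 2};
      int nz[N], idx[N], y[N];

      for (int i = 0; i < N; i++)          /* everyone tests x[i] */
          nz[i] = (x[i] != 0);

      memcpy(idx, nz, sizeof(idx));
      prefix_add(idx);                     /* idx[i] = # non-zeros up to i */

      for (int i = 0; i < N; i++)          /* scatter into packed slot */
          if (nz[i])
              y[idx[i] - 1] = x[i];        /* idx is a 1-based count */

      for (int i = 0; i < idx[N - 1]; i++)
          printf("%d ", y[i]);             /* prints: 3 7 5 2 */
      printf("\n");
      return 0;
  }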