--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Outline
* Programming Models
* OpenMP (see 838 notes?)
* Hillis & Steele

--------------------------------
Programming Models

REVIEW {

Definitions
  Thread  -- PC
  Process -- address space

Basic Programming Models (draw pictures)
  Multitasking:    n x (1 thread/process) w/o communication
  Message-Passing: n x (1 thread/process) w/ messages
  Shared Memory:   n threads in 1 process (how communicate?)
  Shared Memory':  n x (1 thread/process) w/ "System V" shared memory
  Sequential:      1 thread/process with parallelizing software
  Data Parallel:   1 thread/process with data-parallel ops
                   (generalization of SIMD)
}

Simple Problem

  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]

Split the loops -- the first then has independent iterations:

  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
  for i = 1 to N
    sum = sum + A[i]

Data flow graph?

In a perfect world, the compiler would parallelize this.
Works for this case, but not robust.

Mention SPMD -- Single Program Multiple Data
* Write one program
* Copy onto many threads/nodes
* Read a key variable (e.g., node ID) to do different work:
  same work on different data, or completely different work

Shared Memory (Pthreads sketch at the end of these notes)

  private int i, my_start, my_end, mynode;
  shared float A[N], B[N], C[N], sum;

  for i = my_start to my_end
    A[i] = (A[i] + B[i]) * C[i]
  GLOBAL_SYNCH;
  if (mynode == 0)
    for i = 1 to N
      sum = sum + A[i]

Message Passing (MPI sketch at the end of these notes)

  int i, my_start, my_end, mynode;
  float A[N/P], B[N/P], C[N/P], sum;

  for i = 1 to N/P
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]
  if (mynode != 0)
    send(sum, 0);
  if (mynode == 0)
    for i = 1 to P-1
      recv(tmp, i)
      sum = sum + tmp

Data Parallel

  shared float A[N], B[N], C[N];
  (optional directives on data layout)

  A = (A + B) * C;
  sum = reduce_add(A);

* Optional directives on data layout
* Often "owner computes"
* Compiler manages communication
* Performance disappointing

--------------------------------
OpenMP

Simple use (see documentation for richer use):
1. Write sequential program (with an eye to parallelization)
2. Add OMP directives to parallelize key loops

Example from addmatrix.c (complete program at the end of these notes):

  // does A = B + C (write without pragma first)
  // handle the outer loop in parallel
  #pragma omp parallel for \
      shared(A,B,C,xdim,ydim) \
      private(i,j) \
      schedule(dynamic)
  for (j = 0; j < ydim; j++)
    for (i = 0; i < xdim; i++)
      A[i][j] = B[i][j] + C[i][j];

--------------------------------
Hillis & Steele, "Data Parallel Algorithms," CACM, Dec. 1986

Sum of x[1..n] in log2(n) steps (tree reduction):

  for j = 1 to log2(n)
    for all k in parallel
      if (k mod 2^j == 0)
        x[k] = x[k - 2^(j-1)] + x[k]

Parallel prefix: all partial sums in the same log2(n) steps.

  for j = 1 to log2(n)
    for all k in parallel
      if (k > 2^(j-1))            *** only change ***
        x[k] = x[k - 2^(j-1)] + x[k]

Works for any associative operation (e.g., MAX). Very powerful.

Say we want to gather the non-zero elements of x[] into y[]
(C sketch at the end of these notes):
* Everyone tests whether their x[i] is non-zero and sets nz[i]
* Do a parallel-prefix add on nz[i] into idx[i]
  /* idx[i] gives the number of non-zero elements up to x[i] */
* if (nz[i] == 1) y[idx[i]] = x[i]

Paper goes on to many others.

Issues from Reviews
* VP abstraction -- Tony
* Flexible communication? -- Daniel
* Count/enumerate in unit time? -- Shijin
* Idle VPs during tree calculation -- Aditya
* Linked list -- David, Marc, Eric (+jump ptrs, -caching, -communication)
* Combinator reduction -- Syed (forget it)
* Future? -- Asim (leave to end; cf. CM-5 & Nvidia CUDA)
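
--------------------------------
Sketches in C

A minimal Pthreads rendering of the shared-memory pseudocode above;
P = 4, N, the block partitioning, and the data values are illustrative
assumptions, and GLOBAL_SYNCH becomes a pthread barrier.  Build with
gcc -pthread.

  /* Pthreads sketch of the shared-memory version (assumed details). */
  #include <pthread.h>
  #include <stdio.h>

  #define N 1024
  #define P 4

  float A[N], B[N], C[N], sum;     /* shared among all threads */
  pthread_barrier_t barrier;

  void *worker(void *arg) {
      int mynode   = (int)(long)arg;       /* private per thread */
      int my_start = mynode * (N / P);     /* assume P divides N */
      int my_end   = my_start + (N / P);

      for (int i = my_start; i < my_end; i++)
          A[i] = (A[i] + B[i]) * C[i];

      pthread_barrier_wait(&barrier);      /* GLOBAL_SYNCH */

      if (mynode == 0)                     /* node 0 reduces serially */
          for (int i = 0; i < N; i++)
              sum = sum + A[i];
      return NULL;
  }

  int main(void) {
      pthread_t tid[P];
      for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }
      pthread_barrier_init(&barrier, NULL, P);
      for (long t = 0; t < P; t++)
          pthread_create(&tid[t], NULL, worker, (void *)t);
      for (int t = 0; t < P; t++)
          pthread_join(tid[t], NULL);
      printf("sum = %f\n", sum);
      return 0;
  }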
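
A minimal MPI rendering of the message-passing pseudocode above; each
rank owns N/P elements and non-zero ranks send their partial sum to
rank 0, matching the send/recv structure in the notes.  N and the
initial values are made up.  (MPI_Reduce would do the combine in one
call; the explicit loop mirrors the pseudocode.)

  /* MPI sketch of the message-passing version (assumed details). */
  #include <mpi.h>
  #include <stdio.h>

  #define N 1024

  int main(int argc, char **argv) {
      int mynode, P;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
      MPI_Comm_size(MPI_COMM_WORLD, &P);

      int n_local = N / P;                 /* assume P divides N */
      float A[n_local], B[n_local], C[n_local], sum = 0.0f;
      for (int i = 0; i < n_local; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }

      for (int i = 0; i < n_local; i++) {  /* local compute phase */
          A[i] = (A[i] + B[i]) * C[i];
          sum = sum + A[i];
      }

      if (mynode != 0) {                   /* everyone sends to node 0 */
          MPI_Send(&sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      } else {                             /* node 0 accumulates */
          float tmp;
          for (int i = 1; i < P; i++) {
              MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              sum = sum + tmp;
          }
          printf("sum = %f\n", sum);
      }
      MPI_Finalize();
      return 0;
  }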
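
A self-contained version of the addmatrix.c fragment from the OpenMP
section; xdim, ydim, and the initialization are illustrative, not from
the original file.  Build with gcc -fopenmp.

  /* Complete addmatrix-style program (assumed sizes and data). */
  #include <stdio.h>

  #define XDIM 512
  #define YDIM 512

  float A[XDIM][YDIM], B[XDIM][YDIM], C[XDIM][YDIM];

  int main(void) {
      int i, j, xdim = XDIM, ydim = YDIM;
      for (i = 0; i < xdim; i++)
          for (j = 0; j < ydim; j++) { B[i][j] = i; C[i][j] = j; }

      /* does A = B + C; outer-loop iterations are independent, so
         OpenMP hands them out to the thread team */
      #pragma omp parallel for \
          shared(A, B, C, xdim, ydim) \
          private(i, j) \
          schedule(dynamic)
      for (j = 0; j < ydim; j++)
          for (i = 0; i < xdim; i++)
              A[i][j] = B[i][j] + C[i][j];

      printf("A[1][2] = %f\n", A[1][2]);
      return 0;
  }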
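
A sequential C sketch of the Hillis & Steele parallel prefix and the
non-zero "pack" built on it.  The double buffer stands in for the
synchronous "for all k in parallel" semantics (every step must read the
old values); arrays here are 0-indexed, so the packed slot is
idx[i] - 1, and all names and values are illustrative.

  /* Parallel prefix + pack, simulated sequentially (assumed details). */
  #include <stdio.h>
  #include <string.h>

  #define N 8   /* power of two, for simplicity */

  /* inclusive prefix sum, Hillis & Steele style: log2(N) steps,
     each step adds the element 2^(j-1) positions to the left */
  void prefix_add(int x[N]) {
      int old[N];
      for (int d = 1; d < N; d *= 2) {     /* d = 2^(j-1) */
          memcpy(old, x, sizeof(old));     /* snapshot = read phase */
          for (int k = d; k < N; k++)      /* "for all k in parallel" */
              x[k] = old[k - d] + old[k];
      }
  }

  int main(void) {
      int x[N] = {3, 0, 7, 0, 0, 5, 0, 2};
      int nz[N], idx[N], y[N];

      for (int i = 0; i < N; i++)          /* everyone tests x[i] */
          nz[i] = (x[i] != 0);

      memcpy(idx, nz, sizeof(idx));
      prefix_add(idx);                     /* idx[i] = # non-zeros up to i */

      for (int i = 0; i < N; i++)          /* scatter into packed slot */
          if (nz[i])
              y[idx[i] - 1] = x[i];        /* idx is a 1-based count */

      for (int i = 0; i < idx[N - 1]; i++)
          printf("%d ", y[i]);             /* prints: 3 7 5 2 */
      printf("\n");
      return 0;
  }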