--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012  Section 1
Instructor: Mark D. Hill
--------------------------------------------------------------------

Outline
* Methods -- Speedup
* SPLASH-2
* (PARSEC)
* Hill & Marty Talk

-------------------
SPEEDUP -- need to beef up

Speedup(P) = Speed(P)/Speed(1) = Time(1)/Time(P)

Actually this is "problem-constrained" speedup, i.e., Strong Scaling.

What if we use the faster machine to solve a bigger problem?
Can't use Time(1)/Time-Big(P) -- the run time is held roughly constant, so that ratio is ~1.
Use Work(P)/Work(1) instead, i.e., Weak Scaling.
What is work?  **MORE HERE**

Aside: Cost-Effective Parallel Computing
Isn't Speedup(C) < C inefficient?  (C = #cores)
Much of a computer's cost is OUTSIDE the processor [Wood & Hill, IEEE Computer 2/1995]
Let Costup(C) = Cost(C)/Cost(1)
Parallel computing is cost-effective when Speedup(C) > Costup(C)
1995 SGI PowerChallenge w/ 500MB: Costup(32) = 8.6
Multicores have even lower Costups!!!

--------------------
NOT USED
Back to Multicore -- Helps Power, (Over)Simply
Performance1 = 1 Core x Frequency
Power1       = 1 Core x Frequency x Voltage^2

Duplicate Cores & Halve Frequency
Performance2 = 2 Cores x Frequency/2 = Performance1
Power2       = 2 Cores x Frequency/2 x Voltage^2 = Power1

Then Scale Voltage
Performance2' = Performance1
Power2'       = 2 Cores x Frequency/2 x (Voltage/k)^2 = Power1/k^2
where k may be 2

Leftovers from Sutter & Larus
How important is determinism? (David Hinkemeyer)
Functional programming & parallelism (Marc deK)
VHDL/Verilog (Aditya Godse)

--------------------
Woo et al., SPLASH-2, ISCA 1995.

Emphasis Points from Reviews:
* False Sharing - Syed Gilani, Tony Gregerson, Marc deK, Asim Kadav
* Working Sets - Andy Nere, Guoliang Jin
* Comp/Comm Ratio - Brian Hickmann
* Know Apps - Shijin Kong
* Increasing memory gap & PRAM - Eric Reiss

The paper presents SPLASH-2, characterizes the programs, & gives some methodological advice.

SPLASH-2 Suite (Section 3)
8 applications & 4 kernels

Concurrency & Load Balance (Section 4)
Figure 1 gives PRAM speedups.
All good, except LU, Cholesky, Radix, and Radiosity.
Figure 2 shows LU, Cholesky, & Radiosity do too much synchronization (for this data size).
Radix's problem is its tree-like parallel-prefix component.

Working Sets (Section 5)
Have "knees" (e.g., Figure 3 shows Barnes has 3).
Very important for scaling whether the appropriate working set fits in or out of the caches.

Communication-to-Computation Ratio (Section 6)
Aside: communication in shared memory is implicit via cache coherence.
If processor R reads data last written by processor W, 4-case cross-product:
* at processor R, the block is either invalid or not present, crossed with
* at processor W, the block is either modified or not present (I simplify)
In all four cases, processor R effectively takes a cache miss.
Adding processors without increasing problem size will usually increase the ratio.
Must pay attention to this in methods.

Spatial Locality & False Sharing (Section 7)
With perfect spatial locality, doubling the block size halves the miss ratio.
Actual results are mixed.
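To make the false-sharing point concrete, here is a minimal C/pthreads sketch (mine, not from the paper): two threads increment logically independent counters, once packed into the same cache block and once padded to separate blocks. The block size LINE = 64 and the iteration count are assumptions for illustration.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL     /* increments per thread (illustrative) */
#define LINE  64              /* assumed cache-block size in bytes */

/* One counter per cache block: alignment keeps neighbors in separate blocks. */
struct padded { _Alignas(LINE) volatile uint64_t count; };

static volatile uint64_t shared_line[2];  /* adjacent words: false sharing */
static struct padded     padded_line[2];  /* one block per counter */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *worker(void *arg)
{
    volatile uint64_t *c = arg;
    for (uint64_t i = 0; i < ITERS; i++)
        (*c)++;               /* each store may invalidate the other thread's copy */
    return NULL;
}

static void run(const char *name, volatile uint64_t *c0, volatile uint64_t *c1)
{
    pthread_t t0, t1;
    double start = now_sec();
    pthread_create(&t0, NULL, worker, (void *)c0);
    pthread_create(&t1, NULL, worker, (void *)c1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%-14s %.2f s\n", name, now_sec() - start);
}

int main(void)
{
    run("false sharing:", &shared_line[0], &shared_line[1]);
    run("padded:",        &padded_line[0].count, &padded_line[1].count);
    return 0;
}

Compile with gcc -O2 -pthread. On most multicores the falsely shared run is noticeably slower even though no data is logically shared: the coherence protocol ping-pongs the block exactly as in the four-case miss analysis above.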
--------------------------------
Programming Models

Definitions
Thread  -- PC
Process -- address space

Basic Programming Models (draw pictures)
Multitasking:    n x (1 thread/process) w/o communication
Message-Passing: n x (1 thread/process) w/ messages
Shared Memory:   n threads in 1 process (how do they communicate?)
Shared Memory':  n x (1 thread/process) w/ "System V" shared memory
Sequential:      1 thread/process with parallelizing software
Data Parallel:   1 thread/process with data-parallel ops (generalization of SIMD)

Simple Problem (NOT USED YET)
  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]

Split the loop -- the first loop's iterations are independent:
  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
  for i = 1 to N
    sum = sum + A[i]

Data flow graph?

Shared Memory
  private int i, my_start, my_end, mynode;
  shared float A[N], B[N], C[N], sum;
  for i = my_start to my_end
    A[i] = (A[i] + B[i]) * C[i]
  GLOBAL_SYNCH;
  if (mynode == 0)
    for i = 1 to N
      sum = sum + A[i]

Message Passing
  int i, my_start, my_end, mynode;
  float A[N/P], B[N/P], C[N/P], sum, tmp;
  for i = 1 to N/P
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]
  if (mynode != 0)
    send(sum, 0);
  if (mynode == 0)
    for i = 1 to P-1
      recv(tmp, i)
      sum = sum + tmp

(A runnable pthreads version of the shared-memory variant follows below.)
----------------
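Below is one way to turn the shared-memory pseudocode into runnable C with pthreads -- a sketch under assumptions, not the course's reference code: GLOBAL_SYNCH becomes a pthread barrier, "shared" variables become globals, "private" ones live on each thread's stack, and N, P, and the input values are made up for illustration.

#include <pthread.h>
#include <stdio.h>

#define N 1024          /* problem size (illustrative) */
#define P 4             /* number of threads (illustrative) */

static float A[N], B[N], C[N], sum;   /* shared */
static pthread_barrier_t barrier;     /* plays the role of GLOBAL_SYNCH */

static void *work(void *arg)
{
    int mynode   = (int)(long)arg;            /* private */
    int my_start = mynode * (N / P);
    int my_end   = (mynode == P - 1) ? N : my_start + N / P;

    for (int i = my_start; i < my_end; i++)   /* independent iterations */
        A[i] = (A[i] + B[i]) * C[i];

    pthread_barrier_wait(&barrier);           /* GLOBAL_SYNCH */

    if (mynode == 0)                          /* serial reduction on node 0 */
        for (int i = 0; i < N; i++)
            sum = sum + A[i];
    return NULL;
}

int main(void)
{
    pthread_t t[P];

    for (int i = 0; i < N; i++) {             /* made-up inputs */
        A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.5f;
    }
    pthread_barrier_init(&barrier, NULL, P);

    for (long p = 0; p < P; p++)
        pthread_create(&t[p], NULL, work, (void *)p);
    for (int p = 0; p < P; p++)
        pthread_join(t[p], NULL);

    printf("sum = %f\n", sum);                /* (1+2)*0.5 per element => 1536 */
    pthread_barrier_destroy(&barrier);
    return 0;
}

As in the pseudocode, the reduction is left serial on node 0; a parallel version would need per-thread partial sums or a lock, which is exactly the point the message-passing variant makes explicit with its send/recv of partial sums.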