--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

---------
MPI, etc.
---------

Download Jayaram's Slides
Write Outline on Board
* HW1 Due
* Message-Passing Overview
* MPI
* MPI Programming Discussion
* Optional from 757 PP notes

Message-Passing Overview (15 minutes?)

Draw a single-threaded program making memory references to its own address space.
Duplicate this (and imply replication).

Okay, but what if you want to communicate?
* Write data into a buffer
* SEND data in buffer
* To whom? Need a namespace for processes (including yourself)
  (Add this to picture)
* RECEIVE allocates a buffer to catch the data. Either:
  + Waits for data to arrive
  + Or data has already arrived (who buffers?)

Consider: Namespace, Operations, Ordering

Namespace
* (process, address)

Operations
* "Normal" operations w/i a process
* Send/receive between processes

Ordering
* Process and message ordering is the automatic/only "synchronization"
* Example

      P1          P2
      A           B
      C ---+
           +----> D
      E           F

  Know A->C->E & B->D->F & C->D
  Implies, e.g., A->D (by transitivity)
  Don't know ordering of A & B, C & B, or even E & B, etc.

Could write a separate program for each process
* Instead, use Single Program Multiple Data (SPMD) (twist on SIMD)
  + case statement on rank still allows an arbitrary per-process program
  + Usually rank is used to partition data and work

------------------------------------------
*** Jayaram Bobba's Slides on MPI ***
------------------------------------------

MPI Programming Discussion

Message latency & bandwidth depend on the implementation

Extremes
* user-level shared-memory library between threads in a single address space
* Library -> OS -> TCP -> IP -> I/O bus -> I/O card -> Ethernet -> and back up

Communication Cost =
    Frequency * (Overhead + Latency + Size/Bandwidth - Overlap)

Overhead is work done "inline" by the processor for send/receive
Latency is additional delay for the first byte

Often a fixed, large overhead/latency for the first byte
==> Amortize with "large" messages

Hide latency by
* Sending "early"
* Receiving "late"

------------------------------------------

Scaling
-------
[Notes "Parallel Programming" 50-54]

ALREADY DONE {
  Problem-Constrained Speedup(P) = Time(1)/Time(P)
  Time-Constrained SpeedupTC(P) = Work(P)/Work(1)
  Also memory-constrained
  Scaling down -- working-set knees & computation-to-communication ratio
}

Parallel Programming
-------
[Notes "Parallel Programming" 6]

Naming/Operations/Ordering/Performance

Go back to sequential, data parallel, shared memory, and message passing.

Naming - how is communicated data named & how do partner nodes reference it?
Operations - what operations are allowed on named data
Ordering - how do producers/consumers coordinate and what's implicit
Performance - latency & BW

Creating a Parallel Program
----
[Notes "Parallel Programming" 10
 Notes "Parallel Programming" 30+]

Partition work, balance it, minimize communication, reduce extra work, tolerate latency -- these goals are in conflict.

Data vs. functional parallelism

Granularity of "TASK" (task != thread)
* large tasks: worse load balancing, lower overhead, less contention, less communication
* small tasks: too much synch, too much comm., too much mgmt overhead

Approach
* Static -- lower overhead but can't adapt
* Dynamic -- converse
  E.g., central work queue, or per-"node" work queues with work stealing (Cilk)

Granularity of Synchronization (already done; sketch below)
* Whole B-Tree
* Each B-Tree node (path)
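A minimal sketch of the two granularities, assuming pthreads and using a plain binary search tree as a stand-in for the B-Tree; the names (lookup_coarse, lookup_fine) and the hand-over-hand scheme are illustrative, not from the course notes.

    /* Coarse- vs. fine-grained synchronization on a search tree. */
    #include <pthread.h>
    #include <stddef.h>

    struct node {
        int key;
        struct node *left, *right;
        pthread_mutex_t lock;           /* fine-grained: one lock per node  */
    };                                  /* (assumed initialized at insert)  */

    static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;  /* whole tree */

    /* Coarse-grained: one lock serializes every operation on the structure. */
    struct node *lookup_coarse(struct node *root, int key)
    {
        pthread_mutex_lock(&tree_lock);
        struct node *n = root;
        while (n && n->key != key)
            n = (key < n->key) ? n->left : n->right;
        pthread_mutex_unlock(&tree_lock);
        return n;                       /* sketch: ignores node-lifetime issues */
    }

    /* Fine-grained: hand-over-hand locking down the search path, so lookups
       on disjoint paths can proceed in parallel. */
    struct node *lookup_fine(struct node *root, int key)
    {
        if (!root) return NULL;
        pthread_mutex_lock(&root->lock);
        struct node *n = root;
        while (n && n->key != key) {
            struct node *next = (key < n->key) ? n->left : n->right;
            if (next) pthread_mutex_lock(&next->lock);   /* grab child first ... */
            pthread_mutex_unlock(&n->lock);              /* ... then drop parent */
            n = next;
        }
        if (n) pthread_mutex_unlock(&n->lock);
        return n;
    }

The tradeoff mirrors the task-granularity list above: the whole-tree lock has lower overhead but serializes everything, while per-node locks pay a locking cost at every step but allow concurrency on disjoint paths.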
What is the synchronization namespace?
* Shared memory -- all words (see the issue if barrier hardware is added)
* Message-passing -- messages w/ node IDs and tags (MPI sketch below)
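A minimal SPMD sketch of that namespace in MPI's C bindings: messages are addressed by (communicator, rank, tag), and the blocking receive is the only synchronization. The tag value (42) and the rank-0-sends-to-everyone pattern are illustrative choices, not anything prescribed by the notes.

    /* Every process runs this same program; rank selects its role (SPMD). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, data = 0;
        const int TAG = 42;                    /* arbitrary illustrative tag */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my node ID in the namespace */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {                       /* "case statement on rank" */
            data = 757;
            for (int dest = 1; dest < nprocs; dest++)
                MPI_Send(&data, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
        } else {
            /* Blocks until the matching (source, tag) message arrives:
               send-before-receive is the implicit synchronization. */
            MPI_Recv(&data, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank %d got %d from rank 0\n", rank, data);
        }

        MPI_Finalize();
        return 0;
    }

Typically built with mpicc and launched with mpiexec (e.g., mpiexec -n 4 ./a.out), so the same binary becomes N communicating processes.

--------------------------------------------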