--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

---------
MPI, etc.
---------

Download Jayaram's Slides
Write Outline on Board
* HW1 Due
* Message-Passing Overview
* MPI
* MPI Programming Discussion
* Optional from 757 PP notes

Message-Passing Overview (15 minutes?)

Draw a single-threaded program making memory references to its own address space.
Duplicate this (and imply replication).

Okay, but what if you want to communicate?
* Write data into a buffer
* SEND data in buffer
* To whom? Need a namespace for processes (including yourself)
  (Add this to picture)
* RECEIVE allocates a buffer to catch the data. Either:
  + Waits for data to arrive
  + Or data has already arrived (who buffers?)

Consider: Namespace, Operations, Ordering

Namespace
* (process, address)

Operations
* "Normal" operations w/i a process
* Send/receive between processes

Ordering
* Process and message ordering is the automatic/only "synchronization"
* Example

      P1          P2
      A           B
      C ---+
           +----> D
      E           F

  Know A->C->E & B->D->F & C->D
  Implies, e.g., A->D (by transitivity)
  Don't know ordering of A & B, C & B, or even E & B, etc.

Could write a separate program for each process
* Instead, use Single Program Multiple Data (SPMD) (twist on SIMD)
  + case statement on rank still allows an arbitrary per-process program
  + Usually rank is used to partition data and work

------------------------------------------
*** Jayaram Bobba's Slides on MPI ***
------------------------------------------

MPI Programming Discussion

Message latency & bandwidth depend on the implementation

Extremes
* user-level shared-memory library between threads in a single address space
* Library -> OS -> TCP -> IP -> I/O bus -> I/O card -> Ethernet -> and back up

Communication Cost =
    Frequency * (Overhead + Latency + Size/Bandwidth - Overlap)

Overhead is work done "inline" by the processor for send/receive
Latency is additional delay for the first byte

Often a fixed, large overhead/latency for the first byte
==> Amortize with "large" messages

Hide latency by
* Sending "early"
* Receiving "late"

------------------------------------------

Scaling
-------
[Notes "Parallel Programming" 50-54]

ALREADY DONE {
  Problem-Constrained Speedup(P) = Time(1)/Time(P)
  Time-Constrained SpeedupTC(P) = Work(P)/Work(1)
  Also memory-constrained
  Scaling down -- working-set knees & computation-to-communication ratio
}

Parallel Programming
-------
[Notes "Parallel Programming" 6]

Naming/Operations/Ordering/Performance

Go back to sequential, data parallel, shared memory, and message passing.

Naming - how is communicated data named & how do partner nodes reference it?
Operations - what operations are allowed on named data
Ordering - how do producers/consumers coordinate and what's implicit
Performance - latency & BW

Creating a Parallel Program
----
[Notes "Parallel Programming" 10
 Notes "Parallel Programming" 30+]

Partition work, balance it, minimize communication, reduce extra work, tolerate latency -- these goals are in conflict.

Data vs. functional parallelism

Granularity of "TASK" (task != thread)
* large tasks: worse load balancing, lower overhead, less contention, less communication
* small tasks: too much synch, too much comm., too much mgmt overhead

Approach
* Static -- lower overhead but can't adapt
* Dynamic -- converse
  E.g., central work queue, or per-"node" work queues with work stealing (Cilk)

Granularity of Synchronization (already done; sketch below)
* Whole B-Tree
* Each B-Tree node (path)
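A minimal sketch of the two granularities, assuming pthreads and using a plain binary search tree as a stand-in for the B-Tree; the names (lookup_coarse, lookup_fine) and the hand-over-hand scheme are illustrative, not from the course notes.

    /* Coarse- vs. fine-grained synchronization on a search tree. */
    #include <pthread.h>
    #include <stddef.h>

    struct node {
        int key;
        struct node *left, *right;
        pthread_mutex_t lock;           /* fine-grained: one lock per node  */
    };                                  /* (assumed initialized at insert)  */

    static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;  /* whole tree */

    /* Coarse-grained: one lock serializes every operation on the structure. */
    struct node *lookup_coarse(struct node *root, int key)
    {
        pthread_mutex_lock(&tree_lock);
        struct node *n = root;
        while (n && n->key != key)
            n = (key < n->key) ? n->left : n->right;
        pthread_mutex_unlock(&tree_lock);
        return n;                       /* sketch: ignores node-lifetime issues */
    }

    /* Fine-grained: hand-over-hand locking down the search path, so lookups
       on disjoint paths can proceed in parallel. */
    struct node *lookup_fine(struct node *root, int key)
    {
        if (!root) return NULL;
        pthread_mutex_lock(&root->lock);
        struct node *n = root;
        while (n && n->key != key) {
            struct node *next = (key < n->key) ? n->left : n->right;
            if (next) pthread_mutex_lock(&next->lock);   /* grab child first ... */
            pthread_mutex_unlock(&n->lock);              /* ... then drop parent */
            n = next;
        }
        if (n) pthread_mutex_unlock(&n->lock);
        return n;
    }

The tradeoff mirrors the task-granularity list above: the whole-tree lock has lower overhead but serializes everything, while per-node locks pay a locking cost at every step but allow concurrency on disjoint paths.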
What is the synchronization namespace?
* Shared memory -- all words (see the issue if barrier hardware is added)
* Message-passing -- messages w/ node IDs and tags (MPI sketch below)
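A minimal SPMD sketch of that namespace in MPI's C bindings: messages are addressed by (communicator, rank, tag), and the blocking receive is the only synchronization. The tag value (42) and the rank-0-sends-to-everyone pattern are illustrative choices, not anything prescribed by the notes.

    /* Every process runs this same program; rank selects its role (SPMD). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, data = 0;
        const int TAG = 42;                    /* arbitrary illustrative tag */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my node ID in the namespace */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {                       /* "case statement on rank" */
            data = 757;
            for (int dest = 1; dest < nprocs; dest++)
                MPI_Send(&data, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
        } else {
            /* Blocks until the matching (source, tag) message arrives:
               send-before-receive is the implicit synchronization. */
            MPI_Recv(&data, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank %d got %d from rank 0\n", rank, data);
        }

        MPI_Finalize();
        return 0;
    }

Typically built with mpicc and launched with mpiexec (e.g., mpiexec -n 4 ./a.out), so the same binary becomes N communicating processes.

--------------------------------------------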