--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Dataflow

Outline
  Basic Idea
  (Tagged-Token) Dataflow (long)
  Explicit Token Store (short)
  (Wavescalar (elsewhere))
  Wrapup
  Dataflow on Objects

Basic Idea
----------

Dataflow Architectures and Multithreading
Ben Lee and A.R. Hurson
IEEE Computer, August 1994

The ultimate latency tolerance is "dataflow"
  No PC
  View program as "graph"
    f = a*b + c*d
  subset of operations ready to "fire"
  pick one and execute
  enable new instruction to fire
  repeat until done

Plus:
  Lots of parallelism
    go fast
    hide latency
  Natural synchronization
  Deterministic (?)

First choice:
  Static:
    Each "node" active "one at a time"
    (like disallowing multiple dynamic instances of the same static
    instruction in an out-of-order processor with speculative execution)
    "Fire" node if tokens on inputs and NO TOKENS on outputs
  Dynamic:
    Allow multiple instances
    But must not confuse things, therefore, "tag" tokens
    but more complex

(Tagged-Token) Dataflow
-----------------------

Executing a Program on the MIT Tagged-Token Dataflow Architecture,
Arvind and R. S. Nikhil, IEEE Trans. on Computers, March 1990,
pp. 300-318. Reprinted in HJ&S pp. 323-341.

But what about?
(1) Code Reuse (e.g., loops & functions)
(2) Large data structures (arrays)
(3) Associative matching of tokens
(4) Memory (update) order

(1) Code Reuse (e.g., loops & functions)

  Get more parallelism by allowing loop iterations to run in parallel
  Do not want to require function "inlining"

  Idea: allow nodes (instructions) to be active in multiple contexts
  (think function and frame pointer)

  Incoming data becomes "token" with
    (context, destination-address, value, left|right)
    ((context, destination-address) is called the "tag")

  Tags must match to fire a dyadic node.
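
  A minimal sketch of tag matching, assuming a hypothetical three-node
  graph for f = a*b + c*d. The node names ("mul1", "mul2", "add"), the
  GRAPH table, and the run() function are illustrative assumptions, not
  taken from the Arvind & Nikhil paper. Tokens from two contexts are
  interleaved, and a node fires only when both operands with the same
  (context, destination-address) tag have arrived.

```python
# Hypothetical dataflow graph for f = a*b + c*d (names are assumptions).
# node address -> (operation, (destination address, port) or None for output)
GRAPH = {
    "mul1": (lambda l, r: l * r, ("add", "left")),
    "mul2": (lambda l, r: l * r, ("add", "right")),
    "add":  (lambda l, r: l + r, None),
}


def run(tokens):
    """Execute tokens of the form (context, dest_address, value, port).

    Returns a dict mapping each context to its value of f.
    """
    waiting = {}   # tag -> (port, value): tokens waiting for a partner
    results = {}   # context -> final result
    work = list(tokens)
    while work:
        context, dest, value, port = work.pop()
        tag = (context, dest)
        if tag not in waiting:
            # No partner yet: park this token under its tag.
            waiting[tag] = (port, value)
            continue
        # Tags match: the dyadic node can fire.
        _, other_value = waiting.pop(tag)
        if port == "left":
            left, right = value, other_value
        else:
            left, right = other_value, value
        op, target = GRAPH[dest]
        out = op(left, right)
        if target is None:
            results[context] = out
        else:
            # The new token keeps the same context as its inputs.
            work.append((context, target[0], out, target[1]))
    return results


if __name__ == "__main__":
    tokens = [
        (0, "mul1", 1, "left"), (1, "mul1", 5, "left"),
        (0, "mul1", 2, "right"), (1, "mul2", 7, "left"),
        (0, "mul2", 3, "left"), (1, "mul1", 6, "right"),
        (0, "mul2", 4, "right"), (1, "mul2", 8, "right"),
    ]
    print(run(tokens))
```

  The dictionary lookup on the tag stands in for the associative
  waiting-matching store; the order in which tokens are picked does not
  change the per-context results.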
  Show Figure 3 "animation"

  Actual implementations can batch nodes into "code blocks"

(2) Large data structures (arrays)

  Have tokens carry "references" to I-structures, not values

  I-structure is like a memory location with an empty/full bit,
  guaranteed to be written once (before reclamation)

                        write
                 +--------------->present
                 |                   ^
    Init-->absent+                   |write
                 |                   |
                 +------->waiting----+
                    read

(3) Associative matching of tokens

  Newly-created token does a logically associative lookup
  to find its "partner" for a dyadic instruction

  Implement with hashing?

  Big improvement: Explicit Token-Store Architecture (below)

(4) Memory (update) order

  Doesn't matter!
  I-Structures are write-once (single-assignment)
  Dataflow language Id is mostly functional (order doesn't matter)
  All kinds of cool PL stuff (e.g., partial application of functions)

  PL people love functional languages; no one else seems to agree

  Want dataflow for imperative languages?
  See: Washington WaveScalar (below)

Explicit Token-Store Architecture
---------------------------------

[Papadopoulos & Culler 1990]

Associate with a "block" of code (a function) some data memory (a frame)

token:
  ip points to instrn that includes
    operation
    frame offset
    one or more destination instructions

Loop
  grab a token
  lookup instrn
  look at frame+offset
  if partner value there
    then execute instrn
         create and store tokens for each destination instrn
    else store value there

Note frame locations have presence bit and left/right operand indicator

What happened?
  Kept von Neumann model
  Dataflow w/ window where synchronization space tractable
  E.g., physical registers use "single assignment"

OMIT {{{

Wavescalar (elsewhere)
----------------------

References:

WaveScalar, Steve Swanson, Ken Michelson, Andrew Schwerin and Mark Oskin,
International Symposium on Microarchitecture (MICRO-36), December 2003.

Threads on the Cheap: Multithreaded Execution in a WaveCache Processor,
Steve Swanson, Andrew Schwerin, Andrew Petersen, Mark Oskin and Susan Eggers,
ISCA Workshop on Complexity-effective Design, June 2004.

Key ideas:

(1) Order memory operations of a single "thread" with "waves"
    so that (potentially) conflicting operations are totally ordered.
    Effectively creates "program order" [Lamport]
    Parallelism through non-memory operations and memory disambiguation
    Issues with conditionals

(2) Spatially assign "code block" to tiled grid of processor elements
    (not a focus for CS/ECE 757)

*** Use MICRO-36 Slides here ***

(3) Second paper adds thread-id to memory order
    Appears per-thread "program orders" merged into global total order
    (like sequential consistency [Lamport])

Cf. UT TRIPS

}}}

Wrapup
------

Dataflow in out-of-order processors
  * CDC 6600 Scoreboarding like static dataflow
  * Rest like dynamic dataflow

vonNeumann has won for now
  * less overhead
  * dataflow inside
  * branch prediction
  * caching

Dataflow on Objects
-------------------

Dataflow Execution of Sequential Imperative Programs on Multicore Architectures
Gagan Gupta and Gurindar S. Sohi
MICRO 2011

Paper:
  ftp://ftp.cs.wisc.edu/sohi/papers/2011/MICRO_2011_Dataflow.pdf
  /afs/cs.wisc.edu/u/m/a/markhill/public/html/restricted/micro11_dataflow.pdf

Slide pptx: see email ca. Mar 29, 2012
  /afs/cs.wisc.edu/u/m/a/markhill/public/html/restricted/micro11_dataflow_talk.pdf
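
Aside: Explicit Token-Store matching loop
-----------------------------------------

Returning to the Explicit Token-Store loop in these notes, the following
is a minimal sketch of how the frame+offset lookup replaces associative
matching. It assumes each token carries a frame pointer alongside the
ip, and it uses a hypothetical three-instruction program for
f = a*b + c*d. PROGRAM, run_ets(), and the slot layout are illustrative
assumptions, not the Papadopoulos & Culler design.

```python
# Hypothetical instruction memory for f = a*b + c*d (an assumption).
# ip -> (operation, frame offset, list of (destination ip, port))
PROGRAM = {
    0: (lambda l, r: l * r, 0, [(2, "left")]),
    1: (lambda l, r: l * r, 1, [(2, "right")]),
    2: (lambda l, r: l + r, 2, []),
}


def run_ets(frames, tokens):
    """Run tokens of the form (frame_pointer, ip, value, port).

    frames maps a frame pointer to a list of slots; None means the
    presence bit is clear. Returns frame_pointer -> value of f.
    """
    results = {}
    work = list(tokens)
    while work:
        fp, ip, value, port = work.pop()           # grab a token
        op, offset, dests = PROGRAM[ip]            # look up the instruction
        slot = frames[fp][offset]                  # look at frame+offset
        if slot is None:
            # Partner not there: store value and its left/right indicator.
            frames[fp][offset] = (port, value)
            continue
        # Partner present: clear the slot and execute the instruction.
        frames[fp][offset] = None
        _, other = slot
        if port == "left":
            left, right = value, other
        else:
            left, right = other, value
        out = op(left, right)
        if not dests:
            results[fp] = out
        for dest_ip, dest_port in dests:
            # Create a token for each destination instruction.
            work.append((fp, dest_ip, out, dest_port))
    return results


if __name__ == "__main__":
    frames = {0: [None] * 3, 1: [None] * 3}
    tokens = [
        (0, 0, 1, "left"), (1, 0, 5, "left"),
        (0, 0, 2, "right"), (1, 1, 7, "left"),
        (0, 1, 3, "left"), (1, 0, 6, "right"),
        (0, 1, 4, "right"), (1, 1, 8, "right"),
    ]
    print(run_ets(frames, tokens))
```

The match is a direct index into the frame, so no hashing or associative
search is needed; each activation of the code block gets its own frame.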