--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Dataflow

Outline
   Basic Idea
   (Tagged-Token) Dataflow (long)
   Explicit Token Store (short)
   (WaveScalar (elsewhere))
   Wrapup
   Dataflow on Objects


Basic Idea
----------

Dataflow Architectures and Multithreading
Ben Lee and A.R. Hurson
IEEE Computer, August 1994


The ultimate latency tolerance is "dataflow" 
    No PC
    View program as "graph"
    f = a*b + c*d
    subset of operations ready to "fire"
    pick one and execute
    enable new instruction to fire
    repeat until done
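
The firing loop above can be sketched in a few lines of Python. This is only
an illustration, not any real machine's design: the graph encoding, the node
names (mul1, mul2, add, out), and the helpers run and compute_f are all
hypothetical, and each node is assumed to fire at most once.

```python
import operator
import random

# Hypothetical encoding of f = a*b + c*d:
# node name -> (operation, list of (destination node, input port)).
GRAPH = {
    "mul1": (operator.mul, [("add", 0)]),
    "mul2": (operator.mul, [("add", 1)]),
    "add":  (operator.add, [("out", 0)]),
}

def run(graph, initial_tokens, seed=None):
    """Fire ready nodes in arbitrary order until none remain."""
    rng = random.Random(seed)
    inputs = {name: {} for name in graph}   # operands that have arrived
    results = {}                            # values sent outside the graph

    def deliver(dest, port, value):
        if dest in graph:
            inputs[dest][port] = value
        else:
            results[dest] = value

    for (dest, port), value in initial_tokens.items():
        deliver(dest, port, value)

    while True:
        # A node is ready to fire when both of its operands are present.
        ready = [n for n in graph if len(inputs[n]) == 2]
        if not ready:
            return results
        node = rng.choice(ready)            # no PC: any ready node may fire
        op, dests = graph[node]
        operands = inputs[node]
        inputs[node] = {}                   # consume the input tokens
        value = op(operands[0], operands[1])
        for dest, port in dests:            # may enable new nodes to fire
            deliver(dest, port, value)

def compute_f(a, b, c, d, seed=None):
    tokens = {("mul1", 0): a, ("mul1", 1): b,
              ("mul2", 0): c, ("mul2", 1): d}
    return run(GRAPH, tokens, seed)["out"]
```

Whichever ready multiply is picked first, compute_f(2, 3, 4, 5) produces the
same value, because each node waits for its own operands.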

Plus:
    Lots of parallelism
	go fast
	hide latency
    Natural synchronization
    Deterministic (?)


First choice:

Static: Each "node" active "one at a time"  (like disallowing multiple
        dynamic instances of the same static instruction in an out-of-order
        processor with speculative execution)

	"Fire" node if tokens on inputs and NO TOKENS on outputs

Dynamic:  Allow multiple instances
          But must not confuse things,
	  therefore, "tag" tokens
	  but more complex
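
The static firing rule (tokens on all inputs, no tokens on outputs) can be
sketched as below. The Arc and Node classes and the two-node pipeline in
demo_static are hypothetical names made up for this illustration; None is
assumed to mean "no token", so None cannot be a data value here.

```python
class Arc:
    """An arc that holds at most one token (static dataflow)."""
    def __init__(self):
        self.token = None

    def full(self):
        return self.token is not None

class Node:
    def __init__(self, op, inputs, outputs):
        self.op, self.inputs, self.outputs = op, inputs, outputs

    def can_fire(self):
        # Static rule: every input arc full AND every output arc empty.
        return (all(a.full() for a in self.inputs)
                and not any(a.full() for a in self.outputs))

    def fire(self):
        args = [a.token for a in self.inputs]
        for a in self.inputs:
            a.token = None                  # consume input tokens
        result = self.op(*args)
        for a in self.outputs:
            a.token = result                # produce output tokens

def demo_static():
    x, m, y = Arc(), Arc(), Arc()
    inc = Node(lambda v: v + 1, [x], [m])
    dbl = Node(lambda v: v * 2, [m], [y])
    trace = []
    x.token = 1
    trace.append(inc.can_fire())   # inputs full, output empty
    inc.fire()
    x.token = 10                   # a second instance of inc wants to run
    trace.append(inc.can_fire())   # blocked: arc m still holds a token
    trace.append(dbl.can_fire())
    dbl.fire()
    trace.append(inc.can_fire())   # arc m drained, so inc may fire again
    inc.fire()
    trace.append(dbl.can_fire())   # blocked: arc y still holds a token
    return trace, y.token, m.token
```

The second call to inc.can_fire() shows the "one at a time" restriction: the
next instance of inc must wait until its consumer has drained the arc.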

(Tagged-Token) Dataflow
-----------------------

Executing a Program on the MIT Tagged-Token Dataflow Architecture,
Arvind and R. S. Nikhil, 
IEEE Trans. on Computers, March 1990, pp. 300-318.
Reprinted in HJ&S pp. 323-341.


But what about?
    (1) Code Reuse (e.g., loops & functions)
    (2) Large data structures (arrays)
    (3) Associative matching of tokens
    (4) Memory (update) order

(1) Code Reuse (e.g., loops & functions)


    Get more parallelism by allowing loop iterations to run in parallel
    Do not want to require function "inlining"

    Idea: allow nodes (instructions) to be active in multiple contexts
	 (think function and frame pointer)

	 Incoming data becomes "token" with
	 (context, destination-address, value, left|right)

	 (context, destination-address) is called the "tag"

	 Tags must match to fire dyadic node.

	 Show Figure 3 "animation"

	 Actual implementations can batch nodes into "code blocks"
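
Tag matching for a dyadic node can be sketched as follows. The Token layout
mirrors the (context, destination-address, value, left|right) tuple above;
the MatchingStore class and its dictionary-based lookup are assumptions made
for illustration, not the TTDA's actual matching hardware.

```python
from collections import namedtuple

# port is "left" or "right"
Token = namedtuple("Token", "context dest value port")

class MatchingStore:
    """Holds the first operand of each dyadic node until its partner arrives."""
    def __init__(self):
        self.waiting = {}                    # tag -> first token to arrive

    def arrive(self, token):
        tag = (token.context, token.dest)    # the "tag"
        partner = self.waiting.pop(tag, None)
        if partner is None:
            self.waiting[tag] = token        # no match yet: wait
            return None
        if partner.port == token.port:
            raise ValueError("two tokens for the same port of one tag")
        if partner.port == "left":
            return (partner.value, token.value)
        return (token.value, partner.value)  # (left, right) operands: fire
```

For example, tokens from two contexts (say, two loop iterations) aimed at the
same node "sub" stay separate because their tags differ:

    store = MatchingStore()
    store.arrive(Token(0, "sub", 10, "left"))    # waits
    store.arrive(Token(1, "sub", 20, "left"))    # waits, different context
    store.arrive(Token(1, "sub", 5, "right"))    # matches context 1
    store.arrive(Token(0, "sub", 3, "right"))    # matches context 0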


(2) Large data structures (arrays)

       Have tokens carry "references" to I-structure, not values

       I-structure is like memory location with empty/full bit,
       guaranteed to be written once (before reclamation)


                          write
                 +------------------> present
                 |                      ^
   Init --> absent                      | write
                 |                      |
                 +------> waiting ------+
                   read
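
The state diagram above can be sketched as a small class. The name
IStructureCell, the callback-style read, and the error raised on a second
write are assumptions made for illustration; a read that arrives early is
deferred and answered when the write finally happens.

```python
class IStructureCell:
    """One write-once location with absent / waiting / present states."""
    def __init__(self):
        self.state = "absent"
        self.value = None
        self.deferred = []          # readers that arrived before the write

    def read(self, consumer):
        """Deliver the value to consumer now, or defer until it is written."""
        if self.state == "present":
            consumer(self.value)
        else:
            self.state = "waiting"  # absent -> waiting (or stay waiting)
            self.deferred.append(consumer)

    def write(self, value):
        if self.state == "present":
            raise RuntimeError("I-structure cell written twice")
        self.value = value
        self.state = "present"      # absent or waiting -> present
        pending, self.deferred = self.deferred, []
        for consumer in pending:    # satisfy all deferred reads
            consumer(value)
```

A consumer here is any callable, e.g. got.append for a list got. In a real
machine the deferred entry would presumably be a token's tag rather than a
Python callback.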

(3) Associative matching of tokens

      Newly-created token does logically associative lookup
      to find "partner" to dyadic instruction

      Implement with hashing?

      Big improvement: Explicit Token-Store Architecture (below)


(4) Memory (update) order

       Doesn't matter!
       I-Structures are write-once (single-assignment)
       Dataflow language Id is mostly functional
	 (order doesn't matter)

       All kinds of cool PL stuff (e.g. partial application of functions)

       PL people love functional languages; no one else seems to agree

       Want dataflow for imperative languages?

       See: Washington WaveScalar (below)


Explicit Token-Store Architecture
---------------------------------

[Papadopoulos & Culler 1990]

Associate with a "block" of code (a function) some data memory (a frame)

token:	<value, instrn ptr (IP), frame ptr (fp)>

IP points to an instrn that includes
	operation
	frame offset
	one or more destination instructions

Loop
	grab a token
	look up instrn
	look at frame+offset
	if partner value there
	then
		execute instrn
		create and store tokens for each destination instrn
	else
		store value there

Note frame locations have presence bit and left/right operand indicator
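
The loop above can be sketched in Python as follows. Everything named here
(Instr, Token, EMPTY, run_ets, demo_ets) is hypothetical. The sketch assumes
the left/right indicator travels on the token and is recorded in the frame
slot with the waiting value, that a slot is cleared when its instruction
fires, and that a token whose IP is not in the program is a final output.

```python
import operator
from collections import deque, namedtuple

Instr = namedtuple("Instr", "op offset dests")   # dests: list of (ip, port)
Token = namedtuple("Token", "value ip fp port")  # port is "left" or "right"
EMPTY = object()                                 # presence bit cleared

def run_ets(program, frames, tokens):
    queue = deque(tokens)
    outputs = []
    while queue:
        tok = queue.popleft()                    # grab a token
        if tok.ip not in program:
            outputs.append((tok.ip, tok.fp, tok.value))
            continue
        instr = program[tok.ip]                  # look up instrn
        frame = frames[tok.fp]
        slot = frame[instr.offset]               # look at frame+offset
        if slot is EMPTY:
            frame[instr.offset] = (tok.port, tok.value)   # wait for partner
            continue
        port, partner = slot
        frame[instr.offset] = EMPTY              # free the slot
        if port == "left":
            result = instr.op(partner, tok.value)
        else:
            result = instr.op(tok.value, partner)
        for dest_ip, dest_port in instr.dests:   # one token per destination
            queue.append(Token(result, dest_ip, tok.fp, dest_port))
    return outputs

# f = a*b + c*d as one code block; frame offsets 0..2 hold waiting operands.
PROGRAM = {
    0: Instr(operator.mul, 0, [(2, "left")]),
    1: Instr(operator.mul, 1, [(2, "right")]),
    2: Instr(operator.add, 2, [("out", "left")]),
}

def demo_ets():
    """Run two activations (frames 0 and 1) of the same code block."""
    frames = {0: [EMPTY] * 3, 1: [EMPTY] * 3}
    tokens = []
    for fp, (a, b, c, d) in {0: (2, 3, 4, 5), 1: (1, 1, 1, 1)}.items():
        tokens += [Token(a, 0, fp, "left"), Token(b, 0, fp, "right"),
                   Token(c, 1, fp, "left"), Token(d, 1, fp, "right")]
    return run_ets(PROGRAM, frames, tokens)
```

The point of the design is that matching becomes a direct memory access at
frame+offset instead of an associative lookup on the tag.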


What happened?
	Kept von Neumann model
	Dataflow within an instruction window, where the synchronization space is tractable
	E.g., physical registers use "single assignment"

OMIT {{{

WaveScalar (elsewhere)
----------------------

References:

WaveScalar,
Steve Swanson, Ken Michelson, Andrew Schwerin and Mark Oskin,
International Symposium on Microarchitecture (MICRO-36), December 2003.

Threads on the Cheap: Multithreaded Execution in a WaveCache Processor,
Steve Swanson, Andrew Schwerin, Andrew Petersen, Mark Oskin and Susan Eggers,
ISCA Workshop on Complexity-effective Design, June 2004.

Key ideas:

(1) Order memory operations of single "thread" with "waves" so that
    (potentially) conflicting operations are totally ordered.

    Effectively creates "program order" [Lamport]

    Parallelism through non-memory operations and memory disambiguation

    Issues with conditionals

(2) Spatially assign "code blocks" to a tiled grid of processing elements
    (not a focus for CS/ECE 757)


*** Use MICRO-36 Slides here ***


(3) Second paper adds thread-id to memory order

    Appears per-thread "program orders" merged into global
    total order (like sequential consistency [Lamport])

Cf. UT TRIPS

}}}


Wrapup
------

Dataflow in out-of-order processors
* CDC 6600 Scoreboarding like static dataflow
* Rest like dynamic dataflow

von Neumann has won for now
* less overhead
* dataflow inside
* branch prediction
* caching


Dataflow on Objects
-------------------
Dataflow Execution of Sequential Imperative Programs on Multicore Architectures
Gagan Gupta and Gurindar S. Sohi
MICRO 2011
Paper: ftp://ftp.cs.wisc.edu/sohi/papers/2011/MICRO_2011_Dataflow.pdf
 /afs/cs.wisc.edu/u/m/a/markhill/public/html/restricted/micro11_dataflow.pdf
Slide pptx: see email ca. Mar 29, 2012
 /afs/cs.wisc.edu/u/m/a/markhill/public/html/restricted/micro11_dataflow_talk.pdf