Approximate Lecture Notes for
SIMD (Ignored by Book) and Future Directions (Chapter 12)
Mark D. Hill (revised May 2003)
WE WON'T COVER MOST OF THIS
Outline
-------
SIMD
Data Parallel
CM
BSP
PIM
Future
Chapter 12
Dataflow (past?)
Quantum Computing
Great Communication Inventions
Single Instruction Multiple Data (SIMD)
=======================================
[Hillis/Steele] for SW
[Kuck/Stokes] for BSP
[Leiserson et al.] for CM-5
Model:
Sequencer --- many P M --- host
Host dispatches "macro" instructions and executes some scalar instructions
Sequencer broadcasts to bit-serial processors (e.g., 64K)
Ideal for data parallel model
actually DP is an abstraction of SIMD
E.g., CM-1 and CM-2
All instructions subject to "context" flag
Processor "selected" if context=1
Saving/restoring context is unconditional
and, or, not, etc. are conditional (sketch below)
support for integer and FP?
host broadcasts globals
send instrn
time multiplex to get virtual processor per datum
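A minimal C sketch of masked execution under the context flag (all names invented here, not CM code): every PE "executes" each broadcast op, and the context bit decides which PEs commit the result.

    /* Context-flag SIMD sketch: select PEs, then run the body everywhere;
       only selected PEs commit. Illustrative only. */
    #include <stdio.h>

    #define NPE 8                    /* pretend 8 processing elements */

    int main(void) {
        int x[NPE] = {3, -1, 4, -1, 5, -9, 2, -6};
        int ctx[NPE];

        /* "where (x[k] < 0) x[k] = -x[k]" on a SIMD machine: */
        for (int k = 0; k < NPE; k++) ctx[k] = (x[k] < 0);  /* set context */
        for (int k = 0; k < NPE; k++)
            if (ctx[k]) x[k] = -x[k];                       /* masked body */

        for (int k = 0; k < NPE; k++) printf("%d ", x[k]);  /* 3 1 4 1 5 9 2 6 */
        printf("\n");
        return 0;
    }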
Sum in Log Time
Hillis/Steele Figure 1
for j = 1 to log2(n)
for all k in parallel do
if (k+1) mod 2^j == 0 then
x[k] = x[k - 2^(j-1)] + x[k]
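A sequential C sketch of the loop above (illustrative, not from the paper). Within a step no destination index is also a source index, so a plain loop reproduces the parallel semantics:

    /* Sum in log2(N) steps (Hillis/Steele Fig. 1 pattern), simulated in C. */
    #include <stdio.h>

    #define N 16                     /* power of two */

    int main(void) {
        int x[N];
        for (int k = 0; k < N; k++) x[k] = k + 1;   /* 1..16, sum = 136 */

        for (int twoj = 2; twoj <= N; twoj *= 2)    /* twoj = 2^j */
            for (int k = 0; k < N; k++)             /* "for all k in parallel" */
                if ((k + 1) % twoj == 0)
                    x[k] = x[k - twoj / 2] + x[k];  /* add value 2^(j-1) to the left */

        printf("sum = %d\n", x[N - 1]);             /* 136 */
        return 0;
    }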
All Partial Sums in Log Time
Hillis/Steele Figure 2
if (k+1) mod 2^j == 0 then ==> if k >= 2^(j-1) then
Can also do and, or, etc.
Quiz: gather positive x[k] into y[]
if x[k] > 0 then i[k]= 1 else i[k]=0
do partial sums on i[k]
if x[k] > 0 then y[i[k]] = x[k]
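A C sketch of the scan plus this gather (illustrative; arrays here are 0-based, so the store becomes y[i[k]-1]). Scanning k downward makes each read see pre-step values, matching the parallel semantics:

    /* All partial sums (inclusive scan) + gather of positives ("stream
       compaction"), simulated in C. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        int x[N] = {3, -1, 4, -1, 5, -9, 2, -6};
        int flag[N], idx[N], y[N];

        for (int k = 0; k < N; k++) flag[k] = idx[k] = (x[k] > 0);

        for (int d = 1; d < N; d *= 2)         /* d = 2^(j-1) */
            for (int k = N - 1; k >= d; k--)   /* "for all k in parallel" */
                idx[k] = idx[k - d] + idx[k];  /* inclusive prefix sum */

        int m = idx[N - 1];                    /* count of positives */
        for (int k = 0; k < N; k++)
            if (flag[k]) y[idx[k] - 1] = x[k]; /* idx is a 1-based count */

        for (int k = 0; k < m; k++) printf("%d ", y[k]);   /* 3 4 5 2 */
        printf("\n");
        return 0;
    }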
Thinking Machines
-----------------
Large array of processing elements w/ front end
Early: Goodyear MPP, Illinois Illiac IV
CM-1 [Tucker/Robinson, IEEE Computer 8/88]
1986
1-4 front ends
64K bit-serial data processors
4Kb memory/PE
ALU and accumulator
Hypercube & 2D-grid interconnect
CM-2 1987
evolutionary upgrade
64Kb memory/PE, optional FPU
CM-5
~1991
Not SIMD, but MIMD with Data Parallel support
PEs w/ microprocessors
plus vector units
plus control network
plus data parallel compilers
but
vector units hard to use
control network overkill
Other SIMD Approaches
---------------------
Short, Parallel SIMD
E.g., MMX (SWAR sketch below)
Pipelined SIMD ==> Vectors
short as in Cray-1
long as in CDC Cyber 205 (up to 64K elements)
Medium, Parallel SIMD
E.g., BSP (next)
Processing in Memory (soon)
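A tiny "SIMD within a register" (SWAR) sketch of the short-parallel idea, in portable C rather than real MMX intrinsics: four byte lanes packed into one 32-bit word, added with cross-lane carries masked off.

    /* SWAR: add four unsigned byte lanes in one word-wide operation
       (wraparound per lane, no saturation). Illustrative only. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t add4u8(uint32_t a, uint32_t b) {
        uint32_t lo = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu); /* low 7 bits/lane */
        return lo ^ ((a ^ b) & 0x80808080u);   /* top bit per lane, carry dropped */
    }

    int main(void) {
        printf("%08x\n", add4u8(0x01020304u, 0x10203040u));  /* 11224344 */
        return 0;
    }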
Burroughs Scientific Processor (BSP)
------------------------------------
Interesting, in part, because it focuses on dependent,
rather than independent, operations (like Multiscalar)
FAPAS pipeline
F: 16 of 17 memory elements (fetch)
A: alignment network (align)
P: 16 processor elements (process)
A: alignment network (align)
S: 16 of 17 memory elements (store)
What instructions?
Vector instructions with 1-5 operands
Simple: Z[i] = A[i] + B[i]
5-op: Z[i] = (A[i] op B[i]) op (C[i] op D[i]) op E[i]
Reduce: X[i] = A[i,0] op A[i,1] op A[i,2] ...
expand, merge, compress
Why 17 memory banks?
prime number to mitigate conflicts
At each address use 16 banks and leave one empty
bank address = A div 16
bank number = A mod 17
use recurrence to avoid non-power-of-two mod operation
BN = BN + 1
if (BN==17) BN = 0
Example w/ 4+1

                    Bank Number
    Bank Address     0    1    2    3    4
         0           0    1    2    3   --
         1           5    6    7   --    4
         2          10   11   --    8    9
         3          15   --   12   13   14
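The table's 4+1 mapping as a C sketch (for the real BSP, NB=17 and WIDTH=16); the running recurrence replaces a per-address mod-17:

    /* Prime-bank mapping: bank address = A div WIDTH, bank number = A mod NB,
       with the mod kept by a running recurrence. */
    #include <stdio.h>

    #define NB    5     /* prime number of banks */
    #define WIDTH 4     /* banks actually used per row */

    int main(void) {
        int bn = 0;                      /* bank number = A mod NB */
        for (int a = 0; a < 16; a++) {
            printf("addr %2d -> row %d, bank %d\n", a, a / WIDTH, bn);
            if (++bn == NB) bn = 0;      /* BN = BN + 1; if (BN == NB) BN = 0 */
        }
        return 0;
    }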
Processing in Memory (PIM)
--------------------------
Another hope for SIMD?
[Gokhale, Holmes, Iobst, Terasys PIM]
Need general purpose system (e.g., UP or MIMD)
Selectively need massive SIMD
Solution: Integrate SIMD in Memory
Can operate like normal DRAM
Or Do SIMD processing
See Figure 1 (validation system)
Terasys board has 4Kx25b application-specific microcode table
Host uses memory-mapped store to select microinstruction
Terasys broadcasts it to the PIM array.
Global Communication
global OR
partitioned OR
parallel prefix (including nearest neighbor communication)
PIM SRAM
Can look like conventional 32Kx4 SRAM
Or 2Kx64 w/ 64 1b processors
On each clock cycle, processor can
load or store on memory (not both)
load from n/w
store to n/w
perform abc or ab+ac+bc or c+ab
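These ops suffice for bit-serial arithmetic: ab+ac+bc is exactly a full adder's carry. A C sketch (bit i of each word stands in for PE i; the XOR forming the sum bit is an assumed companion op, not one of the three listed above):

    /* Bit-serial addition on 32 one-bit PEs, simulated with word-wide
       bitwise ops: each uint32_t is one bit-plane across all PEs. */
    #include <stdio.h>
    #include <stdint.h>

    #define BITS 8                         /* operand width per PE */

    int main(void) {
        uint32_t a[BITS] = {0}, b[BITS] = {0}, s[BITS] = {0};

        /* load PE0: 3+5 and PE1: 100+27 into the bit-planes */
        uint32_t av[2] = {3, 100}, bv[2] = {5, 27};
        for (int pe = 0; pe < 2; pe++)
            for (int i = 0; i < BITS; i++) {
                a[i] |= ((av[pe] >> i) & 1u) << pe;
                b[i] |= ((bv[pe] >> i) & 1u) << pe;
            }

        uint32_t c = 0;                    /* carry plane */
        for (int i = 0; i < BITS; i++) {   /* one bit per "clock cycle" */
            s[i] = a[i] ^ b[i] ^ c;        /* sum bit (assumed XOR op) */
            c = (a[i] & b[i]) | (a[i] & c) | (b[i] & c);   /* ab+ac+bc */
        }

        for (int pe = 0; pe < 2; pe++) {   /* read results back out */
            unsigned v = 0;
            for (int i = 0; i < BITS; i++) v |= ((s[i] >> pe) & 1u) << i;
            printf("PE%d: %u\n", pe, v);   /* 8 and 127 */
        }
        return 0;
    }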
Global Communication
global OR
OR over all 32K PEs returned to host in one tick
partitioned OR
OR over power-of-two partitions fed back in one tick
parallel prefix
15 levels
level 0 nearest neighbor communication in one tick
level 1 even processors i to both i-1 and i-2
level 2 4n processors to 4n-[4,3,2,1]
level 15 broadcast
Software: dataparallel bit c (dbC)
ANSI C superset
adds parallel arbitrary-length bit streams
...
Will it work?
- not COTS (commercial off-the-shelf) technology
+ BW internal to memory chip very high
* signal processing?
* code breaking?
Critique of SIMD
----------------
+ saves sequencing
- microprocessors are great sequencers
+ great parallelism
- but what of Amdahl's Law?
Future Trends
=============
From Chapter 12
In 1999
#machines       #processors
10s             1,000s
1,000s          100s
100,000s        10s
1,000,000s      few
100,000,000s    1
processor speed 100x/decade
dram speed 2x/decade
==> more latency tolerance and caching
latency tolerance
dynamic instrn issue + non-blocking w/ better prediction
multithreading
MP
but remember Little's Law:
latency to be hidden is L
rate of long-latency requests is R
==> need an average of R*L outstanding requests
==> P processors have P*R*L requests outstanding
==> L could increase with P processors + congestion
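E.g. (illustrative numbers, not from the chapter):
L = 100ns miss latency, one long-latency request per 10ns (R = 0.1/ns)
==> R*L = 10 requests outstanding per processor to hide all latency
==> 16 processors: P*R*L = 160 in flight, and that traffic can itself grow L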
better caches
+ trans/microprocessor 30x/decade
- memory capacity 100x/decade
currently designers minimize average time over a stream of accesses:
HitRate(S)*HitTime + [1 - HitRate(S)]*MissTime
w/ perfect latency tolerance:
max(HitRate(S)*HitTime, [1 - HitRate(S)]*MissTime)
(furthermore, misses use some hit time resources)
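E.g. (illustrative numbers): HitTime = 1 cycle, MissTime = 100 cycles, HitRate(S) = 0.98
today: 0.98*1 + 0.02*100 = 2.98 cycles/access
w/ perfect tolerance: max(0.98*1, 0.02*100) = 2.0 cycles/access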
Non-Linear Effects
Recall
Minicomputers exploded when MSI/LSI made them possible
Killer micro enabled by VLSI
fast UPs
practical SMPs
System on a Chip (SOC)
PC? No, 1G transistors by 2010 ==> < 1Gb = 128MB DRAM
sub-PC: PDA, wireless 802.11 & cellular
Dataflow (more past than future?)
---------------------------------
The ultimate latency tolerance is "dataflow"
No program counter (PC)
View program as "graph"
f = a*b + c*d
subset of operations ready to "fire"
pick one and execute
enabling new instructions to fire
repeat until done
"tagged-token dataflow architecture" (TTDA) added tags to
permit multiple dynamic instances of a static instruction
be in execution
many of these ideas are now used in "window" of dynamic processor
Problems
Require large associative namespace of operands ready to go
This synchronization/storage allocation happens on every instruction
Explicit Token-Store Architecture
[Papadopoulos & Culler 1990]
Associate with a "block" of code (a function) some data memory (a frame)
token:
ip points to instrn that includes
operation
frame offset
one or more destination instructions
Loop
grab a token
lookup instrn
look at frame+offset
if partner value there
then
execute instrn
create and store tokens for each destination instrn
else
store value there (at frame+offset)
Note frame locations have presence bit and left/right operand indicator
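A toy C rendering of this loop (structure and names invented for illustration), specialized to the earlier example f = a*b + c*d:

    /* Explicit token-store sketch (after Papadopoulos & Culler): a two-input
       instrn owns a frame slot with a presence bit; the first token parks
       its value there, the second fires the op. */
    #include <stdio.h>

    enum op { MUL, ADD, PRINT };

    struct instrn { enum op op; int slot; int dest, port; };

    /* i0: a*b -> i2 left; i1: c*d -> i2 right; i2: + -> i3; i3: print */
    static struct instrn prog[] = {
        {MUL,   0,  2, 0},
        {MUL,   1,  2, 1},
        {ADD,   2,  3, 0},
        {PRINT, -1, -1, -1},           /* slot -1: single-input instrn */
    };

    struct token { int ip, port, val; };
    static struct token q[16];         /* token queue */
    static int head, tail;
    static int frame[4], present[4];   /* matching store + presence bits */

    static void emit(int ip, int port, int val) {
        q[tail++] = (struct token){ip, port, val};
    }

    int main(void) {
        emit(0, 0, 2); emit(0, 1, 3);  /* a = 2, b = 3 */
        emit(1, 0, 4); emit(1, 1, 5);  /* c = 4, d = 5 */

        while (head < tail) {                    /* grab a token */
            struct token t = q[head++];
            struct instrn *in = &prog[t.ip];     /* lookup instrn */
            if (in->slot >= 0 && !present[in->slot]) {
                frame[in->slot] = t.val;         /* partner absent: park value */
                present[in->slot] = 1;
                continue;
            }
            int l = t.val, r = t.val;
            if (in->slot >= 0) {                 /* partner present: fire */
                present[in->slot] = 0;
                if (t.port == 0) r = frame[in->slot];
                else             l = frame[in->slot];
            }
            switch (in->op) {
            case MUL: emit(in->dest, in->port, l * r); break;
            case ADD: emit(in->dest, in->port, l + r); break;
            case PRINT: printf("f = %d\n", t.val); break;   /* 26 */
            }
        }
        return 0;
    }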
What happened?
Kept von Neumann model
Dataflow w/ window where synchronization space tractable
E.g., physical registers use "single assignment"
Quantum Computing (QC)
----------------------
Cf. Molecular Computing in Deych & Nugent project
[Daniel Gottesman's Home Page]
[Gershenfeld & Chuang, Scientific American]
Recall from quantum mechanics: electron in many states until observed
Classical computer: many bits, each in one state
QC: many quantum bits (qubits) in superposition of many states
(until observed)
QC theory
E.g., Peter Shor solved subproblem of RSA public key cryptography
Modular exponentiation y = x^r mod n is easy
Discrete log is hard: given x, n, y, find r
Shor's QC algorithm calculates for all possible r at once
Gets a periodic function with peaks spaced 1/r apart
Take Fourier transform to find r
QC engineering
Use chloroform (CHCl3) with NMR to get 2 qubits (at most 10 are possible)
a long way to go....
Recall Technology Paradox
Old
Mechanical-->Tubes-->Transistors-->ICs
Many-->Core-->Semiconductor Memory
Many-->Magnetic Disks
Great Communication Inventions
------------------------------
Writing
Printing Press
Telegraph
Telephone
Radio
Television
Internet
Last 40 years / Next 40 years