--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Single Instruction Multiple Data (SIMD)


Outline
-------
  Review Data Parallel
  CM-1,2,5
  PIM
  (BSP)


References:

[Hillis/Steele] for SW
[Leiserson et al.] for CM-5
[Kuck/Stokes] for BSP

Review Data Parallel Model
--------------------------


Sum in Log Time
	Hillis/Steele Figure 1

        for j = 1 to log2(n)
            for all k in parallel do
                if (k+1) mod 2^j == 0 do
                    x[k] = x[k - 2^(j-1)] + x[k]
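
A minimal sequential simulation of the loop above, assuming a 0-indexed array whose length is a power of two. The name `log_sum` and the snapshot copy `old` are illustrative choices, not from Hillis/Steele; the snapshot stands in for "all processors read before any processor writes".

```python
def log_sum(x):
    """Simulate the log-time sum; the total ends up in the last element."""
    x = list(x)
    n = len(x)                          # assumed to be a power of two
    steps = n.bit_length() - 1          # log2(n) for a power of two
    for j in range(1, steps + 1):
        old = x[:]                      # snapshot: parallel reads happen before writes
        for k in range(n):              # conceptually "for all k in parallel"
            if (k + 1) % (2 ** j) == 0:
                x[k] = old[k - 2 ** (j - 1)] + old[k]
    return x
```

For x = [1, 2, ..., 8] the last element ends up holding 36 after three steps.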

All Partial Sum in Log Time
	
	Hillis/Steele Figure 2

	if (k+1) mod 2^j == 0 do  ==> if k >= 2^(j-1) do

Can also do and, or, etc.
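
A sketch of the all-partial-sums variant, again simulated sequentially on a 0-indexed array. The guard used here is k >= 2^(j-1), i.e., every element that has a partner 2^(j-1) positions to its left gets updated. The function name and the `op` parameter are hypothetical; `op` is assumed to be associative (add, and, or, max, ...).

```python
import operator

def log_prefix(x, op=operator.add):
    """Inclusive scan in ceil(log2(n)) steps; op must be associative."""
    x = list(x)
    n = len(x)
    d = 1                               # d = 2^(j-1), the stride at step j
    while d < n:
        old = x[:]                      # snapshot models simultaneous reads
        for k in range(n):              # conceptually "for all k in parallel"
            if k >= d:
                x[k] = op(old[k - d], old[k])
        d *= 2
    return x
```

Every step keeps all n processors eligible to work, so this does more total operations than the single-sum loop but still finishes in log time.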

SKIP {{

Quiz: gather positive x[k] into y[]

	if x[k] > 0 then i[k]= 1 else i[k]=0
	do partial sums on i[k]
	if x[k] > 0 then y[i[k]] = x[k]

}}
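
A sketch of the quiz answer in runnable form. It uses `itertools.accumulate` as a stand-in for the parallel partial-sum step, and subtracts 1 from the index because Python lists are 0-indexed; the function name is illustrative.

```python
from itertools import accumulate

def gather_positive(x):
    """Pack the positive elements of x into y, preserving order."""
    flags = [1 if v > 0 else 0 for v in x]      # i[k] = 1 where x[k] > 0
    idx = list(accumulate(flags))               # inclusive partial sums of i[k]
    y = [None] * (idx[-1] if idx else 0)
    for k, v in enumerate(x):                   # conceptually parallel: targets are distinct
        if v > 0:
            y[idx[k] - 1] = v                   # -1 converts the 1-based rank to a list index
    return y
```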


Large-Scale Parallel SIMD
-------------------------

Large array of processing elements w/ front end
	Early: Goodyear MPP, Illinois Illiac IV
	
     Sequencer --- many P M --- host

Host dispatches "macro" instructions and executes some scalar instructions
Sequencer broadcasts to bit-serial processors (e.g., 64K)

Literal realization of DP
Or DP is an abstraction of SIMD

If threads diverge -- e.g., then/else clause -- mask and do one at a time
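
A minimal sketch of mask-based handling of a then/else, assuming one mask bit per processing element. The function and parameter names are hypothetical; the point is that both clauses are broadcast in turn and each PE commits a result only in the pass where its mask bit selects it.

```python
def simd_if_else(x, cond, then_op, else_op):
    """Run both clauses over all elements, one clause at a time, under a mask."""
    mask = [cond(v) for v in x]                 # one mask bit per PE
    out = list(x)
    # Pass 1: broadcast the "then" clause; only PEs with mask set commit.
    out = [then_op(v) if m else v for v, m in zip(out, mask)]
    # Pass 2: broadcast the "else" clause; only PEs with mask clear commit.
    out = [v if m else else_op(v) for v, m in zip(out, mask)]
    return out
```

Total time is the sum of both clauses, regardless of how many elements take each path.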


Thinking Machines
-----------------

CM-1 [Tucker/Robinson, IEEE Computer 8/88]
	1986
	1-4 front ends
	64K bit-serial data processors
		4Kb memory
		ALU and accumulator
		Hypercube & 2D-grid interface
	
CM-2	1987
	evolutionary upgrade
	64Kb memory/PE, optional FPU


E.g., with CM-1 and CM-2
    All instructions subject to "context" flag
    Processor "selected" if context=1
    Saving/restoring context unconditional
    and, or, not, etc. conditional
    support for integer and FP?
    host broadcasts globals
    send instruction
    time multiplex to get virtual processor per datum
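
A sketch of the context flag combined with virtual processors, assuming a simple slice-by-slice assignment of virtual processors to physical PEs. That mapping, and all names below, are assumptions for illustration; the notes do not say how the CM actually lays out virtual processors.

```python
def broadcast_with_vps(data, context, instr, num_pes):
    """Apply instr to every selected datum using num_pes physical PEs."""
    n = len(data)
    vp_ratio = -(-n // num_pes)                 # ceil(n / num_pes)
    out = list(data)
    passes = 0
    for r in range(vp_ratio):                   # time-multiplexed: one pass per slice
        for pe in range(num_pes):               # conceptually parallel across PEs
            vp = r * num_pes + pe               # hypothetical VP numbering
            if vp < n and context[vp] == 1:     # only "selected" processors commit
                out[vp] = instr(out[vp])
        passes += 1
    return out, passes
```

Each broadcast instruction costs `vp_ratio` passes, so run time grows with the VP ratio while the program itself is unchanged.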


Reviews
* David: When is higher VP ratio better?, Tony: VP interesting, Syed: VP, Andrew N.: VP vs. timesharing
* Shijin & Syed: Hypercube? & Marc: Grid vs. hypercube
* Brian:  Data parallel today -- Nvidia GeForce 8 for Wednesday
* Guoliang & Andrew N.: Control vs. data parallelism?

* Aditya: processor pipelined? no.
* Daniel: CM reliability?
 
CM-5


The Network Architecture of the Connection Machine CM-5,
Charles E. Leiserson and others,
The Journal of Parallel and Distributed Computing,
March 15, 1996 (revised from SPAA 1992).
	~1991
	Not SIMD, but MIMD with Data Parallel support

	PEs w/ microprocessors
	plus vector units
	plus control network
	plus data parallel compilers
	but
	    vector units hard to use
	    control network overkill

Draw/Show Figure 1.

Network interfaces accessed via uncacheable loads and stores
mapped to user addresses.

Draw/Show Figure 2 Fat Tree

Draw/Show Figure 3 CM-5 Data N/W
Short messages -- five words?
Coarse timesharing via all-fall-down

Control N/W -- broadcasts and combining (of associative operations)
E.g., data N/W empty when SUM sent - SUM received = 0.
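
A sketch of that emptiness test, modelling a combining operation as a pairwise tree reduction over per-node counters. The function names and the per-node counter lists are hypothetical; this only illustrates the arithmetic, not the CM-5 hardware interface.

```python
import operator

def combine(values, op):
    """Pairwise tree reduction over a non-empty list; op must be associative."""
    level = list(values)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                      # odd element passes up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

def data_network_empty(sent, received):
    """True when every message sent has been received somewhere."""
    return combine(sent, operator.add) - combine(received, operator.add) == 0
```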


Other SIMD Approaches
---------------------

Short, Parallel SIMD
	E.g., MMX

Pipelined SIMD ==> Vectors
	short as in Cray-1
	long as in CDC Cyber 205 (up to 64K)

Medium, Parallel SIMD
	E.g., BSP (next)

Processing in Memory (next)


Processing in Memory (PIM)
--------------------------
Another hope for SIMD?
[Gokhale, Holmes, Iobst, Terasys PIM]

Need general purpose system (e.g., UP or MIMD)
Selectively need massive SIMD

Solution: Integrate SIMD in Memory
    Can operate like normal DRAM
    Or Do SIMD processing
    See Figure 1 (validation system)


Terasys board has a 4Kx25b application-specific microcode table
Host uses a memory-mapped store to select a microinstruction
Terasys broadcasts it to the PIM array.

Global Communication
	global OR
	partitioned OR
	parallel prefix (including nearest neighbor communication)

PIM SRAM 
   Can look like conventional 32Kx4 SRAM
   Or 2Kx64 w/ 64 1b processors

   On each clock cycle, processor can
    load from or store to memory (not both)
    load from n/w
    store to n/w
    perform abc or ab+ac+bc or c+ab
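
A sketch of the three ALU functions, reading them as Boolean expressions (juxtaposition = AND, + = OR); that reading is an assumption. Python integers are used as bit vectors so that each bit position plays the role of one 1b processor. Function names are illustrative.

```python
def and3(a, b, c):
    """abc: AND of the three inputs, bitwise across all lanes."""
    return a & b & c

def majority(a, b, c):
    """ab+ac+bc: majority vote, which is also the carry-out of a full adder."""
    return (a & b) | (a & c) | (b & c)

def or_and(a, b, c):
    """c+ab: c OR (a AND b)."""
    return c | (a & b)
```

With a = 0b1100, b = 0b1010, c = 0b0110, four lanes are evaluated at once by a single call.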

Global Communication
    global OR
	OR of all 32K units returned to host in one tick
    partitioned OR
	OR over a power-of-2 group fed back in one tick
    parallel prefix
	15 levels
	level 0 nearest neighbor communication in one tick
	level 1 even processors i to both i-1 and i-2
	level 2 4n processors to 4n-[4,3,2,1]
	level 15 broadcast
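
A sketch of the two OR operations over a list of per-processor bits. It assumes that "fed back" means every processor in a group receives the OR of its own group; that interpretation and the function names are assumptions, not taken from the Terasys paper.

```python
def global_or(bits):
    """Single OR over every processor's bit, as returned to the host."""
    return int(any(bits))

def partitioned_or(bits, group):
    """OR within each aligned group; each member gets its group's result."""
    assert group > 0 and group & (group - 1) == 0, "group size must be a power of 2"
    out = []
    for start in range(0, len(bits), group):
        chunk = bits[start:start + group]
        out.extend([int(any(chunk))] * len(chunk))
    return out
```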

Software: Data-parallel Bit C (dbC)
	ANSI C superset
	adds parallel arbitrary-length bit streams
	...

Will it work?
   - not COTS (commercial off-the-shelf) technology
   + BW internal to memory chip very high
	* signal processing?
	* code breaking?
    


Critique of SIMD
----------------
+ saves sequencing
- microprocessors are great sequencers
+ great parallelism
- but what of Amdahl's law
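
To put a number on the Amdahl's law concern, a small sketch using the standard formula speedup = 1 / ((1 - f) + f / n), where f is the fraction of work that parallelizes and n is the processor count. The function name is illustrative.

```python
def amdahl_speedup(parallel_fraction, n):
    """Speedup on n processors when parallel_fraction of the work parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)
```

Even with 64K processors, f = 0.99 keeps the speedup below 100, since the serial 1% bounds it at 1 / 0.01.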


Burroughs Scientific Processor (BSP) (OPTIONAL)
-----------------------------------------------

Interesting, in part, because it focuses on dependent,
rather than independent, operations (like Multiscalar)

FAPAS pipeline

	F: 16 of 17 memory elements (fetch)

	A: alignment network (align)

	P: 16 processor elements (process)

	A: alignment network (align)

	S: 16 of 17 memory elements (store)

What instructions?
	Vector instructions with 1-5 operands
	Simple:   Z[i] = A[i] + B[i]
	5-op:     Z[i] = (A[i] op B[i]) op (C[i] op D[i]) op E[i]
	Reduce:   X[i] = A[i,0] op A[i,1] op A[i,2] ...
	expand, merge, compress
	
Why 17 memory banks?
	prime number to mitigate conflicts
	At each address use 16 banks and leave one empty
	bank address = A div 16
	bank number =  A mod 17
	use recurrence to avoid non-power-of-two mod operation
		BN = BN + 1
		if (BN==17) BN = 0

        Example w/ 4+1 banks (bank address = A div 4, bank number = A mod 5)

        Bank address    Bank number

                0  1  2  3  4

        0       0  1  2  3 --
        1       5  6  7 --  4
        2      10 11 --  8  9
        3      15 -- 12 13 14
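
A sketch that regenerates the layout above from the two rules (address = A div banks used, bank = A mod number of banks), using the increment-and-wrap recurrence in place of the mod. The function names and parameters are illustrative; `banks_hit` is an added helper to show which banks a strided access touches.

```python
def layout(num_words, banks_used=16, num_banks=17):
    """Return rows of the bank table; None marks the unused slot in each row."""
    rows = {}
    bn = 0                                      # bank number, kept by recurrence
    for a in range(num_words):
        addr = a // banks_used                  # bank address = A div banks_used
        rows.setdefault(addr, [None] * num_banks)[bn] = a
        bn += 1                                 # BN = BN + 1
        if bn == num_banks:                     # if (BN == num_banks) BN = 0
            bn = 0
    return [rows[r] for r in sorted(rows)]

def banks_hit(stride, count, num_banks):
    """Bank numbers touched by `count` accesses at the given stride from 0."""
    return [(i * stride) % num_banks for i in range(count)]
```

`layout(16, 4, 5)` reproduces the 4+1 table. With 5 banks a stride-4 access touches four different banks, whereas with 4 banks the same stride would hit bank 0 every time.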