Approximate Lecture Notes for
SIMD (Ignored by Book) and Future Directions (Chapter 12)
Mark D. Hill (revised May 2003)
WE WON'T COVER MOST OF THIS
Outline
-------
SIMD
Data Parallel
CM
BSP
PIM
Future
Chapter 12
Dataflow (past?)
Quantum Computing
Great Communication Inventions
Single Instruction Multiple Data (SIMD)
=======================================
[Hillis/Steele] for SW
[Kuck/Stokes] for BSP
[Leiserson et al.] for CM-5
Model:
Sequencer --- many P M --- host
Host dispatches "macro" instructions and executes some scalar instructions
Sequencer broadcasts to bit-serial processors (e.g., 64K)
Ideal for data parallel model
actually DP is an abstraction of SIMD
E.g., CM-1 and CM-2
All instructions subject to "context" flag
Processor "selected" if context=1
Saving/restoring context is unconditional
and, or, not, etc. are conditional (sketch below)
support for integer and FP?
host broadcasts globals
send instrn
time multiplex to get virtual processor per datum
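A minimal C sketch of masked execution under the context flag (all names invented here, not CM code): every PE "executes" each broadcast op, and the context bit decides which PEs commit the result.

    /* Context-flag SIMD sketch: select PEs, then run the body everywhere;
       only selected PEs commit. Illustrative only. */
    #include <stdio.h>

    #define NPE 8                    /* pretend 8 processing elements */

    int main(void) {
        int x[NPE] = {3, -1, 4, -1, 5, -9, 2, -6};
        int ctx[NPE];

        /* "where (x[k] < 0) x[k] = -x[k]" on a SIMD machine: */
        for (int k = 0; k < NPE; k++) ctx[k] = (x[k] < 0);  /* set context */
        for (int k = 0; k < NPE; k++)
            if (ctx[k]) x[k] = -x[k];                       /* masked body */

        for (int k = 0; k < NPE; k++) printf("%d ", x[k]);  /* 3 1 4 1 5 9 2 6 */
        printf("\n");
        return 0;
    }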
Sum in Log Time
Hillis/Steele Figure 1
for j = 1 to log2(n)
for all k in parallel do
if (k+1) mod 2^j == 0 then
x[k] = x[k - 2^(j-1)] + x[k]
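A sequential C sketch of the loop above (illustrative, not from the paper). Within a step no destination index is also a source index, so a plain loop reproduces the parallel semantics:

    /* Sum in log2(N) steps (Hillis/Steele Fig. 1 pattern), simulated in C. */
    #include <stdio.h>

    #define N 16                     /* power of two */

    int main(void) {
        int x[N];
        for (int k = 0; k < N; k++) x[k] = k + 1;   /* 1..16, sum = 136 */

        for (int twoj = 2; twoj <= N; twoj *= 2)    /* twoj = 2^j */
            for (int k = 0; k < N; k++)             /* "for all k in parallel" */
                if ((k + 1) % twoj == 0)
                    x[k] = x[k - twoj / 2] + x[k];  /* add value 2^(j-1) to the left */

        printf("sum = %d\n", x[N - 1]);             /* 136 */
        return 0;
    }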
All Partial Sums in Log Time
Hillis/Steele Figure 2
if (k+1) mod 2^j == 0 then ==> if k >= 2^(j-1) then
Can also do and, or, etc.
Quiz: gather positive x[k] into y[]
if x[k] > 0 then i[k]= 1 else i[k]=0
do partial sums on i[k]
if x[k] > 0 then y[i[k]] = x[k]
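A C sketch of the scan plus this gather (illustrative; arrays here are 0-based, so the store becomes y[i[k]-1]). Scanning k downward makes each read see pre-step values, matching the parallel semantics:

    /* All partial sums (inclusive scan) + gather of positives ("stream
       compaction"), simulated in C. */
    #include <stdio.h>

    #define N 8

    int main(void) {
        int x[N] = {3, -1, 4, -1, 5, -9, 2, -6};
        int flag[N], idx[N], y[N];

        for (int k = 0; k < N; k++) flag[k] = idx[k] = (x[k] > 0);

        for (int d = 1; d < N; d *= 2)         /* d = 2^(j-1) */
            for (int k = N - 1; k >= d; k--)   /* "for all k in parallel" */
                idx[k] = idx[k - d] + idx[k];  /* inclusive prefix sum */

        int m = idx[N - 1];                    /* count of positives */
        for (int k = 0; k < N; k++)
            if (flag[k]) y[idx[k] - 1] = x[k]; /* idx is a 1-based count */

        for (int k = 0; k < m; k++) printf("%d ", y[k]);   /* 3 4 5 2 */
        printf("\n");
        return 0;
    }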
Thinking Machines
-----------------
Large array of processing elements w/ front end
Early: Goodyear MPP, Illinois Illiac IV
CM-1 [Tucker/Robinson, IEEE Computer 8/88]
1986
1-4 front ends
64K bit-serial data processors
4Kb memory/PE
ALU and accumulator
Hypercube & 2D-grid interconnect
CM-2 1987
evolutionary upgrade
64Kb memory/PE, optional FPU
CM-5
~1991
Not SIMD, but MIMD with Data Parallel support
PEs w/ microprocessors
plus vector units
plus control network
plus data parallel compilers
but
vector units hard to use
control network overkill
Other SIMD Approaches
---------------------
Short, Parallel SIMD
E.g., MMX (SWAR sketch below)
Pipelined SIMD ==> Vectors
short as in Cray-1
long as in CDC Cyber 205 (up to 64K elements)
Medium, Parallel SIMD
E.g., BSP (next)
Processing in Memory (soon)
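A tiny "SIMD within a register" (SWAR) sketch of the short-parallel idea, in portable C rather than real MMX intrinsics: four byte lanes packed into one 32-bit word, added with cross-lane carries masked off.

    /* SWAR: add four unsigned byte lanes in one word-wide operation
       (wraparound per lane, no saturation). Illustrative only. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t add4u8(uint32_t a, uint32_t b) {
        uint32_t lo = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu); /* low 7 bits/lane */
        return lo ^ ((a ^ b) & 0x80808080u);   /* top bit per lane, carry dropped */
    }

    int main(void) {
        printf("%08x\n", add4u8(0x01020304u, 0x10203040u));  /* 11224344 */
        return 0;
    }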
Burroughs Scientific Processor (BSP)
------------------------------------
Interesting, in part, because it focuses on dependent,
rather than independent, operations (like Multiscalar)
FAPAS pipeline
F: 16 of 17 memory elements (fetch)
A: alignment network (align)
P: 16 processor elements (process)
A: alignment network (align)
S: 16 of 17 memory elements (store)
What instructions?
Vector instructions with 1-5 operands
Simple: Z[i] = A[i] + B[i]
5-op: Z[i] = (A[i] op B[i]) op (C[i] op D[i]) op E[i]
Reduce: X[i] = A[i,0] op A[i,1] op A[i,2] ...
expand, merge, compress
Why 17 memory banks?
prime number to mitigate conflicts
At each address use 16 banks and leave one empty
bank address = A div 16
bank number = A mod 17
use recurrence to avoid non-power-of-two mod operation
BN = BN + 1
if (BN==17) BN = 0
Example w/ 4+1

                    Bank Number
    Bank Address     0    1    2    3    4
         0           0    1    2    3   --
         1           5    6    7   --    4
         2          10   11   --    8    9
         3          15   --   12   13   14
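The table's 4+1 mapping as a C sketch (for the real BSP, NB=17 and WIDTH=16); the running recurrence replaces a per-address mod-17:

    /* Prime-bank mapping: bank address = A div WIDTH, bank number = A mod NB,
       with the mod kept by a running recurrence. */
    #include <stdio.h>

    #define NB    5     /* prime number of banks */
    #define WIDTH 4     /* banks actually used per row */

    int main(void) {
        int bn = 0;                      /* bank number = A mod NB */
        for (int a = 0; a < 16; a++) {
            printf("addr %2d -> row %d, bank %d\n", a, a / WIDTH, bn);
            if (++bn == NB) bn = 0;      /* BN = BN + 1; if (BN == NB) BN = 0 */
        }
        return 0;
    }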
Processing in Memory (PIM)
--------------------------
Another hope for SIMD?
[Gokhale, Holmes, Iobst, Terasys PIM]
Need general purpose system (e.g., UP or MIMD)
Selectively need massive SIMD
Solution: Integrate SIMD in Memory
Can operate like normal DRAM
Or Do SIMD processing
See Figure 1 (validation system)
Terasys board has 4Kx25b application-specific microcode table
Host uses memory-mapped store to select microinstruction
Terasys broadcasts it to the PIM array.
Global Communication
global OR
partitioned OR
parallel prefix (including nearest neighbor communication)
PIM SRAM
Can look like conventional 32Kx4 SRAM
Or 2Kx64 w/ 64 1b processors
On each clock cycle, processor can
load or store on memory (not both)
load from n/w
store to n/w
perform abc or ab+ac+bc or c+ab
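These ops suffice for bit-serial arithmetic: ab+ac+bc is exactly a full adder's carry. A C sketch (bit i of each word stands in for PE i; the XOR forming the sum bit is an assumed companion op, not one of the three listed above):

    /* Bit-serial addition on 32 one-bit PEs, simulated with word-wide
       bitwise ops: each uint32_t is one bit-plane across all PEs. */
    #include <stdio.h>
    #include <stdint.h>

    #define BITS 8                         /* operand width per PE */

    int main(void) {
        uint32_t a[BITS] = {0}, b[BITS] = {0}, s[BITS] = {0};

        /* load PE0: 3+5 and PE1: 100+27 into the bit-planes */
        uint32_t av[2] = {3, 100}, bv[2] = {5, 27};
        for (int pe = 0; pe < 2; pe++)
            for (int i = 0; i < BITS; i++) {
                a[i] |= ((av[pe] >> i) & 1u) << pe;
                b[i] |= ((bv[pe] >> i) & 1u) << pe;
            }

        uint32_t c = 0;                    /* carry plane */
        for (int i = 0; i < BITS; i++) {   /* one bit per "clock cycle" */
            s[i] = a[i] ^ b[i] ^ c;        /* sum bit (assumed XOR op) */
            c = (a[i] & b[i]) | (a[i] & c) | (b[i] & c);   /* ab+ac+bc */
        }

        for (int pe = 0; pe < 2; pe++) {   /* read results back out */
            unsigned v = 0;
            for (int i = 0; i < BITS; i++) v |= ((s[i] >> pe) & 1u) << i;
            printf("PE%d: %u\n", pe, v);   /* 8 and 127 */
        }
        return 0;
    }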
Global Communication
global OR
OR over all 32K PEs returned to host in one tick
partitioned OR
OR over power-of-two partitions fed back in one tick
parallel prefix
15 levels
level 0 nearest neighbor communication in one tick
level 1 even processors i to both i-1 and i-2
level 2 4n processors to 4n-[4,3,2,1]
level 15 broadcast
Software: dataparallel bit c (dbC)
ANSI C superset
adds parallel arbitrary-length bit streams
...
Will it work?
- not COTS (commercial off-the-shelf) technology
+ BW internal to memory chip very high
* signal processing?
* code breaking?
Critique of SIMD
----------------
+ saves sequencing
- microprocessors are great sequencers
+ great parallelism
- but what of Amdahl's Law?
Future Trends
=============
From Chapter 12
In 1999
#machines       #processors
10s             1,000s
1,000s          100s
100,000s        10s
1,000,000s      few
100,000,000s    1
processor speed 100x/decade
dram speed 2x/decade
==> more latency tolerance and caching
latency tolerance
dynamic instrn issue + non-blocking w/ better prediction
multithreading
MP
but remember Little's Law:
latency to be hidden is L
rate of long-latency requests is R
==> need an average of R*L outstanding requests
==> P processors have P*R*L requests outstanding
==> L could increase with P processors + congestion
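E.g. (illustrative numbers, not from the chapter):
L = 100ns miss latency, one long-latency request per 10ns (R = 0.1/ns)
==> R*L = 10 requests outstanding per processor to hide all latency
==> 16 processors: P*R*L = 160 in flight, and that traffic can itself grow L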
better caches
+ trans/microprocessor 30x/decade
- memory capacity 100x/decade
currently designers minimize average time over a stream of accesses:
HitRate(S)*HitTime + [1 - HitRate(S)]*MissTime
w/ perfect latency tolerance:
max(HitRate(S)*HitTime, [1 - HitRate(S)]*MissTime)
(furthermore, misses use some hit time resources)
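E.g. (illustrative numbers): HitTime = 1 cycle, MissTime = 100 cycles, HitRate(S) = 0.98
today: 0.98*1 + 0.02*100 = 2.98 cycles/access
w/ perfect tolerance: max(0.98*1, 0.02*100) = 2.0 cycles/access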
Non-Linear Effects
Recall
Minicomputers exploded when MSI/LSI made them possible
Killer micro enabled by VLSI
fast UPs
practical SMPs
System on a Chip (SOC)
PC? No, 1G transistors by 2010 ==> < 1Gb = 128MB DRAM
sub-PC: PDA, wireless 802.11 & cellular
Dataflow (more past than future?)
---------------------------------
The ultimate latency tolerance is "dataflow"
No program counter (PC)
View program as "graph"
f = a*b + c*d
subset of operations ready to "fire"
pick one and execute
enabling new instructions to fire
repeat until done
"tagged-token dataflow architecture" (TTDA) added tags to
permit multiple dynamic instances of a static instruction
be in execution
many of these ideas are now used in "window" of dynamic processor
Problems
Require large associative namespace of operands ready to go
This synchronization/storage allocation happens on every instruction
Explicit Token-Store Architecture
[Papadopoulos & Culler 1990]
Associate with a "block" of code (a function) some data memory (a frame)
token:
ip points to instrn that includes
operation
frame offset
one or more destination instructions
Loop
grab a token
lookup instrn
look at frame+offset
if partner value there
then
execute instrn
create and store tokens for each destination instrn
else
store value there (at frame+offset)
Note frame locations have presence bit and left/right operand indicator
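A toy C rendering of this loop (structure and names invented for illustration), specialized to the earlier example f = a*b + c*d:

    /* Explicit token-store sketch (after Papadopoulos & Culler): a two-input
       instrn owns a frame slot with a presence bit; the first token parks
       its value there, the second fires the op. */
    #include <stdio.h>

    enum op { MUL, ADD, PRINT };

    struct instrn { enum op op; int slot; int dest, port; };

    /* i0: a*b -> i2 left; i1: c*d -> i2 right; i2: + -> i3; i3: print */
    static struct instrn prog[] = {
        {MUL,   0,  2, 0},
        {MUL,   1,  2, 1},
        {ADD,   2,  3, 0},
        {PRINT, -1, -1, -1},           /* slot -1: single-input instrn */
    };

    struct token { int ip, port, val; };
    static struct token q[16];         /* token queue */
    static int head, tail;
    static int frame[4], present[4];   /* matching store + presence bits */

    static void emit(int ip, int port, int val) {
        q[tail++] = (struct token){ip, port, val};
    }

    int main(void) {
        emit(0, 0, 2); emit(0, 1, 3);  /* a = 2, b = 3 */
        emit(1, 0, 4); emit(1, 1, 5);  /* c = 4, d = 5 */

        while (head < tail) {                    /* grab a token */
            struct token t = q[head++];
            struct instrn *in = &prog[t.ip];     /* lookup instrn */
            if (in->slot >= 0 && !present[in->slot]) {
                frame[in->slot] = t.val;         /* partner absent: park value */
                present[in->slot] = 1;
                continue;
            }
            int l = t.val, r = t.val;
            if (in->slot >= 0) {                 /* partner present: fire */
                present[in->slot] = 0;
                if (t.port == 0) r = frame[in->slot];
                else             l = frame[in->slot];
            }
            switch (in->op) {
            case MUL: emit(in->dest, in->port, l * r); break;
            case ADD: emit(in->dest, in->port, l + r); break;
            case PRINT: printf("f = %d\n", t.val); break;   /* 26 */
            }
        }
        return 0;
    }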
What happened?
Kept von Neumann model
Dataflow w/ window where synchronization space tractable
E.g., physical registers use "single assignment"
Quantum Computing (QC)
----------------------
Cf. Molecular Computing in Deych & Nugent project
[Daniel Gottesman's Home Page]
[Gershenfeld & Chuang, Scientific American]
Recall from quantum mechanics: electron in many states until observed
Classical computer: many bits, each in one state
QC: many quantum bits (qubits) in superposition of many states
(until observed)
QC theory
E.g., Peter Shor solved subproblem of RSA public key cryptography
Modular exponentiation y = x^r mod n is easy
Discrete log is hard: given x, n, y, find r
Shor's QC algorithm calculates for all possible r at once
Gets a periodic function with peaks spaced 1/r apart
Take Fourier transform to find r
QC engineering
Use chloroform (CHCl3) with NMR to get 2 qubits (at most 10 are possible)
a long way to go....
Recall Technology Paradox
Old
Mechanical-->Tubes-->Transistors-->ICs
Many-->Core-->Semiconductor Memory
Many-->Magnetic Disks
Great Communication Inventions
------------------------------
Writing
Printing Press
Telegraph
Telephone
Radio
Television
Internet
Last 40 years / Next 40 years