--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Single Instruction Multiple Data (SIMD)

Outline
-------

  Review Data Parallel
  CM-1,2,5
  PIM
  (BSP)

  References:
    [Hillis/Steele] for SW
    [Leiserson et al.] for CM-5
    [Kuck/Stokes] for BSP


Review Data Parallel Model
--------------------------

  Sum in Log Time
    Hillis/Steele Figure 1

    for j = 1 to log2(n)
      for all k in parallel do
        if (k+1) mod 2^j == 0 do
          x[k] = x[k - 2^(j-1)] + x[k]

  All Partial Sums in Log Time
    Hillis/Steele Figure 2

    if (k+1) mod 2^j == 0 do  ==>  if k >= 2^(j-1) do

    Can also do and, or, etc.

  SKIP {{
  Quiz: gather positive x[k] into y[]

    if x[k] > 0 then i[k] = 1 else i[k] = 0
    do partial sums on i[k]
    if x[k] > 0 then y[i[k]] = x[k]
  }}


Large-Scale Parallel SIMD
-------------------------

  Large array of processing elements w/ front end
  Early: Goodyear MPP, Illinois Illiac IV

  Sequencer --- many P M --- host

  Host dispatches "macro" instructions and executes some scalar instructions
  Sequencer broadcasts to bit-serial processors (e.g., 64K)

  Literal realization of DP
  Or DP is an abstraction of SIMD

  If threads diverge -- e.g., then/else clause -- mask and do one at a time


Thinking Machines
-----------------

  CM-1 [Tucker/Robinson, IEEE Computer 8/88]
    1986
    1-4 front ends
    64K bit-serial data processors
      4Kb memory
      ALU and accumulator
    Hypercube & 2D-grid interface

  CM-2
    1987 evolutionary upgrade
    64Kb memory/PE, optional FPU

  E.g., with CM-1 and CM-2
    All instructions subject to "context" flag
    Processor "selected" if context=1
    Saving/restoring context unconditional
    and, or, not, etc. conditional
    support for integer and FP?
    host broadcasts globals
    send instrn
    time multiplex to get virtual processor per datum

  Reviews
    * David: When is higher VP ratio better?, Tony: VP interesting,
      Syed: VP, AndrewN. : VP vs. timesharing
    * Shijin & Syed: Hypercube? & Marc: Grid vs.
      hypercube
    * Brian: Data parallel today -- Nvidia GeForce 8 for Wednesday
    * Guoliang & Andrew N.: Control vs. data parallelism?
    * Aditya: processor pipelined? no.
    * Daniel: CM reliability?

  CM-5
    The Network Architecture of the Connection Machine CM-5,
    Charles E. Leiserson and others,
    The Journal of Parallel and Distributed Computing,
    March 15, 1996 (revised from SPAA 1992).

    ~1991
    Not SIMD, but MIMD with Data Parallel support
    PEs w/ microprocessors
      plus vector units
      plus control network
      plus data parallel compilers
    but
      vector units hard to use
      control network overkill

    Draw/Show Figure 1.
      Network interfaces accessed via uncacheable loads and stores
      mapped to user addresses.
    Draw/Show Figure 2 Fat Tree
    Draw/Show Figure 3 CM-5 Data N/W
      Short messages -- five words?
      Coarse timesharing via all-fall-down
    Control N/W -- broadcasts and combining (of associative operations)
      E.g., data N/W empty when SUM sent - SUM received = 0.


Other SIMD Approaches
---------------------

  Short, Parallel SIMD
    E.g., MMX

  Pipelined SIMD ==> Vectors
    short as in Cray-1
    long as in CDC Cyber 205 (up to 64K)

  Medium, Parallel SIMD
    E.g., BSP (next)

  Processing in Memory (next)


Processing in Memory (PIM)
--------------------------

  Another hope for SIMD?
  [Gokhale, Holmes, Iobst, Terasys PIM]

  Need general purpose system (e.g., UP or MIMD)
  Selectively need massive SIMD

  Solution: Integrate SIMD in Memory
    Can operate like normal DRAM
    Or do SIMD processing

  See Figure 1 (validation system)
    Terasys board has 4Kx25b application-specific microcode table
    Host uses memory-mapped store to select microinstruction
    Terasys broadcasts it to PIM array.

  Global Communication
    global OR
    partitioned OR
    parallel prefix (including nearest neighbor communication)

  PIM SRAM
    Can look like conventional 32Kx4 SRAM
    Or 2Kx64 w/ 64 1b processors

  On each clock cycle, processor can
    load or store on memory (not both)
    load from n/w
    store to n/w
    perform abc or ab+ac+bc or c+ab

  Global Communication
    global OR
      OR of 32K units returned to host in one tick
    partitioned OR
      OR of power-of-2 group fed back in one tick
    parallel prefix
      15 levels
      level 0 nearest neighbor communication in one tick
      level 1 even processors i to both i-1 and i-2
      level 2 4n processors to 4n-[4,3,2,1]
      level 15 broadcast

  Software: data-parallel bit C (dbC)
    ANSI C superset
    adds parallel arbitrary-length bit streams
    ...

  Will it work?
    - not COTS (commercial off-the-shelf)
    Technology
    + BW internal to memory chip very high
    * signal processing?
    * code breaking?


Critique of SIMD
----------------

  + saves sequencing
  - microprocessors are great sequencers

  + great parallelism
  - but what of Amdahl's law


Burroughs Scientific Processor (BSP) (OPTIONAL)
-----------------------------------------------

  Interesting, in part, because it focuses on dependent, rather than
  independent, operations (like Multiscalar)

  FAPAS pipeline
    F: 16 of 17 memory elements (fetch)
    A: alignment network (align)
    P: 16 processor elements (process)
    A: alignment network (align)
    S: 16 of 17 memory elements (store)

  What instructions?
    Vector instructions with 1-5 operands
    Simple: Z[i] = A[i] + B[i]
    5-op:   Z[i] = (A[i] op B[i]) op (C[i] op D[i]) op E[i]
    Reduce: X[i] = A[i,0] op A[i,1] op A[i,2] ...
    expand, merge, compress

  Why 17 memory banks?
    prime number to mitigate conflicts
    At each address use 16 banks and leave one bank empty
    address = A div 16
    bank number = A mod 17
    use recurrence to avoid non-power-of-two mod operation
      BN = BN + 1
      if (BN==17) BN = 0

  Example w/ 4+1 banks

                     Bank Number
    Address      0    1    2    3    4
       0         0    1    2    3   --
       1         5    6    7   --    4
       2        10   11   --    8    9
       3        15   --   12   13   14
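
  Sketch of the bank mapping above

    A minimal Python sketch of the skewed storage scheme, assuming 0-based
    linear element addresses A. The names bank_of, layout, and banks_touched
    are hypothetical helpers for illustration, not taken from [Kuck/Stokes].
    layout() walks unit-stride addresses with the BN recurrence from the
    notes instead of a mod; banks_touched() hints at why a prime bank count
    may help with power-of-two strides.

```python
def bank_of(a, banks=17, per_row=16):
    """Return (bank number, address within bank) for linear address a.

    Mirrors the notes: bank number = A mod 17, address = A div 16.
    """
    return a % banks, a // per_row


def layout(n_elems, banks=17, per_row=16):
    """Place elements 0..n_elems-1 using the BN recurrence (no mod)."""
    rows = (n_elems + per_row - 1) // per_row
    table = [[None] * banks for _ in range(rows)]  # None marks the empty slot
    bn = 0
    for a in range(n_elems):
        table[a // per_row][bn] = a
        bn += 1            # BN = BN + 1
        if bn == banks:    # if (BN == 17) BN = 0
            bn = 0
    return table


def banks_touched(start, stride, count, banks=17):
    """Set of distinct banks hit by a strided access of `count` elements."""
    return {(start + i * stride) % banks for i in range(count)}


if __name__ == "__main__":
    # 4+1 bank example from the notes: 16 elements, 5 banks, 4 used per row.
    for addr, row in enumerate(layout(16, banks=5, per_row=4)):
        cells = " ".join("--" if v is None else f"{v:2d}" for v in row)
        print(addr, cells)
```

    With banks=5 and per_row=4 the layout should match the 4+1 table above.
    As a rough comparison, banks_touched(0, 16, 16) spreads a stride-16
    access over 16 different banks when there are 17 banks, whereas the same
    access with banks=16 lands entirely on bank 0.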