W. Daniel Hillis and Guy L. Steele, Jr. "Data Parallel Algorithms." Communications of the ACM, December 1986, pp. 1170-1183. ACM DL Link
Programming styles.
Control parallel style (multiprocessing)
Data Parallel
Connection Machine Architecture
16,384 or 65,536 processors (depending on configuration), each with 4K bits of memory.
A front-end processor (conventional von Neumann machine) drives the array of processors above.
The processor array extends the ISA of the front-end processor.
SIMD.
Processors:
State bit > decides whether an instruction is executed by that processor.
Context: the set of active processors > context flag > state bit.
Immediate values are broadcast from the front end to all processors (sketch below).
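A minimal sketch (mine, not from the paper) of how a context flag gates a broadcast instruction, simulating the processor array as a NumPy vector with one slot per processor:

```python
import numpy as np

# One array slot per SIMD processor; each holds one value in its local memory.
memory = np.array([3, 7, 1, 9, 4, 6, 2, 8])

# Context flag / state bit: only flagged processors execute the next instruction.
context = memory > 4

# The front end broadcasts an immediate value; flagged processors add it,
# the others keep their old value (they sit the instruction out).
immediate = 100
memory = np.where(context, memory + immediate, memory)

print(memory)   # [  3 107   1 109   4 106   2 108]
```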
ILLIAC IV, Goodyear MPP, NON-VON, ICL DAP, CM > all data parallel machines.
Pointer-based communication > SEND instr: processor address + memory address within that processor. A kind of indirect store (sketch below).
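And a sketch of SEND as an indirect store (again my own simulation; I collapse each processor's memory to a single word, whereas the real instruction also names a memory address inside the destination processor):

```python
import numpy as np

# Each "processor" holds a value and a pointer (destination processor address).
values    = np.array([10, 20, 30, 40])
dest_proc = np.array([2, 0, 3, 1])     # processor i SENDs its value to dest_proc[i]

# SEND as an indirect store: a scatter into the destination processors' memory.
mailbox = np.zeros_like(values)
mailbox[dest_proc] = values

print(mailbox)   # [20 40 10 30]
```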
Virtual processors > a microcoded controller between the front end and the array abstracts the program from hardware details.
1+ The same program runs on different hardware configurations.
2+ Processors may be regarded as an expandable resource => Fortran preallocates all memory, but other languages allocate a "fresh" processor with a procedure call (much as malloc does in C) => processor-cons > memory comes bundled with a processor (toy model below).
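A toy model (my own, with purely illustrative names) of processor-cons and virtual processors: allocating a "fresh" processor returns memory bundled with a processor, and a controller-like pool multiplexes virtual processors onto a fixed physical array:

```python
class VirtualProcessorPool:
    """Toy model of processor-cons: a controller-like pool hands out virtual
    processors, each bundled with its own memory, on top of a fixed physical array."""

    def __init__(self, n_physical, mem_words=16):
        self.n_physical = n_physical
        self.mem_words = mem_words
        self.virtual = []                     # grows on demand, like a heap

    def cons(self):
        """Allocate a 'fresh' processor; memory comes with the processor."""
        vp = {
            "id": len(self.virtual),
            "physical": len(self.virtual) % self.n_physical,   # multiplexing
            "memory": [0] * self.mem_words,
        }
        self.virtual.append(vp)
        return vp

pool = VirtualProcessorPool(n_physical=4)
cell = pool.cons()        # like malloc in C, except you get a processor too
cell["memory"][0] = 42
```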
The paper goes on at length about algorithm adaptations to data parallel architectures, and yes, they can be done.
Coolest conclusion:
When operating on lots of data, the parallelism to be gained by concurrently operating on multiple data elements will be greater than the parallelism to be gained by concurrently executing lines of code (toy illustration below).
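A toy illustration of that point, assuming one processor per element: summing n numbers takes O(log n) parallel steps; here simulated sequentially in NumPy (my own sketch):

```python
import numpy as np

def parallel_sum(x):
    """Sum of an array with one processor per element, in O(log n) parallel steps.

    At step d, each active processor adds in the value held 2**(step) positions
    to its right; a SIMD machine does each step as one instruction across all
    processors."""
    x = np.array(x, copy=True)
    n = len(x)
    d = 1
    while d < n:
        left = np.arange(0, n - d, 2 * d)   # processors active this step
        x[left] += x[left + d]
        d *= 2
    return x[0]

print(parallel_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # 36
```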
Review from cs757
"Data Parallel Algorithms" discusses algorithms that run efficiently on a large array of parallel processors and introduces techniques for parallelizing seemingly serial algorithms over large input data. The system considered executes the front-end processor's (VAX or Symbolics 3600) ISA, plus floating point arithmetic, on large data in parallel, supported by an array of Connection Machine processors. A flexible network is necessary to make programming and compiling easier, since it hides the architecture. By using two types of instructions, one that executes on all processors (conditionally, based on a context) and another that specifies the interconnect, the system retains the advantages of SIMD despite having many processors. The authors give implementations for algorithms that are inherently parallel, such as summing an array of numbers, computing partial sums of an array (see the sketch below), and sorting. Furthermore, they explain adaptations of serial tasks to make full use of the parallel architecture; for example, parsing a string by starting with each character in parallel and forming larger tokens by recursively analyzing neighbors. In the 80s, processing of pointers was part of most tasks, and the paper explains how this too can be done over parallel processors.
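As a concrete example of the partial-sums algorithm mentioned above, here is a sequential NumPy simulation of the log-step scan described in the paper (the simulation itself is my own sketch):

```python
import numpy as np

def hillis_steele_scan(x):
    """Inclusive prefix sum in O(log n) parallel steps, one element per processor.

    At each step, every processor adds in the value 2**step positions to its
    left; a SIMD machine performs each step as one parallel instruction."""
    x = np.array(x, copy=True)
    n = len(x)
    d = 1
    while d < n:
        shifted = np.concatenate([np.zeros(d, dtype=x.dtype), x[:-d]])
        x = x + shifted
        d *= 2
    return x

print(hillis_steele_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# [ 1  3  6 10 15 21 28 36]
```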
The paper provides run times measured on the 65,536-processor Connection Machine system for certain algorithms. The model used assumes unit time for all instructions, including data transfer between processors over the interconnect, and resource availability is assumed to be infinite. Parallel algorithms can effectively be mapped to the Connection Machine system. The paper tries to show that many applications with a large enough data set can be mapped efficiently to a parallel system by converting the computation into a binary tree. The key idea is to increase performance by employing one processor per data element.

The paper reads more like a marketing white paper for the Connection Machine system. Several of the examples execute efficiently on a parallel processor architecture, but a few inherently serial operations are tweaked to fit it. The resulting efficiency is highly questionable; extra complexity is introduced just to find and fit algorithms that run efficiently on this system. Written in 1986, it discusses one of the first vector-style data-parallel processors. Unlike present-day implementations of large vector processors, it introduces many small processors that cycle synchronously, each executing the same instruction on a different piece of data. This idea of viewing computing as operating on groups of data rather than as sequential execution, using a group of simple processors connected by a flexible interconnect network, still inspires new research such as Explicit Data Graph Execution.
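The pointer processing the review alludes to is handled in the paper by pointer doubling on linked lists; the sketch below (my own sequential simulation, with illustrative array names) computes, for every cell of a linked list, the sum of its own value and all values after it, in O(log n) steps:

```python
import numpy as np

def linked_list_suffix_sums(next_idx, values, NIL=-1):
    """Pointer doubling, one processor per list cell: each step, every live cell
    adds in its successor's value and then re-points its pointer two hops ahead.
    After O(log n) steps each cell holds the sum of itself and all later cells;
    a SIMD machine performs each step as one parallel instruction."""
    nxt = np.array(next_idx, copy=True)
    val = np.array(values, copy=True)
    n = len(val)
    steps = int(np.ceil(np.log2(max(n, 2))))
    for _ in range(steps):
        live = np.flatnonzero(nxt != NIL)
        val[live] += val[nxt[live]]        # add successor's (old) value
        nxt[live] = nxt[nxt[live]]         # pointer doubling: skip one cell ahead
    return val

# Cells stored in arbitrary array order; logical list order is 2 -> 0 -> 3 -> 1.
next_idx = [3, -1, 0, 1]
values   = [10, 40, 5, 20]
print(linked_list_suffix_sums(next_idx, values))   # [70 40 75 60]
```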