Section \#2: PRAM models
(CS838: Topics in parallel computing, CS1221, Thu, Jan 21, 1999, 8:00-9:15 a.m.)
Pavel Tvrdik

The contents

RAM model
PRAM model
Handling shared memory access conflicts
Computational power
Simulation of large PRAMs on small PRAMs
Simulation of stronger PRAM models on weaker ones
Conclusion

RAM model

Random Access Machine is a favorite model of a sequential computer. Its main features are:

Computation unit with a user defined program.
Read-only input tape and write-only output tape.
Unbounded number of local memory cells.
Each memory cell is capable of holding an integer of unbounded size.
Instruction set includes operations for moving data between memory cells, comparisons and conditional branches, and simple arithmetic operations.
Execution starts with the first instruction and ends when a HALT instruction is executed.
All operations take unit time regardless of the lengths of operands.
Time complexity = the number of instructions executed.
Space complexity = the number of memory cells accessed.
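
The cost measures above can be illustrated with a toy interpreter. This is only a sketch: the instruction names (SET, ADD, SUB, JGTZ, HALT) and the tuple encoding of a program are assumptions made for the illustration, not a standard RAM instruction set, and the input tape is replaced by a preloaded memory cell.

```python
def run_ram(program, memory=None):
    """Run a toy RAM program; return (memory, time, space).

    Hypothetical instruction set (an assumption of this sketch):
      ("SET", a, c)     mem[a] = c
      ("ADD", a, b, c)  mem[a] = mem[b] + mem[c]
      ("SUB", a, b, c)  mem[a] = mem[b] - mem[c]
      ("JGTZ", a, t)    jump to instruction t if mem[a] > 0
      ("HALT",)         stop
    """
    mem = dict(memory or {})
    accessed = set()          # cells touched by the program
    pc = 0                    # execution starts with the first instruction
    time = 0                  # every instruction costs one unit

    def get(a):
        accessed.add(a)
        return mem.get(a, 0)

    def put(a, v):
        accessed.add(a)
        mem[a] = v

    while True:
        op, *args = program[pc]
        time += 1
        pc += 1
        if op == "HALT":
            break
        elif op == "SET":
            put(args[0], args[1])
        elif op == "ADD":
            put(args[0], get(args[1]) + get(args[2]))
        elif op == "SUB":
            put(args[0], get(args[1]) - get(args[2]))
        elif op == "JGTZ":
            if get(args[0]) > 0:
                pc = args[1]
        else:
            raise ValueError(f"unknown instruction {op}")
    return mem, time, len(accessed)


# Sum 1 + 2 + ... + n, with n preloaded in cell 0 and the result in cell 1.
SUM_PROGRAM = [
    ("SET", 1, 0),      # 0: acc = 0
    ("SET", 2, 1),      # 1: constant 1
    ("JGTZ", 0, 4),     # 2: if n > 0 enter the loop
    ("HALT",),          # 3
    ("ADD", 1, 1, 0),   # 4: acc += n
    ("SUB", 0, 0, 2),   # 5: n -= 1
    ("JGTZ", 0, 4),     # 6: repeat while n > 0
    ("HALT",),          # 7
]

mem, time, space = run_ram(SUM_PROGRAM, {0: 4})
```

For n > 0 this program executes 3n+4 instructions and touches three cells, independently of how large the stored integers become.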

PRAM model

Parallel Random Access Machine is a straightforward and natural generalization of RAM. It is an idealized model of a shared memory SIMD machine. Its main features are:

Unbounded collection of numbered RAM processors P₀, P₁, P₂,... (without tapes).
Unbounded collection of shared memory cells M[0], M[1], M[2],....
Each P_i has its own (unbounded) local memory (registers) and knows its index i.
Each processor can access any shared memory cell (unless there is an access conflict, see further) in unit time.
Input of a PRAM algorithm consists of n items stored in (usually the first) n shared memory cells.
Output of a PRAM algorithm consists of n' items stored in n' shared memory cells.
PRAM instructions execute in 3-phase cycles.
1. Read (if any) from a shared memory cell.
2. Local computation (if any).
3. Write (if any) to a shared memory cell.
Processors execute these 3-phase PRAM instructions synchronously.
Special assumptions have to be made about R-R and W-W shared memory access conflicts.
The only way processors can exchange data is by writing into and reading from memory cells.
P₀ has a special activation register specifying the maximum index of an active processor. Initially, only P₀ is active; it computes the number of required active processors and loads this register, and then the other corresponding processors start executing their programs.
Computation proceeds until P₀ halts, at which time all other active processors are halted.
Parallel time complexity = the time elapsed for P₀'s computation.
Space complexity = the number of shared memory cells accessed.
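
The synchronous 3-phase cycle can be sketched as follows. The function names `pram_cycle` and `parallel_sum` and the callback interface are assumptions of this illustration; the activation register is not modeled, and the input length is assumed to be a power of two. The example is a tree summation in which every cell is read and written by at most one processor per cycle.

```python
def pram_cycle(shared, locals_, read_addr, compute):
    """One synchronous 3-phase cycle executed by all processors.

    read_addr(i, local) -> address to read, or None
    compute(i, local, value) -> (address, value) to write, or None
    """
    p = len(locals_)
    # Phase 1: all reads happen before any write of this cycle.
    fetched = []
    for i in range(p):
        a = read_addr(i, locals_[i])
        fetched.append(None if a is None else shared[a])
    # Phase 2: local computation, possibly producing one write request.
    requests = [compute(i, locals_[i], fetched[i]) for i in range(p)]
    # Phase 3: all writes are applied (this sketch assumes no write conflicts).
    for req in requests:
        if req is not None:
            addr, value = req
            shared[addr] = value


def parallel_sum(values):
    """Sum n = 2^k values with n processors in 1 + log n cycles."""
    n = len(values)
    shared = list(values)
    locals_ = [{"acc": 0} for _ in range(n)]
    cycles = 0

    # First cycle: P_i loads M[i] into a local register, writes nothing.
    def load(i, loc, v):
        loc["acc"] = v
        return None

    pram_cycle(shared, locals_, lambda i, loc: i, load)
    cycles += 1

    stride = 1
    while stride < n:
        # P_i with i divisible by 2*stride reads its partner's partial sum.
        def addr(i, loc, s=stride):
            return i + s if i % (2 * s) == 0 else None

        def add(i, loc, v):
            if v is None:
                return None
            loc["acc"] += v
            return (i, loc["acc"])

        pram_cycle(shared, locals_, addr, add)
        cycles += 1
        stride *= 2
    return shared[0], cycles


total, cycles = parallel_sum([3, 1, 4, 1, 5, 9, 2, 6])
```

Here the parallel time is the number of cycles, while the space is the n shared cells holding the input.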

PRAM is an attractive and important model for designers of parallel algorithms. Why?

It is natural: the number of operations executed per one cycle on p processors is at most p.
It is strong: any processor can read or write any shared memory cell in unit time.
It is simple: it abstracts from any communication or synchronization overhead, which makes the complexity and correctness analysis of PRAM algorithms easier. Therefore,
It can be used as a benchmark: If a problem has no feasible/efficient solution on PRAM, it has no feasible/efficient solution on any parallel machine.
It is useful: it is an idealization of existing (and nowadays increasingly common) shared memory parallel machines.

The PRAM corresponds intuitively to the programmer's view of a parallel computer: it ignores lower-level architectural constraints and details, such as memory access contention and overhead, synchronization overhead, interconnection network throughput, connectivity, speed limits, and link bandwidths.

Constrained PRAM models

Bounded number of shared memory cells. This may be called a small memory PRAM. If the input data set exceeds the capacity of the shared memory, the input and/or output values can be distributed evenly among the processors.
Bounded size of a machine word and/or memory cell. This parameter is usually called word size of PRAM.
Bounded number of processors. This may be called a small PRAM. If the number of threads of execution is higher, processors may interleave several threads.
Constraints on simultaneous access to shared memory cells: handling access conflicts.

Limiting the amount of PRAM shared memory corresponds to restricting the amount of information that can be communicated between processors in one step. For example, a distributed memory machine with processors interconnected by a shared bus can be modeled as a PRAM with one shared memory cell.


Handling shared memory access conflicts

To make the PRAM model realistic and useful, some mechanism has to be defined to resolve read and write access conflicts to the same shared memory cell.

Exclusive Read Exclusive Write (EREW) PRAM: No two processors are allowed to read or write the same shared memory cell simultaneously.
Concurrent Read Exclusive Write (CREW) PRAM: Simultaneous reads of the same memory cell are allowed, but only one processor may attempt to write to an individual cell.
Concurrent Read Concurrent Write (CRCW) PRAM: Both simultaneous reads and simultaneous writes of the same memory cell are allowed.

Concurrent Read has a clear semantics, whereas Concurrent Write has to be further constrained. There exist several basic submodels:

PRIORITY CRCW: the processors are assigned fixed distinct priorities and the processor with the highest priority is allowed to complete WRITE.
ARBITRARY CRCW: one arbitrarily chosen processor is allowed to complete WRITE. The algorithm may make no assumptions about which processor was chosen.
COMMON CRCW: all processors are allowed to complete WRITE iff all the values to be written are equal. Any algorithm for this model has to make sure that this condition is satisfied. If not, the algorithm is illegal and the machine state will be undefined.
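
The three write rules can be made concrete with a small resolver. This is a sketch under assumptions: the function name `resolve_writes` and the request format are invented for the illustration, PRIORITY is taken to mean "lowest index wins", ARBITRARY is modeled by a pseudo-random pick, and an illegal access raises an error instead of leaving the machine state undefined.

```python
import random


def resolve_writes(requests, model, rng=random):
    """Resolve the write phase of one cycle.

    requests: list of (processor_index, address, value)
    model: "EREW", "CREW", "PRIORITY", "ARBITRARY" or "COMMON"
    Returns a dict {address: value actually written}.
    """
    by_addr = {}
    for proc, addr, value in requests:
        by_addr.setdefault(addr, []).append((proc, value))

    written = {}
    for addr, writers in by_addr.items():
        if model in ("EREW", "CREW"):
            # Exclusive write: two writers to one cell are not allowed.
            if len(writers) > 1:
                raise RuntimeError(f"write conflict on cell {addr}")
            written[addr] = writers[0][1]
        elif model == "PRIORITY":
            # Assumption: a lower index means a higher priority.
            written[addr] = min(writers, key=lambda w: w[0])[1]
        elif model == "ARBITRARY":
            # Any writer may win; the algorithm must not rely on which one.
            written[addr] = rng.choice(writers)[1]
        elif model == "COMMON":
            # Legal only if all writers agree on the value.
            if len({value for _, value in writers}) > 1:
                raise RuntimeError(f"unequal values written to cell {addr}")
            written[addr] = writers[0][1]
        else:
            raise ValueError(f"unknown model {model}")
    return written


reqs = [(2, 5, "b"), (0, 5, "a"), (1, 7, "c")]
result = resolve_writes(reqs, "PRIORITY")   # cell 5 receives the value of P_0
```

The same request list is illegal on COMMON (the two values for cell 5 differ) and on the exclusive-write models.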

The following example demonstrates the differences among submodels.

Example

Assume p-processor PRAM, p<n. Assume that shared memory contains n distinct items and P₀ owns value x. The task is to let P₀ know whether x occurs within the input array.

EREW PRAM algorithm:
1. P₀ broadcasts x to P₁,...,P_p in log p steps using a binary broadcast tree.
2. All processors perform local searches, each on ⌈n/p⌉ items in ⌈n/p⌉ steps.
3. Every processor defines a flag Found and all processors perform a parallel reduction of these flags in log p steps.
```
T(n,p)=log p + n/p
```
CREW PRAM algorithm: A similar approach, but P₁,...,P_p can read x simultaneously in O(1) time. But the final reduction takes O(log p) time anyway, so
```
T(n,p)=log p + n/p
```
COMMON CRCW PRAM algorithm: The final step now also takes O(1) time, since all processors with the flag Found set can write simultaneously into P₀'s cell, so
```
T(n,p)=n/p.
```
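
The three running times can be compared numerically with the following sketch. The function `pram_search` and its exact step counts are assumptions: the text gives only the asymptotic terms, so each O(1) phase is counted as one step and each logarithmic phase as ⌈log₂ p⌉ steps.

```python
from math import ceil


def ceil_log2(p):
    """Smallest k with 2**k >= p, for p >= 1."""
    return (p - 1).bit_length()


def pram_search(items, x, p, model):
    """Return (found, steps) for the search example on a given submodel.

    Step accounting (an assumption of this sketch):
      broadcast of x : log p on EREW, 1 on CREW and COMMON CRCW
      local search   : ceil(n/p) on every submodel
      combining flags: log p on EREW and CREW, 1 on COMMON CRCW
    """
    n = len(items)
    block = ceil(n / p)
    log_p = ceil_log2(p)

    # Each processor scans its own block of at most ceil(n/p) items.
    flags = [x in items[k * block:(k + 1) * block] for k in range(p)]
    found = any(flags)

    if model == "EREW":
        steps = log_p + block + log_p
    elif model == "CREW":
        steps = 1 + block + log_p
    elif model == "COMMON":
        steps = 1 + block + 1
    else:
        raise ValueError(f"unknown model {model}")
    return found, steps


items = list(range(0, 64, 2))        # n = 32 distinct items
for model in ("EREW", "CREW", "COMMON"):
    print(model, pram_search(items, 10, 8, model))
```

With n=32 and p=8 the local search costs 4 steps on every submodel; only the broadcast and the combining phase differ.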


Computational power

Given this range of submodels, we must ask how they compare in their ability to execute parallel algorithms. Various submodels may have different computational power.

Definition

PRAM submodel A is computationally stronger than submodel B, written A>=B, if any algorithm written for B will run unchanged on A in the same parallel time, assuming the same basic properties.

Lemma

PRIORITY >= ARBITRARY >= COMMON >= CREW >= EREW


Simulation of large PRAMs on small PRAMs

Small PRAMs can simulate larger PRAMs. Even though relatively simple, the following two simulations are very useful and widely used. The first result says that if we decrease the number of processors, the cost of a PRAM algorithm does not change, up to a multiplicative constant.

Lemma

Assume p'<p. Any problem that can be solved on a p-processor PRAM in t steps can be solved on a p'-processor PRAM in t'=O(tp/p') steps assuming the same size of shared memory.

Proof:

Partition p simulated processors into p' groups of size p/p' each.
Associate each of the p' simulating processors with one of these groups.
Each of the simulating processors simulates one step of its group of processors by:
1. executing all their READ and local computation substeps first,
2. then executing all their WRITE substeps.
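
The proof can be sketched in code as follows. The function `simulate_step` and its callback interface are assumptions of this illustration, the write phase assumes exclusive writes, and the returned step count charges one step per simulated read-and-compute and one per simulated write.

```python
def simulate_step(shared, locals_, read_addr, compute, p_small):
    """Simulate ONE step of a p-processor PRAM on p_small processors.

    read_addr(i, local) -> address to read, or None
    compute(i, local, value) -> (address, value) to write, or None
    Returns the number of steps spent by the small PRAM.
    """
    p = len(locals_)
    group = -(-p // p_small)              # ceil(p / p_small) processors per group
    requests = [None] * p

    # Pass 1: every READ and local computation of the simulated step.
    # In round r, simulating processor g handles simulated processor g*group + r.
    for r in range(group):
        for g in range(p_small):
            i = g * group + r
            if i < p:
                a = read_addr(i, locals_[i])
                value = None if a is None else shared[a]
                requests[i] = compute(i, locals_[i], value)

    # Pass 2: only now are the buffered WRITEs performed.
    for r in range(group):
        for g in range(p_small):
            i = g * group + r
            if i < p and requests[i] is not None:
                addr, value = requests[i]
                shared[addr] = value
    return 2 * group


# Cyclic shift: P_i reads M[i] and writes it into M[(i+1) mod 8].
shared = list(range(8))
steps = simulate_step(
    shared,
    [None] * 8,
    lambda i, loc: i,
    lambda i, loc, v: ((i + 1) % 8, v),
    p_small=2,
)
```

The cyclic shift shows why the two passes are needed: if a simulating processor wrote M[1] before another simulated processor had read it, the old value would be lost.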

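This result has an important consequence. Taking p'=1, a p-processor algorithm running in t steps yields a sequential algorithm running in O(tp) steps. Hence, if we develop a PRAM algorithm whose cost C(n,p)=p·T(n,p) satisfies C(n,p)=o(SU(n)), where SU(n) is the best known sequential upper bound, we have automatically developed a better sequential algorithm.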

Lemma

Assume m'<m. Any problem that can be solved on a p-processor and m-cell PRAM in t steps can be solved on a max(p,m')-processor m'-cell PRAM in O(tm/m') steps.

Proof:

Partition m simulated shared memory cells into m' contiguous segments S_i of size m/m'.
Each simulating processor P'_i, 1<= i<= p, will simulate processor P_i of the original PRAM.
Each simulating processor P'_i, 1<= i<= m', stores the initial contents of S_i into its local memory and will use M'[i] as an auxiliary memory cell for simulation of accesses to cells of S_i.
Simulation of one original READ operation:
each P'_i, i=1,...,max(p,m') repeats for k=1,...,m/m':
1. write the value of the k-th cell of S_i into M'[i], i=1,...,m',
2. read the value which the simulated processor P_i, i=1,...,p, would read in this simulated substep, if it appeared in the shared memory.
The local computation substep of P_i, i=1,...,p, is simulated in one step by P'_i.
Simulation of one original WRITE operation is analogous to that of READ.
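
The READ simulation can be sketched as follows. The function `simulate_reads` is an assumption of this illustration; it uses 0-based indices, keeps each segment S_i as a Python list standing for the local memory of P'_i, and charges two steps per round (one publish, one read).

```python
def simulate_reads(cells, wanted, m_small):
    """Simulate one READ substep of an m-cell PRAM with only m_small cells.

    cells:  contents of the m original shared cells (0-based)
    wanted: wanted[j] is the original address P_j reads, or None
    Returns (values read by each P_j, steps used).
    """
    m = len(cells)
    seg = -(-m // m_small)                 # segment size ceil(m / m_small)
    # P'_i keeps segment S_i in its local memory.
    local_seg = [cells[i * seg:(i + 1) * seg] for i in range(m_small)]
    small_shared = [None] * m_small        # the cells M'[0..m_small-1]
    got = [None] * len(wanted)
    steps = 0

    for k in range(seg):
        # Substep 1: the owner P'_i publishes the k-th cell of S_i in M'[i].
        for i in range(m_small):
            if k < len(local_seg[i]):
                small_shared[i] = local_seg[i][k]
        # Substep 2: P'_j reads M'[i] if its target is the k-th cell of S_i.
        for j, a in enumerate(wanted):
            if a is not None and a % seg == k:
                got[j] = small_shared[a // seg]
        steps += 2
    return got, steps


cells = [10, 11, 12, 13, 14, 15, 16, 17]       # m = 8
got, steps = simulate_reads(cells, [5, 0, None, 7], m_small=2)
```

Every original address is visible in the small shared memory during exactly one of the m/m' rounds, which gives the O(tm/m') bound of the lemma.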


Simulation of stronger PRAM models on weaker ones

It is very useful to know efficient simulations of stronger PRAM models on weaker ones, since a stronger model is more convenient for the design of algorithms, whereas weaker models, such as EREW, are closer to real parallel computers. Since it is technologically difficult to build full massively parallel CREW or CRCW PRAM computers, it is important to understand the costs of simulating the CREW or CRCW machines on EREW. Any multiple access has to be converted into a series of exclusive accesses. The most important are simulations of the strongest PRIORITY CRCW on the weakest EREW.

Lemma

Assume PRIORITY CRCW with the priority scheme based trivially on indexing: lower indexed processors have higher priority. One step of p-processor m-cell PRIORITY CRCW can be simulated by a p-processor mp-cell EREW PRAM in O(log p) steps.

Proof:(Constructive.)

Each PRIORITY processor P_k is simulated by EREW processor P'_k.
Each shared memory cell M[i], i=1,...,m, of PRIORITY is simulated by an array of p shared memory cells M'[i,k], k=1,...,p, in the EREW PRAM. M'[i,1] plays the role of M[i]. M'[i,2],...,M'[i,p] are auxiliary cells used for resolving conflicts, initially empty, and organized as internal nodes of a complete binary tree T_i with p leaves. The height of every T_i is ⌈log p⌉.
Simulation of a PRIORITY WRITE substep: Each EREW processor must find out whether it is the processor with the lowest index within the group of processors asking to write to the same cell, and if so, it must become the group winner and perform the WRITE operation. The other processors in the group will fail. This is done as follows:
1. If P_k wants to write into M[i], processor P'_k turns active and becomes k-th leaf of T_i. It knows whether it is the right or left child of its parent.
2. Each active left processor stores its ID into the parent cell in its tree, marks it as occupied, and remains active.
3. Each active right processor checks its parent cell. If this is empty, it stores its ID into it and remains active. If it is occupied, it becomes idle (it has lost).
4. This is repeated log p times at further levels of the trees.
5. The processor that manages to proceed to the root of T_i becomes the winner and can write into M[i]. Processors which used T_i must then sweep down T_i in the reverse order to reset the cells M'[i,2],...,M'[i,p] to empty.
Simulation of a PRIORITY READ substep is similar:
1. The same sweep-ups of the auxiliary trees are performed in parallel to determine the winners in the groups.
2. The winners will read the values from the cells M'[*,1].
3. During the cleaning sweep-down, the read value is distributed to the losers.

Example

p=7, processors P₂, P₃, and P₇ wish to write into M[5].
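
The sweep-up for a single tree T_i can be traced with the following sketch. The function `priority_winner` is an assumption of this illustration: it pads the tree to the next power of two leaves, stores the internal nodes in heap order (node 1 is the root), and omits the cleaning sweep-down.

```python
def priority_winner(requesters, p):
    """Sweep up one tree T_i; return (winner, levels climbed).

    requesters: 1-based indices of the processors that want to write M[i]
    p:          number of processors (leaves of T_i)
    """
    leaves = 1
    while leaves < p:
        leaves *= 2
    tree = [None] * leaves                      # internal nodes 1..leaves-1, empty
    # Processor k starts at the k-th leaf; leaf nodes are leaves..2*leaves-1.
    node_of = {k: leaves + (k - 1) for k in requesters}
    levels = 0
    width = leaves

    while width > 1:
        # Substep a: every active left child (even node) claims its parent cell.
        for k, node in node_of.items():
            if node % 2 == 0:
                tree[node // 2] = k
        # Substep b: every active right child checks its parent cell.
        survivors = {}
        for k, node in node_of.items():
            parent = node // 2
            if node % 2 == 1:
                if tree[parent] is not None:
                    continue                    # parent occupied: this processor lost
                tree[parent] = k
            survivors[k] = parent
        node_of = survivors
        width //= 2
        levels += 1

    winner = next(iter(node_of), None)          # at most one processor reaches the root
    return winner, levels


print(priority_winner({2, 3, 7}, 7))
```

For the example above, P₂ climbs as a right child at the first level (its parent cell is empty), beats P₃ at the second level, and beats P₇ at the root, so P₂ wins after three levels. In each substep at most one processor touches any given cell, so all accesses are exclusive.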

Lemma

One step of PRIORITY CRCW with p processors and m shared memory cells can be simulated by an EREW PRAM in O(log p) steps with p processors and m+p shared memory cells.

Proof:(Constructive.)

Each PRIORITY processor P_k is simulated by EREW processor P'_k.
Each PRIORITY cell M[i] is simulated by EREW cell M[i].
EREW uses an auxiliary array A of p cells.
If P_k wants to access M[i], processor P'_k writes pair (i,k) into A[k].
If P_k does not want to access any PRIORITY cell, processor P'_k writes (0,k) into A[k].
All p processors sort the array A into lexicographic order using an O(log p)-time parallel sort.

Each P'_k appends to cell A[k] a flag f:

f=0 if the first component of A[k] is either 0 or the same as the first component of A[k-1],

f=1 otherwise.

Further steps differ for simulation of WRITE or READ.
PRIORITY WRITE:
1. Each P'_k reads the triple (i,j,f) from cell A[k] and writes it into A[j].
2. Each P'_k reads the triple (i,k,f) from cell A[k] and writes into M[i] iff f=1.
PRIORITY READ:
1. Each P'_k reads the triple (i,j,f) from cell A[k].
2. If f=1, it reads value v_i from M[i] and overwrites the third component in A[k], the flag f, with v_i.
3. In at most log p steps, this third component is then distributed in subsequent cells of A until it reaches either the end or an element with a different first component.
4. Each P'_k reads the triple (i,j,v_i) from cell A[k] and writes it into A[j].
5. Each P'_k that asked for a READ reads the value v_i from the triple (i,k,v_i) in cell A[k].

Example

Assume p=7, m=4, and

P₁ wants to access M[2],

P₂ wants to access M[4],

P₃ wants to access M[2],

P₄ wants to access M[1],

P₅ wants to access M[4],

P₆ wants to access M[2],

P₇ wants to access no cell at all.

Array A in the first three steps of simulation:

(2,1, )	(4,2, )	(2,3, )	(1,4, )	(4,5, )	(2,6, )	(0,7, )
(0,7, )	(1,4, )	(2,1, )	(2,3, )	(2,6, )	(4,2, )	(4,5, )
(0,7,0)	(1,4,1)	(2,1,1)	(2,3,0)	(2,6,0)	(4,2,1)	(4,5,0)

Array A in simulation of WRITE:

(2,1,1)	(4,2,1)	(2,3,0)	(1,4,1)	(4,5,0)	(2,6,0)	(0,7,0)

Array A in simulation of READ:

(0,7,0)	(1,4,v₁)	(2,1,v₂)	(2,3,0)	(2,6,0)	(4,2,v₄)	(4,5,0)
(0,7,0)	(1,4,v₁)	(2,1,v₂)	(2,3,v₂)	(2,6,0)	(4,2,v₄)	(4,5,v₄)
(0,7,0)	(1,4,v₁)	(2,1,v₂)	(2,3,v₂)	(2,6,v₂)	(4,2,v₄)	(4,5,v₄)
(2,1,v₂)	(4,2,v₄)	(2,3,v₂)	(1,4,v₁)	(4,5,v₄)	(2,6,v₂)	(0,7,0)
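
The arrays of this example can be reproduced with the following sketch. All function names are assumptions of this illustration, Python's sequential `sorted` stands in for the parallel sort, and the value distribution uses distance doubling (distances 1, 2, 4, ...) as one possible way to finish in at most log p steps. `None` stands for "no value".

```python
def sort_and_flag(requests):
    """requests[k-1] is the cell P_k wants (0 = none). Return sorted (i, k, f)."""
    a = sorted((i, k) for k, i in enumerate(requests, start=1))
    flagged = []
    for pos, (i, k) in enumerate(a):
        same_as_prev = pos > 0 and a[pos - 1][0] == i
        flagged.append((i, k, 0 if i == 0 or same_as_prev else 1))
    return flagged


def priority_write(memory, requests, values):
    """values[k-1] is what P_k wants to write. Return array A after step 1."""
    a = [None] * len(requests)
    for i, k, f in sort_and_flag(requests):
        a[k - 1] = (i, k, f)               # step 1: route each triple back to A[k]
    for i, k, f in a:
        if f == 1:                         # step 2: only group leaders write
            memory[i] = values[k - 1]
    return a


def priority_read(memory, requests):
    """Return the value delivered to each P_k (None if it asked for nothing)."""
    triples = sort_and_flag(requests)
    n = len(triples)
    # Group leaders fetch the value from memory.
    vals = [memory[i] if f == 1 else None for i, k, f in triples]
    d = 1
    while d < n:                           # distance doubling within each group
        new = list(vals)
        for pos in range(d, n):
            same_cell = triples[pos][0] == triples[pos - d][0]
            if vals[pos] is None and same_cell:
                new[pos] = vals[pos - d]
        vals = new
        d *= 2
    result = [None] * n
    for (i, k, f), v in zip(triples, vals):
        result[k - 1] = v                  # route the value back to P_k
    return result


requests = [2, 4, 2, 1, 4, 2, 0]           # the example above, P_1..P_7
memory = {1: "v1", 2: "v2", 3: "v3", 4: "v4"}
print(sort_and_flag(requests))
print(priority_read(memory, requests))
```

In the WRITE case only P₁, P₂ and P₄ carry the flag f=1, so they are the processors that write into M[2], M[4] and M[1].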


Conclusion

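Any polylog-time PRAM algorithm is robust with respect to PRAM models: by the simulations above, moving from the strongest PRIORITY CRCW to the weakest EREW costs only an O(log p) factor per step, so an algorithm that runs in polylogarithmic time with polynomially many processors remains polylogarithmic on every submodel.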



Last modified: Fri Jan 23 by Pavel Tvrdik
