The contents

Completing the term project is one of the two major requirements for this course. The task is to design a parallel algorithm for the problem you have chosen, to write the code in PVM or MPI-2, and to run it on the local SP-2 machine and/or a local cluster of UNIX workstations.

You can choose from the list of problems I am offering below, or you can propose your own problem in case you are interested in something closer to your research interests. Such proposals will be negotiated case by case.

The proposed list consists mostly of problems with known, simple sequential solutions that are relatively easy to parallelize. More specifically, parallelization will take the form of a coarse-grained divide-and-conquer approach: a large input data set is split uniformly among several processors, which concurrently perform some sequential computation and then combine the local results into a global solution.

The amount of work in each of the proposed projects is meant for one student.

Problems

 N-queens problem
Input data:
1. p=the number of processors
2. integer n>=4
Problem:
The problem is to find at least one placement of n queens on an n-by-n chessboard so that no queen attacks another. For n>=4, there always exists at least one solution.
Output data:
At least one n-tuple r1,r2,...,rn, where ri = row number of the queen in column i.
Sequential solution:
Standard depth-first search in the state space of all n! possible positions of n queens. Obvious optimizations include tricks such as considering only the upper half of rows for the queen in the first column (to eliminate at least some obviously symmetric solutions), so that there are only k=n/2 level-1 configurations.
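The sequential search can be sketched as follows. This is a minimal Python sketch for illustration only (the project itself should be written in C with PVM/MPI); the function name and the 0-based indexing are my own choices:

```python
def solve_n_queens(n):
    """Depth-first search for one placement of n queens.
    rows[i] = row of the queen in column i (0-based)."""
    rows = []

    def safe(r):
        c = len(rows)  # column about to be filled
        return all(rows[i] != r and abs(rows[i] - r) != c - i
                   for i in range(c))

    def dfs():
        if len(rows) == n:
            return True
        # symmetry trick: restrict the first queen to the upper half of rows
        limit = (n + 1) // 2 if not rows else n
        for r in range(limit):
            if safe(r):
                rows.append(r)
                if dfs():
                    return True
                rows.pop()  # backtrack: no conflict-free extension
        return False

    return rows if dfs() else None
```

In the parallel version each level-1 configuration becomes one initially unexpanded state on some processor's stack.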
Parallel algorithm:
Assume a fully connected network of p processors. At the beginning, processor P1 distributes all k=n/2 initial configurations. If k<=p, then the first k processors each receive a stack with one initially unexpanded state; if k>p, then each processor receives k/p unexpanded initial states. From this moment on, the algorithm is fully distributed (there is no master coordinating the other processors). Each processor with a nonempty stack performs the DFS. A state cannot be expanded further if there is no conflict-free position for the queen in the next column, and the algorithm backtracks. Processors that received an empty stack, or that have searched through their subspace without finding a solution, will try to obtain more work (a dynamic load-balancing scheme) using the following algorithm:

Global round-robin work split algorithm:
Each processor maintains an independent variable r, initially set to myID+1. If idle, it tries to find a donor: it sends a request to processor r and increments r modulo p. The target processor may either accept and split its work, or refuse. The scheme must also handle the situation of two or more simultaneous requests arriving at the same time.
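The polling order this produces can be illustrated with a tiny Python sketch (the function name is my own; a real implementation would interleave these requests with nonblocking PVM/MPI receives):

```python
def request_sequence(my_id, p, k):
    """First k donors that an idle processor my_id will poll,
    cycling round-robin through 0..p-1 and skipping itself."""
    seq, r = [], (my_id + 1) % p
    while len(seq) < k:
        if r != my_id:      # never ask yourself for work
            seq.append(r)
        r = (r + 1) % p     # increment the counter modulo p
    return seq
```

Because each processor starts its counter at myID+1, the first requests of different idle processors are spread over different donors.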

The algorithm for splitting the work is based simply on sending a part of the stack to the requesting processor.

Whenever a processor finds a solution, it sends a message to P1, which then terminates the whole computation with a one-to-all broadcast.

Your goal is to investigate the scalability of this problem: for which values of n and p can you obtain good speedup?

Further hints:
There are a couple of details you will have to work out during implementation, namely how to handle the work splitting. Ideally, you should split the work into two halves, but the point is that you do not know the size of the subtree that is to be searched from a given stack state. What is easy here is that you can send a message directly to any processor.

 Strassen matrix multiplication
Input data:
1. p = the number of processors; we assume again a fully connected network.
2. Square n×n matrices A and B.
Problem:
To compute C=A×B using the parallel Strassen method.
Output data:
n×n matrix C
Sequential solution:
The sequential Strassen multiplication algorithm is asymptotically faster than the classical one. It is a recursive 7-ary divide-and-conquer algorithm: the basic idea is that each matrix is split into four n/2×n/2 submatrices, from which 7 pairs of submatrix sums and differences are formed and then multiplied recursively. The recursion goes down to the crossover point where classical multiplication is approximately equally expensive.
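As a reference for debugging your parallel code, the sequential recursion can be sketched in plain Python (power-of-two sizes only; the `threshold` parameter models the crossover point mentioned above and is my own naming):

```python
def strassen(A, B, threshold=2):
    """Strassen multiplication of square matrices given as lists of lists.
    n is assumed to be a power of two for simplicity."""
    n = len(A)
    if n <= threshold:  # crossover: classical multiplication
        return [[sum(A[i][t] * B[t][j] for t in range(n))
                 for j in range(n)] for i in range(n)]
    h = n // 2

    def quad(M):  # split into four h×h quadrants
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

    def add(X, Y): return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y): return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    # the seven recursive products (these are the 7 pairs handed to slaves)
    M1 = strassen(add(A11, A22), add(B11, B22), threshold)
    M2 = strassen(add(A21, A22), B11, threshold)
    M3 = strassen(A11, sub(B12, B22), threshold)
    M4 = strassen(A22, sub(B21, B11), threshold)
    M5 = strassen(add(A11, A12), B22, threshold)
    M6 = strassen(sub(A21, A11), add(B11, B12), threshold)
    M7 = strassen(sub(A12, A22), add(B21, B22), threshold)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    # reassemble the quadrants into the result
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

In the parallel version, the seven recursive calls are exactly what gets handed out to the slave processes.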
Parallel algorithm:
The obvious parallelization leads to a parallel 7-ary divide-and-conquer algorithm: the processor holding the matrices splits each into 4 submatrices and forms 7 pairs of submatrices, which it hands out to 7 slave processes. These may continue recursively (the number of processors multiplies by 7 with each level) until we either run out of processors or reach the level at which a standard multiplication algorithm is better. I have a couple of papers describing some parallel Strassen implementations.
Further hints:
The parallel algorithm is easy to design. The challenge here is to play with tuning the level of parallel recursion depending on p and n. For example, if you have exactly p=4, then a good strategy would be to assign two pairs of submatrices to each of three slaves, keep one pair for the master, and proceed with the sequential Strassen algorithm within processors. If p=8, then it is better to send one pair to each of seven slaves and wait for the results.

Another interesting issue here is the memory requirement of the algorithm. If you sequentially multiply huge matrices that exceed the memory size, the machine swaps and the run time increases significantly. If you manage to keep the matrices distributed across the memories of the slave processors, you might even get superlinear speedup, since the slaves do not swap.

 Fox parallel matrix multiplication
Input data:
1. p = k² = the number of processors, interconnected into a 2-D torus T(k,k)
2. Square n×n matrices A and B.
Problem:
To compute C=A×B using the parallel Fox method.
Output data:
n×n matrix C
Sequential solution:
The basic high-school matrix multiplication: rows times columns.
Parallel algorithm:
The matrices are split into k×k grids of submatrices, which are checkerboard-partitioned among the processors. We require the resulting matrix C to be partitioned in the same way. Hence, initially, Pi,j owns Ai,j and Bi,j and is to compute Ci,j.

The algorithm proceeds as follows:

1. For all i=1,...,k, processor Pi,i broadcasts the diagonal submatrix Ai,i to all processors in row i of the torus.
2. Every processor Pi,j multiplies the submatrix of A received by the broadcast with its resident submatrix of B and accumulates the product into Ci,j.
3. Every processor sends its submatrix of B up with wraparound and receives a new submatrix of B from the processor below, also with wraparound.
4. A new submatrix of A is selected in every row and broadcast rowwise: if Ai,j was selected previously, the next selected one is Ai,(j+1) mod k. Steps 2-4 are repeated until each processor has performed k multiplications.
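Before writing the MPI version, it may help to simulate the block movement sequentially. The Python sketch below (function and variable names are my own choices, not part of the assignment) performs the steps above on a k×k grid of blocks and can be checked against an ordinary matrix product:

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def fox(Ablocks, Bblocks, k):
    """Ablocks[i][j], Bblocks[i][j]: the (n/k)x(n/k) block owned by P(i,j).
    Returns the k×k grid of C blocks."""
    m = len(Ablocks[0][0])
    C = [[[[0] * m for _ in range(m)] for _ in range(k)] for _ in range(k)]
    B = [row[:] for row in Bblocks]            # working copy of the B grid
    for step in range(k):
        for i in range(k):
            sel = Ablocks[i][(i + step) % k]   # block broadcast in row i
            for j in range(k):                 # every P(i,j) accumulates
                C[i][j] = matadd(C[i][j], matmul(sel, B[i][j]))
        # shift B upward with wraparound: P(i,j) receives from P(i+1,j)
        B = [[B[(i + 1) % k][j] for j in range(k)] for i in range(k)]
    return C
```

In the real program, the row broadcast and the upward shift become MPI/PVM communication steps, and each processor holds only its own block.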
Further hints:
This algorithm is fairly easy to implement, except that you must keep the processors synchronized across the phases of the algorithm, and you have to simulate the 2-D torus of processors even though your physical interconnect topology might be different. This implies, for example, implementing the one-to-all broadcast as a bidirectional pipeline in a path of processors.

Another interesting issue here is the memory requirement of the algorithm. If you sequentially multiply huge matrices that exceed the memory size, the machine swaps and the run time increases significantly. If you manage to keep the matrices distributed across the memories of the slave processors, you might even get superlinear speedup, since the slaves do not swap.

 Intersections of line segments
Input data:
1. a,b = the size of a 2-D rectangular area
2. p = the number of processors
3. l = max. length of line segments
4. n = the number of line segments, 10<=n<=50000
5. S = set of randomly generated line segments of length at most l in the a×b rectangular area
Problem:
Compute all pairs of intersecting line segments.
Output data:
A set of triples (line-segment1, line-segment2, intersection point)
Sequential solution:
Sort the line segments along axis x and then scan them left to right, testing pairs of line segments for intersection. The number of pairs to be tested is at most n²/2. It can be decreased by terminating the rightward scan whenever the length of the currently tested line segment is smaller than the distance to the next line segment.
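The core geometric test, whether two segments intersect, can be written with orientation predicates. A minimal Python sketch (function names are mine; it assumes general position, i.e. no collinear endpoints, and a full solution would also compute the intersection point for the output triples):

```python
def orient(p, q, r):
    """Sign of the cross product (q - p) × (r - p)."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_intersect(s1, s2):
    """True iff segments s1=(p1,p2) and s2=(p3,p4) cross (general position):
    the endpoints of each segment lie on opposite sides of the other."""
    p1, p2 = s1
    p3, p4 = s2
    return (orient(p1, p2, p3) != orient(p1, p2, p4) and
            orient(p3, p4, p1) != orient(p3, p4, p2))
```

Using a sign test instead of computing the intersection point first avoids division and is robust for the pair-rejection phase of the scan.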
Parallel algorithm:
Assume that the processors are connected in a linear array. Generate the line segments randomly so that every processor gets n/p of them. Then sort the line segments along axis x by parallel even-odd transposition sort. This establishes the borders between processors in the rectangular area: each processor is assigned a vertical slab (an interval of the x axis) in which all of its n/p line segments start (they can, however, have their end-points in the slab of another processor, not necessarily a neighboring one). Different processors may have slabs of different widths.

Then the processors perform the local intersection tests. Any local line segment which exceeds the local slab and has its second end-point in the slab of another processor must be sent by its owner to those processors that may be interested. Hence every processor sends sets of line segments to the right and receives sets of line segments from processors to its left. One line segment may lie across several slabs, so in general the number of communication steps is data dependent.

At the end, perform an all-to-all reduction in which all processors contribute their local partial sets of intersection points, so that after the reduction every processor knows all the intersections.

Further hints:
Your algorithm should work for any combination of a,b,l,n. If l is small, then there may be no intersections at all. If l is roughly equal to max(a,b), then you may have line segments crossing the slabs of nearly all processors.

If possible, use some simple visualization tool to verify in the development stage that your parallel algorithm solves the problem correctly.

 Visibility of vertical lines
Input data:
1. a,b = the size of a 2-D rectangular area
2. p = the number of processors
3. l = maximal length of a line segment, l<=a
4. n = integer 10<=n<=50000;
5. S = a set of n randomly generated vertical line segments of length at most l in 2-D plane a× b
Problem:
Find all pairs of line segments which can see each other horizontally:
```Initial state:                          Solution:

|                                    |
|    |   |    | 		       |    |...|    |
|    |   |    | 		       |....|   |....|
|        |    | 		       |........|    |
|        |      		       |        |
|      		                |

```
Output data:
The set of all such pairs.
Sequential solution:
Sort the line segments by coordinate x and search from left to right, stopping as soon as no further line segment can be visible because the previous ones have closed off the whole vertical scene.
Parallel algorithm:
Split the input set of line segments uniformly among the processors, OR generate n/p of them randomly on each processor. Sort the line segments by x using a parallel even-odd transposition sort. Each processor will then own a bounded slab of the rectangular area; different processors may have slabs of different widths. The processors first solve the visibility subproblems locally. Then each must determine whether some of its line segments can be visible from the right or from the left; if so, it sends the corresponding line segments to the corresponding neighbors, which proceed similarly. Note that line segments can be visible across the whole area a×b.
Further hints:
If possible, use some simple visualization tool to verify in the development stage that your parallel algorithm solves the problem correctly.

 All nearest neighbors
Input data:
1. a,b = the size of a 2-D rectangular area
2. p = the number of processors interconnected into a linear array
3. n = the number of points, 10<=n<=50000
4. S = set of randomly generated points in the a× b rectangular area
Problem:
For each point, find its closest neighbor.
Output data:
The set of ordered pairs of closest points.
Sequential solution:
Brute force: for each point, compute the distance to every other point and keep the minimum; this takes O(n²) time and suffices as a baseline for comparison.
Parallel algorithm:
Assign n/p points to each processor and sort them by coordinate x using a parallel even-odd transposition sort. Each processor will then own a slab of points and computes the closest-neighbor pairs locally. Then it determines which of its local points might have a closer point in the slab of a neighboring processor (processors know the coordinates of the slab borders) and sends these points there to find out whether this is really the case. A given point can make several hops before it can be concluded that its closest point has been found.
Further hints:
Note that the relation "to be the closest point" is not necessarily symmetric.
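The asymmetry can be seen with a tiny brute-force check (Python, for illustration; `nearest_neighbors` is a name of my own choosing):

```python
from math import dist

def nearest_neighbors(points):
    """Brute-force all-nearest-neighbors:
    maps each point to the closest other point."""
    return {p: min((q for q in points if q != p), key=lambda q: dist(p, q))
            for p in points}

# Asymmetry: with points at x = 0, 1 and 3 on a line, the nearest
# neighbor of (3, 0) is (1, 0), but (1, 0)'s nearest neighbor is (0, 0).
```

So the output pairs must be collected in both directions; you cannot assume that if q is the answer for p, then p is the answer for q.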

Another, more interesting setting of this problem is to have the processors form a 2-D mesh and to split the plane into rectangular areas instead of slabs. This is, however, a bit more difficult to algorithmize.

If possible, use some simple visualization tool to verify in the development stage that your parallel algorithm solves the problem correctly. You may, for example, draw a line between the closest pairs of points; you will get a wonderful night-sky simulation with bizarre constellations.

 Other potential problems
There are other possible problems, but I have not had time to give detailed specifications yet. If you are interested in the following problem sets, let me know and I will give full specifications.
1. Parallel connected components
2. Parallel solitaire game
3. Parallel convex hull
4. Closest pair problem
5. Circle-cover minimization problem

• The project report is due Wed, May 12, 11.59 a.m.
There are three ways to submit the report, in the following order of preference:
1. to email me a PostScript file generated on a Linux or UNIX machine (no MS PostScript, please), either a *.ps or a *.ps.gz.uu file.
2. to bring the printed report to my office.
3. to place the printed report into my mailbox in the mail room.
• Demonstration of your PVM/MPI program.
1. You must demonstrate your program running on a platform of your choice, but your virtual parallel machine should consist of at least two machines.
2. The console machine for the demonstration should preferably be inside the CompSci&Stat building.
3. Demonstrations will take place during Thursday, May 13.
4. The exact time will be set up by email. Plan the best time for your demonstration now and send me your proposals no later than Wed, May 5, so that I can prepare a contention-free schedule for Thursday, May 13.
5. If this day is not acceptable for you, we can try a special arrangement, but preferably before May 13. In that case, make sure to submit the final report to me at least one day before the presentation so that I have time to read it beforehand.

Project report skeleton

1. Your final project report should be written in TeX or LaTeX.
2. Click here for a LaTeX report skeleton, which lists all the sections you might need.
3. The sections in the skeleton may help you organize the report and give you an idea of how your program will be evaluated. You do not have to fill in all the sections, but the more you do, the more useful the report will be for evaluating the project.
4. There are no strict upper or lower bounds on the size of the project report, but preferably it should not be shorter than 4 pages or longer than 30 pages. The rule of thumb is half a page per section of the skeleton, plus all explanatory figures, charts, diagrams, or source-code listings as appendices.