Home page for term projects of CS 8381
Spring 1999
The contents
 Specification of problems
 PVM and xPVM pages
 Deadlines and guidelines
 Project report skeleton
Completing the term project is one of the two major requirements for this
course. The task is to design a parallel algorithm solving the problem
you have chosen, to write the code in PVM or MPI-2, and to run it
on the local SP2 machine and/or a local cluster of UNIX
workstations.
You can choose from the list of problems I am offering below, or you
can propose your own problem if you would prefer to work on something
closer to your research interests. Such proposals will be negotiated
case by case.
The proposed list consists mostly of problems
with known and simple sequential solutions which are relatively easy
to parallelize. More specifically, the parallelization takes the form of
a coarse-grained divide-and-conquer approach: a large input
data set is split uniformly among several processors, which
concurrently perform some sequential computation and then combine the
local results into a global solution.
The amount of work in each of the proposed projects is meant for one
student.
Problems
 Queen problem
 Strassen matrix multiplication
 Fox parallel matrix multiplication
 Intersections of line segments
 Visibility of vertical lines
 All nearest neighbors
 Other potential problems
Queen problem
 Input data:

 p=the number of processors
 integer n>=4
 Problem:
The problem is to find at least one placement of n queens on an n-by-n chessboard so that no queen attacks another. For n>=4, there always exists at least one solution.
 Output data:

At least one n-tuple r_{1},r_{2},...,r_{n}, where r_{i} is the row number of the queen in column i.
 Sequential solution:

Standard depth-first search in the state space of all n! possible placements of n queens. Obvious optimizations include tricks such as considering only the upper half of the rows for the first queen (to eliminate at least some obviously symmetric solutions), so that there are only k=n/2 level-1 configurations.
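A minimal sequential sketch of this search, in Python (the explicit stack of partial placements mirrors the stack used by the parallel version below; the function name and the representation are illustrative choices, not part of the assignment):

```python
def solve_queens(n):
    """Depth-first search for one conflict-free placement of n queens.

    Returns a list r where r[i] is the row of the queen in column i,
    or None if no solution exists (only possible for n = 2, 3)."""
    stack = [[]]              # stack of partial placements, one column per entry
    while stack:
        partial = stack.pop()
        col = len(partial)
        if col == n:
            return partial
        for row in range(n):
            # a new queen conflicts if it shares a row or a diagonal
            if all(row != r and abs(row - r) != col - c
                   for c, r in enumerate(partial)):
                stack.append(partial + [row])
    return None
```

The optimization mentioned above corresponds to seeding the stack with only the first n/2 level-1 placements instead of all n.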
 Parallel algorithm:

Assume a fully connected network of p processors. At the beginning, processor P_{1} distributes the first k=n/2 initial configurations. If k<=p, then the first k processors each receive a stack with one initially unexpanded state; if k>p, then each processor receives k/p unexpanded initial states. From this moment on, the algorithm is fully distributed (there is no master coordinating the other processors). Each processor with a nonempty stack performs the DFS. A state cannot be expanded further if there is no conflict-free position for the next-column queen, and the algorithm backtracks. Processors that have received an empty stack, or that have searched through their subspace without finding a solution, try to get more work (a dynamic load balancing scheme) using the following algorithm:
Global round-robin work split algorithm:
Each processor maintains an independent variable r, initially set to myID+1. When idle, it tries to find a donor: it sends a request to processor r and increments r modulo p. The target processor may either accept and split its work, or refuse. Each processor must also handle two or more requests arriving simultaneously.
The algorithm for splitting the work is based simply on sending a part of the stack to the requesting processor.
Whenever a processor finds a solution, it sends a message to P_{1}, which then terminates the whole computation with a one-to-all broadcast.
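The donor's side of the split, and the round-robin choice of the next processor to ask, might be sketched as follows (an illustrative Python sketch; taking every second stack entry is one plausible splitting heuristic, not a prescribed one):

```python
def split_stack(stack):
    """Donor side of the work split: give away roughly half of the
    unexpanded states.  Taking every second entry mixes shallow states
    (large subtrees) and deep states (small subtrees) into both halves."""
    donated = stack[::2]
    kept = stack[1::2]
    return kept, donated

def next_donor(r, my_id, p):
    """Round-robin choice of the processor to ask for work: skip
    ourselves, then return (target, updated value of r)."""
    while r == my_id:
        r = (r + 1) % p
    return r, (r + 1) % p
```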
Your goal is to investigate the scalability of this problem: for which values of n and p does the parallel search still pay off?
 Further hints:

There are a couple of details you will have to work out during the implementation, namely how to handle the work splitting. Ideally, you would split the work into two halves, but the catch is that you do not know the size of the subtree that remains to be searched below a given stack state. What is easy here is that you can send a message directly to any processor.
Strassen matrix multiplication 
 Input data:

 p = the number of processors; we assume again a fully connected network.
 Square n× n matrices A and B.
 Problem:

To compute C=A× B using the parallel Strassen method.
 Output data:

n× n matrix C
 Sequential solution:

The sequential Strassen multiplication algorithm is asymptotically faster than the classical one.
It is a recursive 7-ary divide-and-conquer algorithm: each matrix is split into four n/2× n/2 submatrices, from which 7 pairs of submatrices are formed by additions and subtractions; these pairs are then multiplied recursively. The recursion goes down to the crossover point where classical multiplication is approximately equally expensive.
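For reference, the seven products and the recombination at one level of the recursion look as follows (a Python sketch on scalars, i.e. a single recursion level applied to 2 × 2 matrices; on larger matrices the additions and multiplications below act on submatrix blocks, the multiplications becoming recursive calls):

```python
def strassen_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    """One level of Strassen's recursion: 7 multiplications instead of 8."""
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # recombine the seven products into the four blocks of C
    c11 = m1 + m4 - m5 + m7
    c12 = m3 + m5
    c21 = m2 + m4
    c22 = m1 - m2 + m3 + m6
    return c11, c12, c21, c22
```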
 Parallel algorithm:

An obvious parallelization leads to a parallel 7-ary divide-and-conquer algorithm: the
processor holding the matrices splits each into 4 submatrices, forms 7
pairs of submatrices, and hands them out to 7 slave processes, which may continue recursively (the number of processors multiplies by 7 with each level) up to the point when we either run out of processors or reach the level at which a standard multiplication algorithm is better. I have a couple of papers describing some parallel Strassen implementations.
 Further hints:

The parallel algorithm is easy to design. The challenge here is to tune the depth of the parallel recursion depending on p and n. For example, if you have exactly p=4, then a good strategy is for the root processor to assign two pairs of submatrices to each of the other three processors, keep one pair for itself, and proceed with the sequential Strassen algorithm within each processor. If p=8, then it is better to send one pair to each of the seven slaves and wait for the results.
Another interesting issue here is the memory requirement of the algorithm. If you sequentially multiply huge matrices that exceed the memory size, the machine swaps and the run time increases significantly. If you manage to keep the same matrices distributed across the memories of the slave processors, you might even get superlinear speedup, since the slaves do not swap.
Fox parallel matrix multiplication 
 Input data:

 p=k^{2}= the number of processors, interconnected into 2D torus T(k,k)
 Square n× n matrices A and B.
 Problem:

To compute C=A× B using the parallel Fox method.
 Output data:

n× n matrix C
 Sequential solution:

The basic high-school matrix multiplication, rows times columns.
 Parallel algorithm:

The matrices are split into k× k blocks of submatrices which are checkerboard-partitioned among the processors. We require the resulting matrix C to be partitioned in the same way. Hence, initially, P_{i,j} owns A_{i,j} and B_{i,j} and is to compute C_{i,j}.
The algorithm proceeds as follows:
 For all i=1,...,k, processor P_{i,i} broadcasts the diagonal submatrix A_{i,i} to all processors in row i of the torus.
 Every processor P_{i,j} multiplies the submatrix A received by the broadcast with its resident submatrix B and accumulates the product into C_{i,j}.
 Every processor sends its submatrix B up with wraparound and receives a new submatrix B from the processor below, with wraparound.
 A new submatrix of A is selected in every row and broadcast row-wise: if A_{i,j} was the previously selected submatrix, then the next selected one is A_{i,(j+1) mod k}. The multiply, shift, and select steps are repeated until every processor has performed k multiplications.
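The index arithmetic of the rounds above can be checked with a sequential simulation (a Python sketch with n = k, so that each "submatrix" is a single element; the block version is identical, with submatrix products in place of scalar products):

```python
def fox_multiply(A, B, k):
    """Sequentially simulate Fox's algorithm on a k-by-k grid with
    1-by-1 'submatrices'.  In round t, processor (i, j) uses the
    row-broadcast element A[i][(i + t) % k] and its current B element,
    which after t upward shifts with wraparound is B[(i + t) % k][j]."""
    C = [[0] * k for _ in range(k)]
    for t in range(k):                        # k rounds in total
        for i in range(k):
            for j in range(k):
                s = (i + t) % k               # selected column of A in row i
                C[i][j] += A[i][s] * B[s][j]  # multiply and accumulate
    return C
```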

 Further hints:

This algorithm is fairly easy to implement, except that you must keep the processors synchronized across the phases of the algorithm, and you are required to simulate the 2D torus of processors even though your physical interconnect topology might be different. This implies, for example, implementing the one-to-all broadcast as a bidirectional pipeline along a path of processors.
Another interesting issue here is the memory requirement of the algorithm. If you sequentially multiply huge matrices that exceed the memory size, the machine swaps and the run time increases significantly. If you manage to keep the same matrices distributed across the memories of the slave processors, you might even get superlinear speedup, since the slaves do not swap.
Intersections of line segments 
 Input data:

 a,b = the size of a 2D rectangular area
 p = the number of processors
 l = max. length of line segments
 n = the number of line segments, 10<=n<=50000
 S = set of randomly generated line segments of length at most l in the a× b rectangular area
 Problem:

Compute all pairs of intersecting line segments.
 Output data:

A set of triples (linesegment_{1}, linesegment_{2}, intersection point)
 Sequential solution:

Sort the line segments along the x axis and then scan them left to right, testing pairs of line segments for intersection. The number of pairs to be tested is at most n^{2}/2. It can be decreased by ending the rightward scan whenever the length of the currently tested line segment is smaller than the distance to the next line segment.
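The scan can be sketched as follows (an illustrative Python sketch; it reports the intersecting pairs only, omits the intersection points required by the output specification, and ignores degenerate cases such as touching endpoints or collinear overlap):

```python
def orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_intersect(s, t):
    """Intersection test via orientation signs (degenerate cases ignored)."""
    a, b = s
    c, d = t
    return (orient(a, b, c) != orient(a, b, d) and
            orient(c, d, a) != orient(c, d, b))

def all_intersections(segments):
    """Scan segments sorted by left endpoint; stop the inner loop once the
    next segment starts to the right of the current one's right end."""
    segs = sorted(segments, key=lambda s: min(s[0][0], s[1][0]))
    pairs = []
    for i, s in enumerate(segs):
        right = max(s[0][0], s[1][0])
        for t in segs[i + 1:]:
            if min(t[0][0], t[1][0]) > right:
                break              # later segments start even further right
            if segments_intersect(s, t):
                pairs.append((s, t))
    return pairs
```

Computing the actual intersection point for each reported pair is a straightforward extension.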
 Parallel algorithm:

Assume that the processors are connected in a linear array. Generate the line segments randomly so that every processor gets n/p of them. Then sort the line segments along the x axis by a parallel even-odd transposition sort. This establishes the borders between processors in the rectangular area: each processor is assigned a slab of it in which all of its n/p line segments start (they can, however, have their endpoints in the area of another processor, not necessarily a neighboring one). Different processors may have slabs of different widths.
Then the processors perform the local intersection tests. Any local line segment which exceeds the local subarea and has its second endpoint in the slab of another processor must be sent by its owner to the processors that may be interested. Hence every processor sends a set of line segments to the right and receives a set of line segments from the processors to its left. One line segment may lie across several slabs, so in general the number of communication steps is data dependent.
Perform an all-to-all reduction at the end, in which all processors contribute their local partial sets of intersection points, so that after the reduction every processor knows all the intersections.
 Further hints:

Your algorithm should work for any combination of a, b, l, and n. If l is small, there may be no intersections at all. If l is roughly equal to max(a,b), then you may have line segments crossing the slabs of nearly all processors.
If possible, use some simple visualization tool to verify in the development stage that your parallel algorithm solves the problem correctly.
Visibility of vertical lines 
 Input data:

 a,b = the size of a 2D rectangular area
 p = the number of processors
 l = maximal length of a line segment, l<=a
 n = integer 10<=n<=50000;
 S = a set of n randomly generated vertical line segments of length at most l in 2D plane a× b
 Problem:

Find all pairs of line segments which can see each other horizontally.
(Figure: an example initial state of vertical line segments and the corresponding solution.)
 Output data:

The set of all such pairs.
 Sequential solution:

Sort the line segments by their x coordinate and scan from left to right, stopping as soon as no further line segment can be visible because the previous ones have closed off the whole vertical scene.
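A brute-force reference version, useful for checking the parallel program on small inputs, might look like this (an illustrative Python sketch; by the convention chosen here, visibility requires an unblocked horizontal strip of positive height):

```python
def visible_pairs(segments):
    """segments: list of (x, ylow, yhigh) vertical segments, distinct x.
    A pair sees each other if some horizontal strip of positive height
    meets both segments and is not fully covered by segments strictly
    between them in x."""
    segs = sorted(segments)
    pairs = []
    for i in range(len(segs)):
        xi, li, hi = segs[i]
        for j in range(i + 1, len(segs)):
            xj, lj, hj = segs[j]
            lo, up = max(li, lj), min(hi, hj)
            if lo >= up:
                continue                      # no common y range
            # y-intervals of segments strictly between, clipped and sorted
            blockers = sorted(
                (max(l, lo), min(h, up))
                for x, l, h in segs[i + 1:j]
                if xi < x < xj and max(l, lo) < min(h, up))
            covered = lo
            visible = False
            for bl, bh in blockers:
                if bl > covered:
                    visible = True            # uncovered gap: they see each other
                    break
                covered = max(covered, bh)
            if visible or covered < up:
                pairs.append((segs[i], segs[j]))
    return pairs
```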
 Parallel algorithm:

Split the input set of line segments uniformly among the processors, or generate n/p of them randomly on each processor. Using a parallel even-odd transposition sort, sort the line segments by x. Each processor will then own a bounded slab of the rectangular area; different processors may have slabs of different widths. The processors first solve the visibility subproblems locally. Then they must determine whether some of their line segments can be visible from the right or from the left; if so, they send the corresponding line segments to the corresponding neighbors, which proceed similarly. Note that line segments can be visible across the whole area a× b.
 Further hints:

If possible, use some simple visualization tool to verify in the development stage that your parallel algorithm solves the problem correctly.
All nearest neighbors
 Input data:

 a,b = the size of a 2D rectangular area
 p = the number of processors interconnected into a linear array
 n = the number of points, 10<=n<=50000
 S = set of randomly generated points in the a× b rectangular area
 Problem:

For each point, find its closest point (nearest neighbor) in S.
 Output data:

The set of ordered pairs of closest points.
 Sequential solution:

Brute force: compare every point with every other point, using O(n^{2}) distance computations in total.
 Parallel algorithm:

Assign n/p points to each processor and sort them by x coordinate using a parallel even-odd transposition sort. Each processor will then own a slab of points and computes the closest-neighbor pairs locally. Then it determines which of its local points might have a closer point in the slab of a neighboring processor (processors know the coordinates of the slab borders) and sends those points there to find out whether this is really the case. A given point may make several hops before it can be concluded that its closest point has been found.
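A brute-force reference version for checking the parallel program on small inputs might look like this (an illustrative Python sketch; `math.dist` requires Python 3.8+):

```python
import math

def all_nearest_neighbors(points):
    """For every point, return the index of its closest other point.
    O(n^2), intended only as a correctness reference."""
    result = []
    for i, p in enumerate(points):
        best, best_d = None, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)       # Euclidean distance
            if d < best_d:
                best, best_d = j, d
        result.append(best)
    return result
```

For the points (0,0), (1,0), (5,0), the nearest neighbor of (5,0) is (1,0), but the nearest neighbor of (1,0) is (0,0): the relation is not symmetric.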
 Further hints:

Note that the relation "to be the closest point" is not necessarily symmetric.
Another, more interesting, setting of this problem is to have the processors form a 2D mesh and to split the plane into rectangular areas instead of slabs. This is, however, a bit more difficult to turn into an algorithm.
If possible, use some simple visualization tool to verify in the development stage that your parallel algorithm solves the problem correctly. You may, for example, draw a line between the closest pairs of points; you will get a wonderful night-sky simulation with bizarre constellations.
There are other possible problems, but I have not had time to give detailed specifications yet. If you are interested in one of the following problem sets, let me know and I will give a full specification.
 Parallel connected components
 Parallel solitaire game
 Parallel convex hull
 Closest pair problem
 Circlecover minimization problem
Deadlines and guidelines
 The project report due is Wed, May 12, 11.59 a.m.
There are three ways to submit the report, in the following order of preference:
 email me a PostScript file generated on a Linux or Unix machine (no MS PostScript, please), either a *.ps or a *.ps.gz.uu file;
 bring the printed report to my office;
 place the printed report into my mailbox in the mail room.
 Demonstration of your PVM/MPI program.
 You must demonstrate your program running on a platform of your choice, but your virtual parallel machine should consist of at least two machines.
 The console machine for the demonstration should preferably be inside the CompSci & Stat building.
 Demonstrations will take place during Thursday, May 13.
 The exact times will be set up by email. Plan the best time for your demonstration now and send me your proposed times no later than Wed, May 5, so that I can prepare a contention-free schedule for Thursday, May 13.
 If this day is not acceptable for you, we can try a special arrangement, preferably before May 13. In that case, make sure to submit the final report to me at least one day before the presentation so that I have time to read it.
Project report skeleton
 Your final project report should be written in TeX or LaTeX.
 Click here for a LaTeX report skeleton,
which lists all the sections you might need.
 The sections in the skeleton may help you organize the report
and give you an idea of how your program will be evaluated. You do not
have to fill in all of the sections, but the more you do, the more useful
the report will be for the evaluation of the project.
 There are no strict upper or lower bounds on the size of the
project report, but preferably it should not be shorter than 4 pages
or longer than 30 pages. A rule of thumb is half a page per
section in the skeleton, plus all explanatory figures, charts, diagrams,
or source code listings as appendices.