---------------------------------------------------------------------
CS 577 (Intro to Algorithms)                Lecture notes: Selection
Shuchi Chawla
---------------------------------------------------------------------

(This document is best viewed with a fixed width font.)

Linear-time Selection
=====================

Sometimes, instead of sorting an entire list of elements, we may only
be interested in (say) what the smallest element is, or what the 10th
smallest element is, and so on. We can easily find the smallest
element, or even the 10th smallest element, in a list in O(n) time,
by just scanning the list and keeping track of the smallest (or 10
smallest) elements seen so far. This is much faster than first
spending O(n log n) time sorting the list and then looking up the
required element in the sorted list.

Can this be done for the k-th smallest element, for arbitrary k? For
example, suppose we want to find the median of a list of elements;
can we do this in linear time? Keeping track of the n/2 smallest
elements would in fact take Omega(n log n) time. (Think about why
this is so.) This is called the selection problem. In class we saw
how to apply a divide and conquer approach similar to quicksort to
solve this problem. (See also Section 13.5 in the book.) The
algorithm works as follows:

Select(L,k)
-----------
1. Pick a pivot p.
2. Partition the list into L< (those elements smaller than p), p,
   and L> (those elements larger than p).
3. If |L<| >= k, return Select(L<, k).
4. If |L<| = k-1, return p.
5. If |L<| <= k-2, return Select(L>, k-|L<|-1).

Let's try out an example. Suppose that we pivot the list using the
first element. For example, if our list were (5,20,13,2,7,14,8,1,10),
then picking 5 as the pivot would divide the list into (2,1) and
(20,13,7,14,8,10). Now, if we are interested in finding the 5th
smallest element in the original list (the median), then, given that
the first part has length 2, we know that the required element lies
in the second part and is, in particular, the 2nd smallest element
in that part. We can then recursively find this 2nd smallest element
in the second part.

How long does this procedure take? It takes O(n) time to divide the
list into two parts, and then at most T(m) time in the recursive
call, where m is the size of the longer of the two parts. If we pick
the pivot uniformly at random, as in Quicksort, then the expected
value of m is no more than 3n/4. This lets us argue that the size of
the list decreases geometrically, in expectation, as we go down the
recursion tree.

Now what if we wanted to run this algorithm without using any
randomness? The tricky part is picking a good pivot in the first
step so that we get a roughly even split of the list L. In other
words, the pivot p should be a "central" element in the list. In
particular, if we pick an "extreme" element as pivot, then in the
worst case m would be n-1. Our recurrence would then look like
T(n) = n + T(n-1), which solves to T(n) = O(n^2). Ideally, it would
be great if we could pick a pivot for which m = n/2. Then our
recurrence would look like T(n) = n + T(n/2). You should verify that
this indeed solves to T(n) = O(n). So our goal now is to pick a
pivot for which the size of the longer part, m, is n/2. But such an
element is, by definition, the median of the list -- this is the
same problem that we are trying to solve in the first place!
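
To make the recursion concrete, here is a minimal sketch of Select in
Python. It follows the five steps above literally and, like the
pseudocode, assumes the elements of L are distinct; the name select,
the 1-indexed k, and the pivot_rule parameter (defaulting to the
first element, as in the example) are illustrative choices, not part
of the algorithm as stated.

    def select(L, k, pivot_rule=lambda L: L[0]):
        """Return the k-th smallest element of L (k is 1-indexed).

        Assumes the elements of L are distinct, as the pseudocode
        does.
        """
        p = pivot_rule(L)                  # step 1: pick a pivot
        smaller = [x for x in L if x < p]  # step 2: partition into L<
        larger = [x for x in L if x > p]   #         ... and L>
        if len(smaller) >= k:              # step 3: answer lies in L<
            return select(smaller, k, pivot_rule)
        if len(smaller) == k - 1:          # step 4: the pivot itself
            return p
        # step 5: answer lies in L>, where it is the (k-|L<|-1)-th
        # smallest, since |L<|+1 elements of L are smaller than all
        # of L>
        return select(larger, k - len(smaller) - 1, pivot_rule)

For instance, select([5,20,13,2,7,14,8,1,10], 5) returns 8, the
median, following exactly the recursion traced in the example above.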
We will fix this issue by allowing ourselves a little leeway --
instead of using the median as a pivot, we will pick a pivot that
partitions the list into *roughly* equal parts, with m at most
7n/10. This is what we call a central pivot.

How do we find a central pivot? Here is an idea that was brought up
in class: suppose we pick 3 elements from the list and use their
median as the pivot. Then we can be sure that the pivot is neither
the largest nor the smallest element. And if we pick the three
elements uniformly at random, we get a more central pivot in
expectation than if we just picked one element at random and used it
as the pivot. We will now expand on this idea, but avoid using any
randomness. This approach and the algorithm below were invented by
Blum, Floyd, Pratt, Rivest and Tarjan in 1972.

Here is how we pick the pivot. We partition the list into n/5 groups
of 5 elements each. We then find the median of each group. Note that
this is a constant time operation per group, since each group has
constant size. This gives us a list of n/5 medians. We then
recursively find the median of this list of medians; call it p.
Finally, we use this median of medians p to pivot the original list.

Before we analyze the time complexity of this algorithm, we claim
that we pick a good pivot -- m is at most 7n/10.

Claim: At least 3n/10 elements are less than or equal to the median
of medians p picked above, and at least 3n/10 elements are greater
than or equal to it.

Proof: p is the median of n/5 medians. Therefore, at least n/10 of
the medians (p itself among them) are less than or equal to p. Each
of these is the median of a group of 5 elements, so it accounts for
3 elements of its group (itself and the 2 below it) that are less
than or equal to p. Therefore, at least 3n/10 elements in total are
less than or equal to p. The same argument applies to elements
greater than or equal to p. It follows that both |L<| and |L>| have
size at most 7n/10, so m <= 7n/10.

Finally, let us consider the running time of this algorithm. We take
O(n) time to find the medians of the n/5 sublists. We then take
T(n/5) time to recursively find the median of medians. This gives us
a pivot. We take another O(n) time to partition the list using this
pivot. Finally, we take at most T(7n/10) time to recursively solve
the selection problem on one of the two parts of the list.
Therefore, we have T(n) = T(n/5) + T(7n/10) + cn for some constant
c.

There are several ways of solving this recurrence. Here, we use the
"recursion tree" method. At the top level, we spend cn time. At the
next level, we spend cn/5 time in the first recursive call and
7cn/10 time in the second, for a total of 9cn/10. Likewise, at the
level below that, we spend (9/10)^2 cn = 81cn/100 time, and so on.
The total time spent is

T(n) = cn + 9cn/10 + 81cn/100 + 729cn/1000 + ...

Note that this is a converging geometric series, with each term a
factor of 9/10 smaller than the last. Even if this process continued
forever, the total would be at most 10cn, dominated by (a constant
times) the first term, and is therefore O(n). This completes our
analysis of the algorithm.

Note that in analyzing the running time of the above algorithm, one
crucial property we used was that n/5 + 7n/10 < n. That is, the sum
of the sizes of the subproblems we solve recursively is strictly
smaller than the size of the original problem. This is important
because it implies that the work done at each level of the recursion
tree is a constant factor smaller than that at the previous level.
Contrast this with mergesort, where the two subproblems have total
size exactly n, so every level of the recursion tree does the same
cn work and the total comes to O(n log n).
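
The pivot rule itself is only a few lines on top of the earlier
sketch. Below is one way to write it in Python, again assuming
distinct elements; the helper name median_of_medians and the plug-in
interface are our own choices, but the grouping into fives and the
recursive call are exactly the steps described above.

    def median_of_medians(L):
        """Pivot rule: the median of the medians of groups of 5."""
        if len(L) <= 5:                  # small list: sort directly
            return sorted(L)[(len(L) - 1) // 2]
        # partition L into groups of 5 (the last may be smaller)
        groups = [L[i:i + 5] for i in range(0, len(L), 5)]
        # the median of each group: constant time per group
        medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]
        # recursively select the median of this list of n/5 medians
        return select(medians, (len(medians) + 1) // 2,
                      pivot_rule=median_of_medians)

Calling select(L, k, pivot_rule=median_of_medians) then gives the
deterministic algorithm analyzed above, with O(n) worst-case running
time.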
This also hints at a metatheorem for recurrences: for constants
c > 0 and a1, a2, ..., ak > 0 such that a1 + ... + ak < 1, the
recurrence

T(n) = cn + T(a1 n) + T(a2 n) + ... + T(ak n)

solves to T(n) = Theta(n). You should prove this theorem as an
exercise.
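
As a quick numerical sanity check (not a proof), the sketch below
tabulates T(n)/n for the selection recurrence with c = 1, taking
T(n) = 1 for n <= 1 and rounding subproblem sizes down; the base
case and the rounding are our own arbitrary (but harmless) choices.
The ratio stays bounded by the constant 10 = 1/(1 - 9/10) predicted
by the geometric series, consistent with T(n) = Theta(n).

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def T(n):
        """Recurrence T(n) = T(n/5) + T(7n/10) + n, with T(n) = 1
        for n <= 1."""
        if n <= 1:
            return 1
        return n + T(n // 5) + T(7 * n // 10)

    for n in [10, 100, 10**3, 10**4, 10**5, 10**6]:
        print(n, T(n) / n)   # ratio stays below 10 for every n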