---------------------------------------------------------------------
CS 577 (Intro to Algorithms)
Lecture notes: Sorting and selection
Shuchi Chawla
---------------------------------------------------------------------

1. Lower bound on Comparison-based Sorting
==========================================

We saw in class that the mergesort algorithm for sorting takes time
O(n log n) on lists of size n. There are in fact several sorting
algorithms that achieve this same time bound, for example, quicksort
and heapsort. Is it possible to do any better? We will now see that
any algorithm that is only allowed to compare pairs of elements in the
list must take time at least Omega(n log n) to sort the list in the
worst case. Remarkably, this statement holds for *all* comparison-based
algorithms -- those that we know, those that we don't, and indeed those
that have not been invented yet. (Note that one can in fact construct
faster algorithms by exploiting special properties of the data, for
example, if all of the elements are "small" integers.)

How can we go about proving this? Note that we cannot assume anything
about the algorithm, such as that it splits the input as in mergesort
or quicksort. Instead, we will think of the algorithm as playing a
game of 20 questions. Precisely, let's think of a comparison-based
algorithm as a decision tree. Let n be fixed and suppose that the
algorithm starts by comparing (say) the first element in the list with
the tenth element. The root of the decision tree is labeled by this
pair -- 1 and 10. Assuming that all the elements are distinct (this
really doesn't affect the analysis much), there are just two possible
ways in which this comparison can turn out: element 1 is either larger
or smaller than element 10. The root therefore has two children, each
representing the future actions of the algorithm depending on how the
comparison turns out. Likewise, at the left child of the root, the
algorithm makes a comparison between some pair of elements, and the
left and right subtrees of this node represent the future actions of
the algorithm depending on how that comparison turns out. The leaves
of the decision tree represent the outcome of the algorithm -- after a
number of comparisons, the algorithm decides what the right ordering
of the elements should be.

For example, suppose that the list contains three elements A, B and C.
Then the following is a valid algorithm for sorting the list, displayed
in the form of a decision tree (each leaf is labeled with the ordering
of A, B, C from largest to smallest):

                          A>B?
                   Yes  /      \  No
                     A>C?       B>C?
               Yes /    \ No   Yes /   \ No
                B>C?     CAB    A>C?    CBA
            Yes /  \ No      Yes /  \ No
             ABC    ACB       BAC    BCA

This tree has 6 leaves, one for each possible ordering of the three
elements. In order for an algorithm to be correct, the corresponding
decision tree must have at least n! leaves, one for each possible
ordering of the list. The worst-case time complexity of the algorithm
is simply the largest number of comparisons it makes on any list. In
terms of the corresponding decision tree, this is the length of the
longest path from the root to any leaf, i.e., the height (depth) of
the tree. A binary tree with n! leaves must have a height of at least
log n!. Therefore, we get a log n! lower bound on the running time of
any comparison-based sorting algorithm.

To get a better handle on the expression log n!, note that n! < n^n,
but also n! > (n/2)^(n/2), since the n/2 largest factors in the product
n! = n (n-1) ... 2 1 are each at least n/2. So,

    log n! > (n/2) log (n/2) = Omega(n log n).
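To make the decision-tree view concrete, here is a small Python sketch
of the three-element tree above (the notes themselves contain no code;
the function name and list-based output are our own choices). Each
"if" corresponds to an internal node of the tree, and each "return"
corresponds to a leaf, listing the elements from largest to smallest.

    def sort3(a, b, c):
        """Sort three distinct values using the decision tree above."""
        if a > b:                      # root: A>B?
            if a > c:                  # left child: A>C?
                if b > c:              # B>C?
                    return [a, b, c]   # leaf ABC
                else:
                    return [a, c, b]   # leaf ACB
            else:
                return [c, a, b]       # leaf CAB
        else:
            if b > c:                  # right child: B>C?
                if a > c:              # A>C?
                    return [b, a, c]   # leaf BAC
                else:
                    return [b, c, a]   # leaf BCA
            else:
                return [c, b, a]       # leaf CBA

For instance, sort3(2, 7, 5) follows the path No, Yes, No, ends at the
leaf BCA, and returns [7, 5, 2]. Note that the worst-case number of
comparisons is 3, which matches ceil(log 3!) = 3.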
----------------------------------------------------------------------

2. Linear-time Selection
========================

Sometimes, instead of sorting an entire list of elements, we may only
be interested in (say) what the smallest element is, or what the 10th
smallest element is, and so on. We can easily find the smallest
element, or even the 10th smallest element, in O(n) time by just
scanning the list and keeping track of the smallest (or 10 smallest)
elements seen so far. This is much faster than first spending
O(n log n) time sorting the list and then looking up the required
element in the sorted list. Can this be done for the k-th smallest
element, for arbitrary k? For example, suppose we want to find the
median of a list of elements; can we do this in linear time? Keeping
track of the n/2 smallest elements seen so far no longer works -- done
naively, it takes O(n^2) time. This is called the selection problem.
Although the algorithm we describe below works for any k, we focus on
the special case of finding the median. Also, to avoid confusion, the
median of an odd-length list is defined to be the (n+1)/2-th smallest
element, whereas that of an even-length list is defined to be the
n/2-th smallest element.

Let us try to apply divide and conquer to this problem in the same way
as we did for mergesort. Suppose that we divide the list into two
equal halves and find the median of each half in T(n/2) time. Given
these two medians, how do we find the median of the entire list? If
the two medians were equal, then that same value would also be the
median of the entire list. But if they are not equal, say the first is
smaller than the second, then the true median lies among the elements
in the first half that are larger than the first median and the
elements in the second half that are smaller than the second median.
We could recursively proceed to find the median of this smaller list
of elements. But this "combine" step takes at least linear time, so
our overall running time would at best be given by a recurrence of the
form T(n) = 2 T(n/2) + n. This again leads to an O(n log n) running
time, and we are doing no better than sorting. We could hope for an
improvement if, instead of recursing on both halves, we only had to
recurse on one of them.

This leads us to our second idea, based on the quicksort algorithm.
Quicksort sorts a given list of elements by first picking an
appropriate "pivot" element. It then divides the list into two parts
-- one consisting of all elements smaller than the pivot and the other
consisting of all elements larger than the pivot. It then recursively
sorts these parts and outputs the sorted first part, followed by the
pivot, followed by the sorted second part. We will analyze a
randomized version of this algorithm later in the course. For now,
let us try to apply this pivoting idea to the selection problem.

Suppose that we pivot the list on its first element. For example, if
our list is (5,20,13,2,7,14,8,1,10), then picking 5 as the pivot
divides the list into (2,1) and (20,13,7,14,8,10). Now, if we are
interested in finding the 5th smallest element in the original list
(the median), then, given that the first part has length 2, we know
that the required element lies in the second part and is, in
particular, the 2nd smallest element of that part. We can recursively
find this 2nd smallest element in the second part. How long does this
procedure take? It takes O(n) time to divide the list into two parts,
and then T(m) time for the recursive call, where m is the size of the
part we recurse on -- in the worst case, the longer of the two parts.
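Here is a short Python sketch of this pivot-and-recurse procedure (the
function name and the list-comprehension partitioning are our own
choices, and, as in the notes, distinct elements are assumed; its
worst-case behavior is analyzed next).

    def quickselect(lst, k):
        """Return the k-th smallest element of lst (k = 1 is the minimum).

        Pivot on the first element, then recurse into whichever part
        contains the k-th smallest element. Assumes distinct elements.
        """
        pivot = lst[0]
        smaller = [x for x in lst if x < pivot]
        larger = [x for x in lst if x > pivot]

        if k <= len(smaller):                 # answer lies in the first part
            return quickselect(smaller, k)
        elif k == len(smaller) + 1:           # the pivot itself is the answer
            return pivot
        else:                                 # answer lies in the second part
            return quickselect(larger, k - len(smaller) - 1)

On the example above, quickselect([5, 20, 13, 2, 7, 14, 8, 1, 10], 5)
partitions on 5, recurses into the second part to find its 2nd
smallest element, and eventually returns 8, the median.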
What is a good estimate of this value m? In the worst case, we may
pick the smallest element in the list as the pivot, in which case m
would be n-1. Our recurrence would then look like T(n) = n + T(n-1),
which solves to T(n) = O(n^2). So we have not really made any
progress. Ideally, we would like to pick a pivot for which m = n/2.
Then our recurrence would look like T(n) = n + T(n/2). You should
verify that this indeed solves to T(n) = O(n). So our goal is to pick
a pivot for which the size of the longer part, m, is n/2. But such an
element is, by definition, the median of the list -- the very problem
we are trying to solve in the first place! We will fix this issue by
allowing ourselves a little leeway: instead of using the median as a
pivot, we will pick a pivot that partitions the list into *roughly*
equal parts, with m at most 7n/10. This idea and the algorithm below
were invented by Blum, Floyd, Pratt, Rivest and Tarjan in 1972.

Here is how we pick the pivot. We partition the list into n/5 groups
of 5 elements each. We then find the median of each group. Note that
this is a constant-time operation per group, since each group has
constant size. This gives us a list of n/5 medians. We then
recursively find the median of this list of medians; call it p. We
use this median of medians p to pivot the original list. Before we
analyze the time complexity of this algorithm, we claim that p is a
good pivot -- m is at most 7n/10.

Claim: At least 3n/10 elements are less than or equal to p, and at
least 3n/10 elements are greater than or equal to p. In particular,
each part of the partition has size at most 7n/10.

Proof: p is the median of the n/5 group medians. Therefore, at least
n/10 of the medians (including p itself) are less than or equal to p.
Each of these is the median of a group of 5 elements, so it accounts
for 3 elements of its group (itself and the two smaller elements of
its group) that are less than or equal to p. Therefore, at least
3n/10 elements in total are less than or equal to p. The same
argument applies to elements greater than or equal to p.

Finally, let us consider the running time of this algorithm. We take
O(n) time to find the medians of the n/5 groups. We then take T(n/5)
time to recursively find the median of medians. This gives us a
pivot. We then take O(n) time to partition the list using this pivot.
Finally, we take at most T(7n/10) time to recursively solve the
selection problem on one of the parts of the list. Therefore, we have

    T(n) = T(n/5) + T(7n/10) + cn

for some constant c. There are several ways of solving this
recurrence. Here, we use the "recursion tree" method. At the top
level, we spend cn time. At the next level, we spend cn/5 time for
the first recursive call and 7cn/10 time for the second, for a total
of 9cn/10. Likewise, at the next level we spend (9/10)^2 cn = 81cn/100
time, and so on. Our total time therefore looks like

    T(n) = cn + 9cn/10 + 81cn/100 + 729cn/1000 + ...

This is a geometric series in which each term is a factor 9/10 smaller
than the last. Even if this process continued forever, the series
would sum to at most cn / (1 - 9/10) = 10cn. The total time is
therefore dominated by the first term and is O(n). This completes our
analysis of the algorithm.
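Putting the pieces together, here is a minimal Python sketch of the
median-of-medians algorithm. This is only an illustration under the
notes' assumption of distinct elements; the function names, the
base-case cutoff of 5, and the handling of a last, possibly shorter,
group are our own choices.

    def select(lst, k):
        """Return the k-th smallest element of lst (k = 1 is the minimum),
        using the median-of-medians pivot. Assumes distinct elements."""
        if len(lst) <= 5:                  # base case: constant size, sort directly
            return sorted(lst)[k - 1]

        # 1. Split into groups of 5 and take the median of each group.
        groups = [lst[i:i + 5] for i in range(0, len(lst), 5)]
        medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]

        # 2. Recursively find the median of the medians; use it as the pivot p.
        p = select(medians, (len(medians) + 1) // 2)

        # 3. Partition around p and recurse into the part containing the answer.
        smaller = [x for x in lst if x < p]
        larger = [x for x in lst if x > p]
        if k <= len(smaller):
            return select(smaller, k)
        elif k == len(smaller) + 1:
            return p
        else:
            return select(larger, k - len(smaller) - 1)

    def median(lst):
        """The median as defined in the notes: the (n+1)/2-th smallest
        element for odd n, and the n/2-th smallest for even n."""
        n = len(lst)
        return select(lst, (n + 1) // 2 if n % 2 == 1 else n // 2)

For example, median([5, 20, 13, 2, 7, 14, 8, 1, 10]) returns 8.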
A few remarks are in order.

1. In analyzing the running time of the above algorithm, one crucial
property that we used was that n/5 + 7n/10 < n. That is, the sum of
the sizes of the subproblems we solved recursively was strictly
smaller than the size of the original problem. This turned out to be
important because it implied that the total work done at each level of
the recursion is a constant factor smaller than that at the previous
level. Contrast this with mergesort, where the sizes of the two
subproblems sum to exactly n and every level of the recursion does
roughly the same amount of work. This also hints at a metatheorem for
recurrences: for constants c and a1, a2, ..., ak such that
a1 + ... + ak < 1, the recurrence

    T(n) = cn + T(a1 n) + T(a2 n) + ... + T(ak n)

solves to T(n) = Theta(n). You should prove this theorem as an
exercise; the short numerical check after these remarks illustrates
the behavior but is not a proof.

2. What is a lower bound for the selection problem? Note that the
number of different possible outputs of the algorithm is n (each
element in the list could be the median). This implies that any
algorithm in the comparison-based model must take time Omega(log n).
However, note also that any algorithm must read the entire list before
determining the median (think about why this is the case). Therefore,
another (better!) lower bound on finding the median is Omega(n). Our
algorithm above achieves the best possible running time (within
constant factors).
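As promised in remark 1, here is a small Python check that evaluates
the selection recurrence T(n) = n + T(n/5) + T(7n/10) numerically (the
base case and cutoff are arbitrary choices of ours). The ratio T(n)/n
stays bounded below 10, consistent with the geometric-series bound and
with T(n) = Theta(n).

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def T(n):
        """Evaluate T(n) = n + T(n/5) + T(7n/10) with T(n) = 1 for n < 5.

        This just evaluates the recurrence; it does not run the algorithm.
        """
        if n < 5:
            return 1
        return n + T(n // 5) + T(7 * n // 10)

    for n in [10**3, 10**5, 10**7]:
        print(n, T(n) / n)   # the ratio stays below 10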