---------------------------------------------------------------------
CS 577 (Intro to Algorithms)
Lecture notes: Selection Shuchi Chawla
---------------------------------------------------------------------
(This document is best viewed with a fixed width font.)
Linear-time Selection
=====================
Sometimes, instead of sorting an entire list of elements, we may only
be interested in (say) what the smallest element is, or what the 10th
smallest element is, and so on. We can easily find the smallest
element, even the 10th smallest element, in a list in O(n) time, by
just scanning the list and keeping track of the smallest (10 smallest)
elements so far. This is much faster than first spending O(n log n)
time sorting the list and then looking up the required element in the
sorted list. Formally we are given an unsorted list of numbers and a
target k and we want to find the k-th smallest element. This is called
the selection problem.
Can we solve this problem in linear, i.e., O(n), time for any k? What
about k=n/2? Keeping track of the n/2 smallest elements seen so far no
longer gives a linear-time algorithm; done naively, it takes O(n^2)
time. We will present a divide & conquer based algorithm that takes
only O(n) time.
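The scanning approach described above can be sketched as follows. This
is only an illustration under our own naming (kth_smallest_by_scan is
not from the lecture). It keeps the k smallest elements seen so far in
a sorted list, so each new element may cost O(k) work, for O(nk) time
overall; with k = n/2 that is the O(n^2) bound mentioned above.

```python
import bisect

def kth_smallest_by_scan(items, k):
    """Return the k-th smallest element (1-indexed) using one scan.

    Keeps a sorted list of the k smallest elements seen so far.
    Each insertion may shift O(k) entries, so the total is O(n * k).
    """
    if not 1 <= k <= len(items):
        raise ValueError("k must be between 1 and len(items)")
    smallest = []  # sorted, never longer than k
    for x in items:
        if len(smallest) < k:
            bisect.insort(smallest, x)
        elif x < smallest[-1]:
            bisect.insort(smallest, x)
            smallest.pop()  # drop the largest to keep exactly k elements
    return smallest[-1]
```

For small constant k (such as k = 10) this is linear in n; the cost
only becomes a problem when k grows with n.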
As with sorting and counting inversions, we could start by
splitting the list into two parts, performing some amount of work, and
then recursing to find the k'-th smallest element in one or both parts
for some appropriately chosen k'. The key to achieving linear time is
to obtain the following kind of recurrence for running time:
T(n) = T(n/2) + O(n).
Solve this recurrence to make sure you get T(n) = O(n).
So, our new goal is to split the list into two sublists so that after
O(n) amount of work we only need to recurse in one of the two
sublists. In particular, after a linear amount of work, we should be
able to determine which of the two sublists contains the k-th smallest
element. We will use a "pivoting" approach similar to what is used in
the quicksort algorithm. Recall that the quicksort algorithm sorts a
given list of elements by first picking an appropriate pivot. It
then divides the list into two parts -- one consisting of all elements
smaller than the pivot and the other consisting of all elements larger
than the pivot. It then recursively sorts these parts and outputs the
sorted first part followed by the pivot followed by the sorted second
part. We will analyze a randomized version of this algorithm later on
in the course.
For now, let us try to apply this pivoting idea to the selection
problem. Suppose that we pivot the list using the first element. For
example, if our list was (5,20,13,2,7,14,8,1,10), then picking 5 as
the pivot would divide the list into (2,1) and (20,13,7,14,8,10). Now,
if we are interested in finding the 5th smallest element in the
original list (the median), then, given that the first part is of
length 2, we know that the required element lies in the second part
and is, in particular, the 2nd smallest element in that part. Now we
can recursively find this 2nd smallest element in the second part.
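The procedure just described can be sketched in a few lines. This is a
minimal illustration, not the final algorithm; the function name
select_first_pivot is ours. Elements equal to the pivot are sent to
the second part so that duplicates are also handled.

```python
def select_first_pivot(items, k):
    """Return the k-th smallest element (1-indexed), pivoting on items[0]."""
    if not 1 <= k <= len(items):
        raise ValueError("k must be between 1 and len(items)")
    pivot = items[0]
    rest = items[1:]
    smaller = [x for x in rest if x < pivot]    # first part
    larger = [x for x in rest if x >= pivot]    # second part
    if k <= len(smaller):
        return select_first_pivot(smaller, k)
    if k == len(smaller) + 1:
        return pivot
    # Skip the first part and the pivot itself.
    return select_first_pivot(larger, k - len(smaller) - 1)

example = [5, 20, 13, 2, 7, 14, 8, 1, 10]
print(select_first_pivot(example, 5))  # the median of the example list
```

On the example list the first call recurses into the second part with
k = 5 - 2 - 1 = 2, matching the reasoning above.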
How long does this procedure take? It takes n time to divide the list
into two parts, and then T(m) time in the recursive call, where m is
the size of the part we recurse on. What is a good estimate on this
value m? Well, in the worst case, we may pick the smallest element in
the list as the pivot, in which case m could be n-1. Our recurrence
would then look like T(n) = n + T(n-1). This solves to T(n) =
O(n^2). So we have not really made any progress. Ideally, it would be
great if we could pick a pivot for which m = n/2. Then, our recurrence
would look like T(n) = n + T(n/2). You should verify that this indeed
solves to T(n) = O(n).
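For reference, both recurrences can be unrolled directly. The sketch
below assumes T(1) = 1 and, in the second line, that n is a power of
2; these base-case choices are ours and affect only the constants.

```latex
\begin{align*}
T(n) &= n + T(n-1) = n + (n-1) + \dots + 2 + 1 = \frac{n(n+1)}{2} = O(n^2), \\
T(n) &= n + T(n/2) = n + \frac{n}{2} + \frac{n}{4} + \dots + 1 = 2n - 1 = O(n).
\end{align*}
```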
So our goal now is to pick a pivot for which the size of the longer
part m is n/2. But such an element is the median of the list by
definition -- this is the same problem that we are trying to solve in
the first place! We will fix this issue by allowing ourselves a little
leeway -- instead of using the median as a pivot, we will pick a pivot
that partitions the list into *roughly* equal parts, with m being at
most 7n/10. This idea and the algorithm below were invented by Blum,
Floyd, Pratt, Rivest and Tarjan in 1972.
Here is how we pick the pivot. We partition the list into n/5 groups
of 5 elements each. We then find the median in each group. Note that
this is a constant time operation per group since each group is of
constant size. This gives us a list of n/5 medians. We now proceed to
recursively find the median of this list of medians. Call it p. We
then use this median of medians p to pivot the original list.
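Putting the pieces together, the whole algorithm might look like the
sketch below. The names choose_pivot and median_of_medians_select are
ours, and a few details go beyond the notes and should be read as
assumptions: lists of at most 5 elements are solved by sorting, the
last group may have fewer than 5 elements, and elements equal to the
pivot are counted separately so that duplicates are handled.

```python
def choose_pivot(items):
    """Return the median of the medians of groups of 5 elements."""
    groups = [items[i:i + 5] for i in range(0, len(items), 5)]
    # Each group has constant size, so sorting it takes constant time.
    medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]
    # Recursively find the median of the list of medians.
    return median_of_medians_select(medians, (len(medians) + 1) // 2)

def median_of_medians_select(items, k):
    """Return the k-th smallest element of items (1-indexed)."""
    if not 1 <= k <= len(items):
        raise ValueError("k must be between 1 and len(items)")
    items = list(items)
    while True:
        if len(items) <= 5:          # base case: constant-size list
            return sorted(items)[k - 1]
        pivot = choose_pivot(items)
        smaller = [x for x in items if x < pivot]
        larger = [x for x in items if x > pivot]
        equal = len(items) - len(smaller) - len(larger)
        if k <= len(smaller):
            items = smaller          # answer lies in the first part
        elif k <= len(smaller) + equal:
            return pivot             # answer is the pivot itself
        else:
            k -= len(smaller) + equal
            items = larger           # answer lies in the second part

print(median_of_medians_select([5, 20, 13, 2, 7, 14, 8, 1, 10], 5))
```

The loop replaces the second recursive call (the one on a part of the
list); only the pivot computation recurses explicitly.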
Before we analyze the time complexity of this algorithm, we claim that
we pick a good pivot -- m is at most 7n/10.
Claim: At least 3n/10 elements are less than or equal to, and at least
3n/10 elements are greater than or equal to, the median of medians p
picked above. Consequently, neither side of the partition contains
more than 7n/10 elements.
Proof: p is a median of n/5 medians. Therefore, at least n/10 of the
medians (including p itself) are less than or equal to p. Each of
these is the median of a group of 5 elements, so it accounts for 3
elements of its group (itself and the two below it) that are less
than or equal to p. Since the groups are disjoint, a total of at least
3n/10 elements are less than or equal to p, and none of these can land
among the elements larger than p. The same argument can be made for
elements greater than or equal to p.
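The claim can be checked empirically on a random list. The sketch
below is only a sanity check, with hypothetical helper names; it finds
the median of medians by sorting (which is fine for a check, though
not for the algorithm itself) and counts the elements that are at most
p and at least p, counting p itself in both.

```python
import random

def median_of_medians_pivot(items):
    """Pivot chosen by the groups-of-5 rule; sorting keeps the check short."""
    groups = [items[i:i + 5] for i in range(0, len(items), 5)]
    medians = sorted(sorted(g)[(len(g) - 1) // 2] for g in groups)
    return medians[(len(medians) - 1) // 2]

def count_around_pivot(items):
    """Return (number of elements <= p, number of elements >= p)."""
    p = median_of_medians_pivot(items)
    at_most = sum(1 for x in items if x <= p)
    at_least = sum(1 for x in items if x >= p)
    return at_most, at_least

random.seed(0)                           # arbitrary seed, for repeatability
n = 1000                                 # a multiple of 5, as the notes assume
data = random.sample(range(10 * n), n)   # n distinct values
at_most, at_least = count_around_pivot(data)
print(at_most, at_least, "both should be at least", 3 * n // 10)
```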
Finally, let us consider the running time of this algorithm. We take
O(n) time to find the medians of the n/5 groups. We then take T(n/5)
time to recursively find a median of medians. This gives us a
pivot. We then take O(n) time to partition the list using this
pivot. Finally, we take T(7n/10) time to recursively solve the
selection problem over one of the partitions of the list.
Therefore, we have T(n) = T(n/5) + T(7n/10) + cn for some constant c.
There are several ways of solving this recurrence. Here, we use the
"recursion tree" method. At the top level, we spend cn time. At the
next level, we spend cn/5 time for the first recursive call, and
7cn/10 time for the second recursive call, which gives us a total of
9cn/10 time. Likewise, at the next level, we spend 81cn/100 time, and
so on. Our total time spent looks like
T(n) = cn + 9cn/10 + 81cn/100 + 729cn/1000 + ...
Note that this is a geometric series: each term is 9/10 of the
previous one. Even if this process continues forever, the total time
taken is within a constant factor of the first term and is therefore
O(n).
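The sum can also be bounded in closed form. As a sketch, treating the
recursion tree as infinitely deep (which only overestimates the work):

```latex
\[
T(n) \;\le\; cn \sum_{i=0}^{\infty} \left(\frac{9}{10}\right)^{i}
      \;=\; \frac{cn}{1 - 9/10} \;=\; 10\,cn \;=\; O(n).
\]
```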
This completes our analysis of the algorithm.
A few remarks are in order.
1. Note that in analyzing the running time of the above algorithm, one
crucial property that we used was that n/5 + 7n/10 < n. That is,
the sum of the sizes of the subproblems we solved recursively was
strictly smaller than the size of the original problem. This turned
out to be important because it implied that the total work at each
level of the recursion would be a constant factor smaller than that
of the previous level. Contrast this with mergesort, where the two
subproblems together have size n, every level costs about cn, and
the running time is O(n log n). This also hints at a metatheorem for
recurrences:
For constants c > 0 and a1, a2, ..., ak > 0 such that
a1 + ... + ak < 1, the recurrence
T(n) = cn + T(a1 n) + T(a2 n) + ... + T(ak n) solves to
T(n) = Theta(n).
You should prove this theorem as an exercise.
2. What is a lower bound on the selection problem? Note that the
number of different possible outputs of the algorithm is n (each
element in the list could be the median). This implies that any
algorithm in the comparison-based model must take time
Omega(log n). However, note also that any algorithm must read the
entire list before determining the median (think about why this is
the case). Therefore, another (better!) lower bound on finding the
median is Omega(n). Our algorithm above thus achieves the best
possible running time (within constant factors).
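The metatheorem in remark 1 can be explored numerically before
attempting a proof. The sketch below is an illustration under our own
conventions: make_recurrence is a hypothetical helper, subproblem
sizes are rounded down, T(0) = 0, and each a_i is given as a
(numerator, denominator) pair with numerator < denominator. It prints
T(n)/n for the selection recurrence and, for contrast, for the
mergesort recurrence, whose fractions sum to exactly 1.

```python
from functools import lru_cache

def make_recurrence(fractions, c=1):
    """Build T with T(0) = 0 and T(n) = c*n + sum of T(floor(a_i * n)).

    fractions is a list of (numerator, denominator) pairs, one per a_i.
    """
    @lru_cache(maxsize=None)
    def T(n):
        if n == 0:
            return 0
        return c * n + sum(T(n * num // den) for num, den in fractions)
    return T

selection = make_recurrence([(1, 5), (7, 10)])   # a1 + a2 = 9/10 < 1
mergesort = make_recurrence([(1, 2), (1, 2)])    # a1 + a2 = 1

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, selection(n) / n, mergesort(n) / n)
```

The first ratio should stay below a fixed constant as n grows, while
the second keeps growing roughly like log n.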