Due: In two weeks (April 8)
Work with: A partner
Make: Lots of graphs
Updates:
None (yet).
Overview:
In this assignment, you are going to gain some familiarity with Hadoop by
building an external (disk-to-disk) sample sort. The project is broken down
into four basic steps, for your viewing pleasure.
Step 1: Get Hadoop up and running. This is supposed to be the easy
step. Get a simple word count (or other) program running on both a single
machine (which will be useful for debugging) and a cluster of machines.
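To get you started, here is a rough sketch of word count against the
org.apache.hadoop.mapreduce API. Exact class names and job-setup calls vary
a bit across Hadoop versions, so treat this as a sketch rather than
something to paste in verbatim:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1) per token
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));  // (word, total)
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Run it on a small input on one machine first; once the output looks right,
the same jar should run unchanged on the cluster.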
Step 2: Create a program in Hadoop that generates keys. For this
sorting task, you are to write a program that creates some number of records
that your sorting program will, well, sort. Each record should be 100 bytes in
size; the first 10 bytes are the key on which you sort. The
RandomWriter class may be a good starting point.
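To make the record format concrete, here is one way to build and emit
records. The RecordGen name and the choice of printable key bytes are mine,
not requirements (printable keys are just easier to eyeball); a real
generator would wrap this loop in a map-only Hadoop job, as RandomWriter
does:

    import java.io.BufferedOutputStream;
    import java.io.OutputStream;
    import java.util.Random;

    public class RecordGen {
      static final int RECORD_SIZE = 100;  // total record size in bytes
      static final int KEY_SIZE = 10;      // first 10 bytes are the key

      // Build one record: a 10-byte key of printable characters
      // followed by 90 bytes of random payload.
      static byte[] makeRecord(Random rng) {
        byte[] record = new byte[RECORD_SIZE];
        rng.nextBytes(record);  // random payload everywhere first
        for (int i = 0; i < KEY_SIZE; i++) {
          record[i] = (byte) ('A' + rng.nextInt(26));  // uniform key bytes
        }
        return record;
      }

      // Write the requested number of records to stdout.
      public static void main(String[] args) throws Exception {
        long numRecords = Long.parseLong(args[0]);
        OutputStream out = new BufferedOutputStream(System.out);
        Random rng = new Random();
        for (long i = 0; i < numRecords; i++) {
          out.write(makeRecord(rng));
        }
        out.flush();
      }
    }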
How many keys should you generate? Well, we are going to be shooting for
the MinuteSort
benchmark, which asks you to sort as many of these records as possible
within a minute. (For scale: at 100 bytes per record, 10 million records
is 1 GB.)
What kind of distribution should you generate? At first, start with a
simple uniform distribution. However, to make this project more interesting,
you will also have to generate some non-uniform distributions (otherwise, a
sample sort isn't so interesting). You have some flexibility here; pick
something that you can play around with, making it more or less skewed towards
some part of the key range. Also, make sure that your distribution is coming
out as planned! Don't just assume it is; prove that it is, both to yourself
and to me.
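Here is one sketch of a tunably skewed generator: raise a uniform draw to a
power and expand it in base 26 across the key bytes. The skew parameter
(and its name) is an illustration, not a requirement; any distribution you
can dial toward part of the key range will do. The main method doubles as
the proof mentioned above: it histograms the first key byte so you can see
the skew:

    import java.util.Random;

    public class SkewedKeys {
      static final int KEY_SIZE = 10;

      // skew = 1.0 is uniform; larger values pile keys toward the low
      // end of the key range.
      static byte[] skewedKey(Random rng, double skew) {
        double u = Math.pow(rng.nextDouble(), skew);  // skew > 1 biases toward 0
        byte[] key = new byte[KEY_SIZE];
        // Expand u in base 26 across the key bytes ('A'..'Z').
        for (int i = 0; i < KEY_SIZE; i++) {
          u *= 26;
          int digit = (int) u;
          key[i] = (byte) ('A' + digit);
          u -= digit;
        }
        return key;
      }

      // Sanity check: histogram the first key byte to see the skew.
      public static void main(String[] args) {
        double skew = Double.parseDouble(args[0]);
        Random rng = new Random();
        int[] hist = new int[26];
        for (int i = 0; i < 1000000; i++) {
          hist[skewedKey(rng, skew)[0] - 'A']++;
        }
        for (int i = 0; i < 26; i++) {
          System.out.println((char) ('A' + i) + " " + hist[i]);
        }
      }
    }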
Step 3: Run the default Hadoop sort on both uniform and non-uniform
distributions. You should do this on 8 machines. What kind of performance
do you get? One way to get better performance on the non-uniform distributions
is to use more reducers. Make some graphs, varying the number of
reducers. Does this help? Making lots of graphs is a good idea, because it
convinces me that you care.
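Sweeping the reducer count is just a loop over jobs. The sketch below
mirrors the setup of Hadoop's example Sort program (identity map and
reduce, so the shuffle does the sorting); the class names assume the
org.apache.hadoop.mapreduce API, and the input is assumed to be a
SequenceFile of BytesWritable pairs, as a RandomWriter-style generator
would produce:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class SortSweep {
      public static void main(String[] args) throws Exception {
        int[] reducerCounts = {1, 2, 4, 8, 16};
        for (int r : reducerCounts) {
          Job job = Job.getInstance(new Configuration(), "sort-r" + r);
          job.setJarByClass(SortSweep.class);
          // No mapper/reducer set: the identity defaults apply, and
          // the shuffle sorts records by key.
          job.setNumReduceTasks(r);   // the knob being varied
          job.setInputFormatClass(SequenceFileInputFormat.class);
          job.setOutputFormatClass(SequenceFileOutputFormat.class);
          job.setOutputKeyClass(BytesWritable.class);
          job.setOutputValueClass(BytesWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job,
              new Path(args[0] + "-sorted-r" + r));
          long start = System.currentTimeMillis();
          job.waitForCompletion(true);
          System.out.println(r + " reducers: "
              + (System.currentTimeMillis() - start) + " ms");
        }
      }
    }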
Step 4: Make a sample sort. The sampling step should look through the data
and figure out some things about the key distribution. With this knowledge, the
sort should be able to do better load balancing. Some open questions: which
keys to sample? Should only one node sample? Should each node sample? Which
parts of the file should be sampled? And finally, how do you use this
information to make the sort work better? Again, I expect lots of
graphs. Really.
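The piece that turns sample information into load balancing is the
partitioner. Here is a sketch of one: sorted split points plus a binary
search per key, so each reducer gets roughly an equal slice of the actual
key distribution. The static setSplits hook is purely illustrative; in a
real job you would compute the splits in your sampling pass and ship them
to every mapper through the job Configuration or the DistributedCache.
(Newer Hadoop releases ship a TotalOrderPartitioner along these lines, but
rolling your own is instructive.)

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class SamplePartitioner
        extends Partitioner<BytesWritable, BytesWritable> {
      // Split points, sorted ascending; there should be one fewer split
      // than there are reducers. A hard-coded static is illustrative
      // only: a real job reads these from the Configuration or the
      // DistributedCache so every mapper JVM sees them.
      private static byte[][] splits = new byte[0][];

      public static void setSplits(byte[][] sortedSplits) {
        splits = sortedSplits;
      }

      @Override
      public int getPartition(BytesWritable key, BytesWritable value,
                              int numPartitions) {
        // Binary search: keys below splits[0] go to partition 0, keys
        // in [splits[i-1], splits[i]) go to partition i, and so on.
        int lo = 0, hi = splits.length;
        while (lo < hi) {
          int mid = (lo + hi) / 2;
          if (compare(key.getBytes(), key.getLength(), splits[mid]) < 0) {
            hi = mid;
          } else {
            lo = mid + 1;
          }
        }
        return lo;
      }

      // Unsigned lexicographic comparison of key bytes against a split.
      private static int compare(byte[] a, int aLen, byte[] b) {
        int n = Math.min(aLen, b.length);
        for (int i = 0; i < n; i++) {
          int diff = (a[i] & 0xff) - (b[i] & 0xff);
          if (diff != 0) return diff;
        }
        return aLen - b.length;
      }
    }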
Some details:
The cluster consists of machines adsl01.cs through adsl08.cs. You should have
an account on them. If not, please email me! Do most of your development on a
single machine, and just use these for performance runs on 4 or 8 machines. We
will figure out a scheme for sharing the cluster shortly.
Also, please read this paper.
Finally, you need to write up what you did. Thus, I expect a report (with
all those graphs) when all is said and done. The report should probably be
8-10 pages of text, plus as much space as you need for figures and code.