Due: Monday, May 12
Work with: A partner
Make: Lots of graphs
Updates:
None (yet).
Overview:
In this assignment, you will be building on your previous external (disk-to-disk) sample sort. The project is broken down
into four basic steps, for your viewing pleasure.
Step 1: Compile Hadoop. This is supposed to be the easy
step. Make sure that you can (trivially) modify Hadoop, compile it,
and run your new version.
Step 2: Re-run the "default" Hadoop sort on both uniform and
non-uniform distributions of keys. Make sure that you configure
the partition function so that the final output really is sorted
across all involved machines; that is, the partition should be
performed on the high-order bits of the key. Specifically, if you
have R reducers, then you should use the top log_2 R bits to
partition. You should run all of your experiments on 8
machines. What kind of performance do you get? One way to get
better performance on the non-uniform distributions is to use more
reducers. Make some graphs, varying the number of reducers. Does this
help?
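To make the partitioning rule concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes keys are unsigned 32-bit integers and that R is a power of two; the names top_bits_partition and reducer_loads are made up for illustration. Because the reducer index comes from the high-order bits, every key sent to reducer i is no larger than any key sent to reducer i+1, so concatenating the reducer outputs in order gives a globally sorted result.

```python
from collections import Counter

KEY_BITS = 32  # assumption: keys are unsigned 32-bit integers


def top_bits_partition(key, num_reducers, key_bits=KEY_BITS):
    """Return the reducer index given by the top log_2 R bits of key."""
    # A power of two has exactly one bit set; reject anything else.
    if num_reducers < 1 or num_reducers & (num_reducers - 1):
        raise ValueError("num_reducers must be a power of two")
    bits = num_reducers.bit_length() - 1  # log_2 R
    return key >> (key_bits - bits)


def reducer_loads(keys, num_reducers, key_bits=KEY_BITS):
    """Count how many keys each reducer would receive."""
    counts = Counter(top_bits_partition(k, num_reducers, key_bits) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]
```

Calling reducer_loads on a sample of your non-uniform keys for several values of R is a cheap way to see how unbalanced the reducers will be before you spend cluster time on a full run.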
Step 3: Improve Hadoop. The main idea of the project is to modify Hadoop to provide
some support that will improve your sample sort. Possibilities
include: a primitive for reading random records in a file
instead of streaming through all records; distributed counters so that
different mappers/reducers know how many samples have been taken thus
far; infrastructure to reduce the start-up time for two sequential instances of
map/reduce that use the same nodes and files. Any other ideas?
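As one illustration of the first possibility, here is a rough sketch of reading random records by seeking rather than streaming. It is plain Python against a local file, not a Hadoop primitive, and it assumes fixed-size records (100 bytes here, a made-up value); the function name read_random_records is hypothetical. Sampling is done with replacement, and the chosen offsets are visited in increasing order so the seeks only move forward through the file.

```python
import os
import random

RECORD_SIZE = 100  # assumption: every record is exactly this many bytes


def read_random_records(path, num_samples, record_size=RECORD_SIZE, seed=None):
    """Return num_samples records chosen uniformly at random (with replacement)."""
    rng = random.Random(seed)
    num_records = os.path.getsize(path) // record_size
    if num_records == 0:
        return []
    indices = [rng.randrange(num_records) for _ in range(num_samples)]
    samples = []
    with open(path, "rb") as f:
        # Visit offsets in increasing order so the file is read front to back.
        for i in sorted(indices):
            f.seek(i * record_size)
            samples.append(f.read(record_size))
    return samples
```

A real version inside Hadoop would also have to deal with block boundaries and variable-length records, which this sketch ignores.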
Step 4: Compare. Show the time for the 3 versions of
sorting: default (but correct) Hadoop sort, your sample sort from the
mini-project, and your improved sample sort. Run on 8 machines and
pick interesting values of R. Make sure that you break down the time for
your sample sorts to show the sample phase and the full sort phase separately.
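If you drive your jobs from a script, one simple way to get the per-phase breakdown is to wrap each phase in a timer. The sketch below is only a suggestion; the PhaseTimer class and the phase names are made up, and it measures wall-clock time as seen by the driver script, not time reported by Hadoop itself.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock seconds per named phase."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to any earlier time recorded under the same phase name.
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed
```

For example, put the code that launches the sampling job inside `with timer.phase("sample"):` and the code that launches the full sort inside `with timer.phase("sort"):`, then plot timer.seconds for each value of R.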
Some details:
Final presentations will be similar to your mini-project
presentations. You should prepare slides and expect to talk for about
15 minutes. Feel free to talk to me anytime before then!