UW-Madison
Computer Sciences Dept.

CS739 Spring 2008: Project 2 -- Improved HadoopSort

Due: Monday, May 12
Work with: A partner
Make: Lots of graphs

Updates: None (yet).

Overview: In this assignment, you will be building on your previous external (disk-to-disk) sample sort. The project is broken down into four basic steps, for your viewing pleasure.

Step 1: Compile Hadoop. This is supposed to be the easy step. Make sure that you can (trivially) modify Hadoop, compile it, and run your new version.

Step 2: Re-run the "default" Hadoop sort on both uniform and non-uniform distributions of keys. Make sure that you configure the partition function so that the final output really is sorted across all involved machines; that is, the partition should be performed on the high-order bits of the key. Specifically, if you have R reducers, then you should use the top log_2 R bits to partition. You should run all of your experiments on 8 machines. What kind of performance do you get? One way to get better performance on the non-uniform distributions is to use more reducers. Make some graphs, varying the number of reducers. Does this help?
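
As a rough illustration of partitioning on the high-order bits, here is a minimal Java sketch. It is not Hadoop's own partitioner: the class name TopBitsPartition and the method partition are hypothetical, keys are assumed to be raw byte arrays compared as unsigned bytes, and R is assumed to be a power of two so that log_2 R is an integer. In a real job this logic would sit inside whatever partitioner class your Hadoop version expects.

```java
final class TopBitsPartition {
    // Hypothetical helper: maps a key to a reducer index in [0, numReducers)
    // using the top log2(numReducers) bits of the key.
    // Assumption: numReducers is a power of two.
    static int partition(byte[] key, int numReducers) {
        if (numReducers <= 0 || Integer.bitCount(numReducers) != 1) {
            throw new IllegalArgumentException("numReducers must be a power of two");
        }
        int bits = Integer.numberOfTrailingZeros(numReducers); // log2(R)
        if (bits == 0) {
            return 0; // a single reducer receives every key
        }
        // Pack the first four key bytes, most significant first, into an int.
        // Keys shorter than four bytes are padded with zero bytes.
        int prefix = 0;
        for (int i = 0; i < 4; i++) {
            int b = i < key.length ? (key[i] & 0xFF) : 0;
            prefix = (prefix << 8) | b;
        }
        // An unsigned shift keeps only the top `bits` bits of the prefix.
        return prefix >>> (32 - bits);
    }
}
```

With R = 8, a key whose first byte is 0x20 lands in partition 1 and a key whose first byte is 0xFF lands in partition 7. Because the index grows with the key prefix, concatenating the reducer outputs in index order gives a globally sorted result, but a skewed key distribution will load some reducers far more than others.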

Step 3: Improve Hadoop. The main idea of the project is to modify Hadoop to provide some support that will improve your sample sort. Possibilities include: a primitive for reading random records in a file instead of streaming through all records; distributed counters so that different mappers/reducers know how many samples have been taken thus far; infrastructure to reduce the start-up time for two sequential instances of map/reduce that use the same nodes and files. Any other ideas?
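
To make the first possibility concrete, here is a minimal sketch of reading random records by seeking instead of streaming. It is only an illustration under loud assumptions: records are fixed-length, the file is a local file opened with java.io.RandomAccessFile rather than a file in the distributed file system, and the names RandomRecordSampler, sampleOffsets, and readRecord are hypothetical. Hadoop's FSDataInputStream also offers a seek method, but the sketch avoids any Hadoop dependency.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

final class RandomRecordSampler {
    // Chooses `count` record-aligned byte offsets uniformly at random,
    // with replacement. Assumption: every record is exactly recordSize bytes.
    static long[] sampleOffsets(long fileLength, int recordSize, int count, long seed) {
        if (recordSize <= 0 || count < 0) {
            throw new IllegalArgumentException("recordSize must be positive, count non-negative");
        }
        long numRecords = fileLength / recordSize;
        if (numRecords <= 0) {
            throw new IllegalArgumentException("file holds no complete record");
        }
        Random rng = new Random(seed);
        long[] offsets = new long[count];
        for (int i = 0; i < count; i++) {
            // nextDouble() is in [0, 1), so index is in [0, numRecords - 1].
            long index = (long) (rng.nextDouble() * numRecords);
            offsets[i] = index * recordSize;
        }
        return offsets;
    }

    // Reads one record at the given offset without scanning the bytes before it.
    static byte[] readRecord(RandomAccessFile file, long offset, int recordSize)
            throws IOException {
        byte[] record = new byte[recordSize];
        file.seek(offset);
        file.readFully(record);
        return record;
    }
}
```

The cost of sampling then depends on the number of samples rather than on the file size, which is the point of the primitive. Variable-length records would need an index or a way to resynchronize on a record boundary, which this sketch does not attempt.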

Step 4: Compare. Show the time for the three versions of sorting: default (but correct) Hadoop sort, your sample sort from the mini-project, and your improved sample sort. Run on 8 machines and pick interesting values of R. Make sure that you break down the time for your sample sorts to show the sample phase and the full sort phase separately.
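
One simple way to get the per-phase breakdown is to time each phase from the driver program. The sketch below assumes each phase can be wrapped in a Runnable (for example, a call that submits a job and blocks until it finishes); the class PhaseTimer and its methods are hypothetical names, not part of Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PhaseTimer {
    // Elapsed wall-clock nanoseconds per phase, in the order phases first ran.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    // Runs `work` and adds its wall-clock duration to the named phase.
    // Timing the same phase name again accumulates into the same entry.
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            elapsedNanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    Map<String, Long> elapsedNanos() {
        return elapsedNanos;
    }
}
```

A driver might call timer.time("sample", ...) and then timer.time("sort", ...), and plot the two entries as a stacked bar for each value of R. Wall-clock time measured this way includes job start-up overhead, which matters if you are evaluating the start-up improvement from Step 3.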

Some details:

Final presentations will be similar to your mini-project presentations. You should prepare slides and expect to talk for about 15 minutes. Feel free to talk to me anytime before then!