UW-Madison
Computer Sciences Dept.

CS739 Spring 2008: Project 1 -- HadoopSort

Due: In two weeks (April 8)
Work with: A partner
Make: Lots of graphs

Updates: None (yet).

Overview: In this assignment, you are going to gain some familiarity with Hadoop by building an external (disk-to-disk) sample sort. The project is broken down into four basic steps, for your viewing pleasure.

Step 1: Get Hadoop up and running. This is supposed to be the easy step. Get a simple word count (or other) program running on both a single machine (which will be useful for debugging) and a cluster of machines.
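You don't have to write word count from scratch (it ships in the Hadoop examples jar), but if you want something of your own to poke at, a minimal version against the old-style org.apache.hadoop.mapred API looks roughly like this (a sketch; API details shift between Hadoop releases):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      // Mapper: emit (word, 1) for every token in the line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            out.collect(word, ONE);
          }
        }
      }
      // Reducer: sum the counts for each word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext())
            sum += values.next().get();
          out.collect(key, new IntWritable(sum));
        }
      }
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

Getting this working in Hadoop's standalone (single-machine) mode first makes debugging far less painful than chasing failures across the cluster.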

Step 2: Create a program in Hadoop that generates keys. For this sorting task, you are to write a program that creates some number of records that your sorting program will, well, sort. Each record should be 100 bytes in size; the first 10 bytes are the key on which you sort. The RandomWriter class may be a good starting point.
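To make the record format concrete, here is a minimal single-process sketch (the class name RecordGen and its arguments are made up for illustration): it writes 100 bytes of payload per record, a 10-byte key followed by 90 bytes of filler, into a SequenceFile. For real runs you would parallelize this as a map-only job, the way RandomWriter does.

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class RecordGen {
      // Usage: RecordGen <output-path> <num-records>
      public static void main(String[] args) throws Exception {
        long numRecords = Long.parseLong(args[1]);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[0]),
            BytesWritable.class, BytesWritable.class);
        Random rand = new Random();
        byte[] key = new byte[10];    // the 10-byte sort key
        byte[] value = new byte[90];  // filler to reach 100 bytes
        for (long i = 0; i < numRecords; i++) {
          rand.nextBytes(key);        // uniform keys, to start with
          rand.nextBytes(value);
          writer.append(new BytesWritable(key), new BytesWritable(value));
        }
        writer.close();
      }
    }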

How many keys should you generate? Well, we are going to be shooting for the MinuteSort benchmark, which asks you to sort as many of these records as possible within a minute.
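A hypothetical back-of-the-envelope to size this: the data gets read at least once and written at least once, so if each of the 8 nodes could stream, say, 50 MB/s to disk (a made-up number; measure your machines), 60 seconds gives about 24 GB of total I/O, which bounds you at roughly 12 GB of sorted data, or on the order of 120 million 100-byte records. Intermediate spills and the network shuffle will pull the real number lower.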

What kind of distribution should you generate? At first, start with a simple uniform distribution. However, to make this project more interesting, you will also have to generate some non-uniform distributions (otherwise, a sample sort isn't so interesting). You have some flexibility here; pick something that you can play around with, making it more or less skewed towards some part of the key range. Also, make sure that your distribution is coming out as planned! (Don't just assume it is; demonstrate that it is, both to yourself and to me.)
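One easy-to-tune skew knob, as a sketch: raise a uniform draw to a power s and use it for the high-order key byte (the name SkewedKeys and this particular scheme are inventions here, not anything from Hadoop). The main() prints a 16-bucket histogram over a million keys, which is exactly the kind of evidence that the distribution is what you claim.

    import java.util.Random;

    public class SkewedKeys {
      // s = 1 gives uniform keys; larger s piles keys toward the
      // bottom of the key range (smaller s, toward the top).
      static void skewedKey(Random rand, double s, byte[] key) {
        key[0] = (byte) (Math.pow(rand.nextDouble(), s) * 256);
        for (int i = 1; i < key.length; i++)   // remaining bytes uniform
          key[i] = (byte) rand.nextInt(256);
      }

      public static void main(String[] args) {
        double s = Double.parseDouble(args[0]);
        Random rand = new Random();
        byte[] key = new byte[10];
        int[] hist = new int[16];
        for (int i = 0; i < 1000000; i++) {
          skewedKey(rand, s, key);
          hist[(key[0] & 0xff) / 16]++;  // bucket by unsigned high byte
        }
        for (int b = 0; b < 16; b++)
          System.out.println("bucket " + b + ": " + hist[b]);
      }
    }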

Step 3: Run the default Hadoop sort on both uniform and non-uniform distributions. You should do this on 8 machines. What kind of performance do you get? One way to get better performance on the non-uniform distributions is to use more reducers. Make some graphs, varying the number of reducers. Does this help? Making lots of graphs is a good idea, because it convinces me that you care.
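The stock sort lives in the Hadoop examples jar; in releases of this vintage it takes the reducer count via a -r flag, so the runs look something like the line below (check the usage message for your exact version):

    bin/hadoop jar hadoop-*-examples.jar sort -r <numReduces> <inDir> <outDir>

Programmatically, the same knob is JobConf.setNumReduceTasks().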

Step 4: Make a sample sort. The sampler should look through the data and figure out some things about the key distribution. With this knowledge, the sort should be able to do better load balancing. Some open questions: which keys to sample? Should only one node sample, or should each node sample? Which parts of the file should be sampled? And finally, how do you use this information to make the sort work better? Again, I expect lots of graphs. Really.
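To make "use this information" concrete: the usual trick is to turn the samples into numReduceTasks-1 boundary keys and plug them into a custom Partitioner, so that reducer i gets the i-th key range. A sketch against the old-style mapred API (the property name "samplesort.splits" and its hex encoding are conventions invented here; your sampling pass would fill it in before submitting the sort job):

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SamplePartitioner
        implements Partitioner<BytesWritable, BytesWritable> {
      private BytesWritable[] splits;  // sampled boundary keys, sorted

      public void configure(JobConf job) {
        // Hypothetical convention: boundaries arrive as a comma-separated
        // list of hex strings stashed in the JobConf by the sampling pass.
        String[] hex = job.get("samplesort.splits", "").split(",");
        splits = new BytesWritable[hex.length];
        for (int i = 0; i < hex.length; i++) {
          byte[] b = new byte[hex[i].length() / 2];
          for (int j = 0; j < b.length; j++)
            b[j] = (byte) Integer.parseInt(
                hex[i].substring(2 * j, 2 * j + 2), 16);
          splits[i] = new BytesWritable(b);
        }
      }

      public int getPartition(BytesWritable key, BytesWritable value,
                              int numPartitions) {
        // Binary search the boundaries: keys below splits[0] go to
        // reducer 0, and so on, so each reducer sees a roughly equal
        // share of records even when the key distribution is skewed.
        int lo = 0, hi = splits.length;
        while (lo < hi) {
          int mid = (lo + hi) >>> 1;
          if (key.compareTo(splits[mid]) < 0) hi = mid;
          else lo = mid + 1;
        }
        return lo;
      }
    }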

Some details: The cluster consists of machines adsl01.cs through adsl08.cs. You should have an account on them; if not, please email me! Do most of your development on a single machine, and use the cluster only for performance runs on 4 or 8 machines. We will figure out a scheme for sharing the cluster shortly.

Also, please read this paper.

Finally, you need to write up what you did. Thus, I expect a report (with all those graphs) when all is said and done. The report should be roughly 8-10 pages of text, plus as much space as you need for figures and code (if you include any).