CS 537-2: Spring 2000
Programming Assignment I


External Sorting with Threads

Due: Tuesday, February 8th, 2000 at 8 pm.

Test Suites will be run the following day.



Objectives

There are three main objectives to this assignment:

  1. To familiarize yourself with the Sun workstations in the lab.
  2. To learn about the more advanced features of Java, for example:
    starting threads, waiting for their termination, and dealing with exceptions.
  3. To develop your defensive programming skills.

Overview

In this assignment, you will implement a simple external sorting program. The difference between an external sort and an internal sort is that in an external sort, the keys to be sorted begin in a file; in an internal sort, the keys are already in main memory. You will also be exploring an external sort in Assignment 2, so you might wonder why we are placing such a focus on one type of application. There are at least four good reasons:

First, sorting is extremely practical: it is an important technique used in almost every area of computer science. Second, you should already be familiar with sorting algorithms from your previous classes (e.g., insertion sort, mergesort, quicksort, and radix sort), allowing you to focus on the threading and synchronization issues in this assignment. Third, sorting, in particular external sorting, stresses many areas of the operating system (e.g., memory management and the file system), so it can help you learn more about those system components from the perspective of a very demanding application. And, finally, if you become an expert at external sorting, you can show off by entering a competition each year for who can sort the most keys in a minute.

The goal of your external sort is to take multiple files of unsorted non-negative integer keys as input and to produce one file containing the same keys sorted in ascending order. While there are many ways that your program could accomplish this goal, for this assignment you must have two phases: a parallel sort phase followed by a merge phase.

In the first phase, you will start up one thread for each of the numFiles input files; each thread reads into memory the keys from its input file, sorts its keys in ascending order, and then terminates. You do not need to write a sorting routine yourself; you can use a suitable Java library routine if you wish. After all of the threads have finished executing, you will have numFiles sorted lists of keys in memory; we will call each of these sorted lists a run.
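The first phase might be sketched as follows. This is only one possible arrangement, not a required design. The class names RunSorter and SortPhase are made up for illustration, and the sketch assumes that each input file holds one decimal key per line, which this handout does not specify. It is written for a current JDK.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical worker: reads one input file and sorts its keys in memory.
class RunSorter implements Runnable {
    private final String fileName;
    private final List<Integer> keys = new ArrayList<>();
    private Exception failure;  // remembered so the main thread can report it

    RunSorter(String fileName) { this.fileName = fileName; }

    @Override
    public void run() {
        // Assumption: one decimal key per line; blank lines are skipped.
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) keys.add(Integer.parseInt(line));
            }
            Collections.sort(keys);  // library sort, ascending order
        } catch (IOException | NumberFormatException e) {
            failure = e;
        }
    }

    List<Integer> getKeys() { return keys; }
    Exception getFailure() { return failure; }
}

class SortPhase {
    // Starts one thread per input file and returns the sorted runs.
    static List<List<Integer>> sortAll(String base, int numFiles) throws Exception {
        RunSorter[] workers = new RunSorter[numFiles];
        Thread[] threads = new Thread[numFiles];
        for (int i = 0; i < numFiles; i++) {
            workers[i] = new RunSorter(base + "." + i);
            threads[i] = new Thread(workers[i]);
            threads[i].start();
        }
        List<List<Integer>> runs = new ArrayList<>();
        for (int i = 0; i < numFiles; i++) {
            threads[i].join();  // wait for this worker to terminate
            if (workers[i].getFailure() != null) throw workers[i].getFailure();
            runs.add(workers[i].getKeys());
        }
        return runs;
    }
}
```

Each worker records any failure instead of printing it, so the main thread can decide how to report the error after the join.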

In the second phase, a single thread merges together all of the runs into a single sorted output file. You do not need to implement anything sophisticated for this (for example, you do not need to implement a tournament tree). To perform a very simple merge, have a single thread compare the first key in each run, pick the minimum key, copy it to the output list, and remove that key from the top of its run. If this step is repeated until all keys have been removed from every run, the output list will contain all of the keys in sorted order.
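The simple merge described above could look roughly like this. The class name SimpleMerge is hypothetical. Instead of physically removing a key from the top of its run, the sketch advances an index into each run, which has the same effect. Writing the merged keys to the output file is left out.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleMerge {
    // Merges sorted runs by repeatedly taking the smallest first key.
    static List<Integer> merge(List<List<Integer>> runs) {
        int[] next = new int[runs.size()];  // first unconsumed position in each run
        List<Integer> out = new ArrayList<>();
        while (true) {
            int best = -1;  // run holding the smallest first key seen so far
            for (int r = 0; r < runs.size(); r++) {
                if (next[r] >= runs.get(r).size()) continue;  // run exhausted
                if (best < 0
                        || runs.get(r).get(next[r]) < runs.get(best).get(next[best])) {
                    best = r;
                }
            }
            if (best < 0) break;  // all runs are exhausted
            out.add(runs.get(best).get(next[best]));
            next[best]++;  // "remove" the key from the top of its run
        }
        return out;
    }
}
```

Empty runs and duplicate keys need no special handling here: an empty run is never selected, and equal keys are simply copied one after another.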

At this point, you might be asking yourself why anyone would write a sorting application with multiple threads. Why not just have a single thread that reads and sorts each file sequentially? The reason for creating a thread for each input file is that we can improve performance: when one thread is waiting (e.g., for the OS to read from disk), another thread can be running and performing useful work (e.g., sorting its data). We will learn more about this in the course when we talk about Scheduling.

Program Specification

Your program must accept command line arguments as follows:

	java ExtSort [numFiles] [baseInputFile] [OutputFile]
where numFiles is the number of input files, baseInputFile is the base name shared by the input files, and OutputFile is the name of the output file. For example, if you run your program as

	java ExtSort 8 in out
then your program should read the input files "in.0", "in.1", "in.2", "in.3", "in.4", "in.5", "in.6", and "in.7" and write a single output file "out".
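Deriving the input file names from the command line arguments might be sketched like this. The class and method names are hypothetical, and rejecting a numFiles value below 1 is an assumption made for the sketch rather than a stated requirement.

```java
class ArgCheck {
    // Returns the names base.0 through base.(numFiles-1).
    static String[] inputNames(String numFilesArg, String base) {
        int numFiles;
        try {
            numFiles = Integer.parseInt(numFilesArg);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "numFiles must be an integer, got: " + numFilesArg);
        }
        if (numFiles < 1) {  // assumption: at least one input file is required
            throw new IllegalArgumentException("numFiles must be at least 1");
        }
        String[] names = new String[numFiles];
        for (int i = 0; i < numFiles; i++) {
            names[i] = base + "." + i;
        }
        return names;
    }
}
```

With the arguments from the example above, inputNames("8", "in") would produce the eight names "in.0" through "in.7".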

Defensive programming is an important concept in operating systems: an OS can't simply fail when it encounters an error; it must check all parameters before it trusts them. Therefore, your program must catch all possible exceptions and respond to all input in a reasonable manner. For this assignment, "reasonable" means that your program prints an understandable error message to the user in the following situations:

These requirements will be tested extensively!

Make sure that you can handle input files that contain duplicate keys or are entirely empty of keys! These are not error conditions.

Using Threads

One of the main purposes of this assignment is to become familiar with using threads. Your primary class will create a new thread to read and sort each of the input files. It will then wait until the threads have finished before continuing its own execution. There are two ways to start threads in Java. The first is to derive your class from the Thread class and then override its run() method. The second is to use the Runnable interface. With this approach, you create a class that implements Runnable and pass an instance of this class into the constructor of a new Thread object. Although the first approach seems simpler at first, it has some pitfalls, so we recommend using the second approach.
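The two approaches can be compared side by side in a small sketch. The class names are invented for illustration, and the work done in run() is a stand-in for reading and sorting a file.

```java
// Approach 1: derive from Thread and override run().
class CountingThread extends Thread {
    int total;

    @Override
    public void run() {
        for (int i = 1; i <= 10; i++) total += i;
    }
}

// Approach 2: implement Runnable and pass the object to a Thread.
class CountingTask implements Runnable {
    int total;

    @Override
    public void run() {
        for (int i = 1; i <= 10; i++) total += i;
    }
}

class ThreadDemo {
    // Starts one thread of each kind, waits for both, and returns their totals.
    static int[] runBoth() throws InterruptedException {
        CountingThread first = new CountingThread();
        CountingTask task = new CountingTask();
        Thread second = new Thread(task);

        first.start();   // start(), not run(): run() would execute in the caller
        second.start();

        first.join();    // wait for termination before reading the results
        second.join();
        return new int[] { first.total, task.total };
    }
}
```

One practical point in favor of Runnable is that the task class remains free to extend some other class, since Java allows only a single superclass.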

Exceptions

Java requires that any call that might throw a checked exception either appear within a try block or be made from a method that declares the exception. Following the try block is a catch clause (or catch clauses) that will be used to catch any exceptions that have been thrown. See Java for C++ Programmers for more information about exceptions. Your code should deal with exceptions in an appropriate manner. For example, attempting to open a file that does not exist should result in a message to the user and termination of the program using System.exit().
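As a rough illustration of the missing-file case, the sketch below separates detecting the problem from terminating the program. The class name, method names, and message wording are all hypothetical.

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

class OpenCheck {
    // Returns null if the file can be opened, otherwise a message for the user.
    static String checkReadable(String fileName) {
        try (FileReader in = new FileReader(fileName)) {
            return null;  // opened successfully; closed automatically on return
        } catch (FileNotFoundException e) {
            return "ExtSort: cannot open input file " + fileName;
        } catch (IOException e) {
            return "ExtSort: error closing " + fileName + ": " + e.getMessage();
        }
    }

    // Prints the message and terminates the program if there was an error.
    static void exitOnError(String message) {
        if (message != null) {
            System.err.println(message);
            System.exit(1);  // non-zero status signals failure
        }
    }
}
```

A main method could then call OpenCheck.exitOnError(OpenCheck.checkReadable(name)) for each input file before starting any threads.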

Grading

For the remaining projects in this course, you will be working in two-person teams, but for this project, you will work alone.

We have created a directory ~cs537-2/handin/NAME, where NAME is your login name. For example, if your login is lab, your handin directory is ~cs537-2/handin/lab. Your handin directory has five subdirectories: p1, p2, p3, p4, and p5. For this assignment, use directory p1. Copy all your .java source files and your README into your handin directory. Do not submit any .class files. After the deadline for this project, you will be prevented from making any changes in this directory. Remember: No late projects will be accepted!

Hand in your source code and a short README that documents your program. Your README should have two sections:

The majority of your grade for this assignment will depend upon how well your implementation works. On the Wednesday after the assignment is due, we will have a Demo Day. You will sign up for a time-slot with either the TA or the Instructor in one of the Instructional labs. We will run your program on a suite of about 20 test cases, some of which will exercise your program's ability to correctly sort different input files and some of which will test your program's ability to catch error conditions. Be sure that you thoroughly exercise your program's capabilities on a wide range of test suites, so that you will not be unpleasantly surprised when we run our tests. Your grade will be determined by:

We will check your code to verify that you are starting up a thread for every input file; if you do not follow this requirement, you will receive absolutely no credit for the project.

Even though you will not be heavily graded on style for this assignment, you should still follow all the principles of software engineering you learned in CS 302 and CS 367, such as top-down design, good indentation, meaningful variable names, modularity, and helpful comments. After all, you should be following these principles for your own benefit.

Suggestions

Note that additional hints and explanations may be found on the FAQ (Frequently Asked Questions) page for this project. You should get in the habit of checking this page daily, as new hints and clarifications may be added at any time.

This project is not meant to be difficult; your finished program will probably be under 200 lines, including comments. If you find that you are writing a lot of code, it probably means that you are doing something wrong and should take a break from hacking and instead think about what you are trying to do.


dusseau@cs.wisc.edu
Thu Jan 22 17:49:38 CST 2000