External Sorting with Threads
There are three main objectives to this assignment:
In this assignment, you will implement a simple external sorting program. The difference between an external sort and an internal sort is that in an external sort, the keys to be sorted begin in a file; in an internal sort, the keys are already in main memory. You will also be exploring an external sort in Assignment 2, so you might wonder why we are placing such a focus on one type of application. There are at least four good reasons:
First, sorting is extremely practical: it is an important technique used in
almost every area of computer science. Second, you should already be
familiar with sorting algorithms from your previous classes (e.g., insertion
sort, mergesort, quicksort, and radix sort), allowing you to focus on the
threading and synchronization issues in this assignment. Third, sorting, in
particular external sorting, stresses many areas of the operating system
(e.g., memory management and the file system), so it can help you learn
more about those system components from the perspective of a very demanding
application. And, finally, if you become an expert at external sorting,
you can show off by entering a
The goal of your external sort is to take multiple files of unsorted non-negative integer keys as input and to produce one file containing the same keys sorted in ascending order. While there are many ways that your program could accomplish this goal, for this assignment you must have two phases: a parallel sort phase followed by a merge phase.
In the first phase, you will start up one thread for each of the
In the second phase, a single thread merges together all of the runs into a single sorted output file. You do not need to implement anything sophisticated for this (for example, you do not need to implement a tournament tree). To perform a very simple merge, have a single thread compare the first key in each run, pick the minimum key, copy it to the output list, and remove that key from the top of its run. If this step is repeated until all keys have been removed from every run, the output list will contain all of the keys in sorted order.
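The simple merge described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `SimpleMerge`, the method `merge`, and the choice to hold each run as an in-memory `int[]` are hypothetical and not part of the assignment. Instead of physically removing a key from the top of a run, the sketch advances a per-run cursor, which has the same effect.

```java
import java.util.List;

// Hypothetical helper, not part of the assignment specification.
final class SimpleMerge {
    /**
     * Merges runs that are each already sorted in ascending order.
     * Assumes every run fits in memory as an int[].
     */
    static int[] merge(List<int[]> runs) {
        int total = 0;
        for (int[] run : runs) {
            total += run.length;
        }
        int[] out = new int[total];
        int[] pos = new int[runs.size()]; // pos[r] = index of the first unmerged key in run r

        for (int n = 0; n < total; n++) {
            int best = -1; // index of the run whose first remaining key is smallest so far
            for (int r = 0; r < runs.size(); r++) {
                int[] run = runs.get(r);
                if (pos[r] >= run.length) {
                    continue; // this run is exhausted (or was empty to begin with)
                }
                if (best == -1 || run[pos[r]] < runs.get(best)[pos[best]]) {
                    best = r;
                }
            }
            // Copy the minimum key to the output and "remove" it from its run.
            out[n] = runs.get(best)[pos[best]++];
        }
        return out;
    }
}
```

For example, merging the runs {1, 4, 9}, {} and {2, 2, 7} produces 1, 2, 2, 4, 7, 9. Each output key costs one scan over all runs, which is acceptable here because the number of runs equals the number of input files.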
At this point, you might be asking yourself why anyone would write a
sorting application with multiple threads. Why not just have a single
thread that reads and sorts each file sequentially? The reason for
creating a thread for each input file is that we can improve performance:
when one thread is waiting (e.g., for the OS to read from disk), another
thread can be running and performing useful work (e.g., sorting its data).
We will learn more about this in the course when we talk about Scheduling.
Program Specification
Your program must accept command line arguments as follows:
    java ExtSort [numFiles] [baseInputFile] [OutputFile]
where the fields are to be interpreted as follows.
For example, if your program is invoked as
    java ExtSort 8 in out
then your program should read the input files "in.0", "in.1", "in.2", "in.3", "in.4", "in.5", "in.6", and "in.7" and write a single output file "out".
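The naming pattern in the example above (base name, a dot, then an index starting at 0) could be generated with a small helper like the one below. The class name `InputNames` and the method `inputFileNames` are hypothetical, and the sketch assumes numFiles has already been parsed and validated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the assignment specification.
final class InputNames {
    /** Returns base.0, base.1, ..., base.(numFiles - 1), in that order. */
    static List<String> inputFileNames(String base, int numFiles) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < numFiles; i++) {
            names.add(base + "." + i);
        }
        return names;
    }
}
```

Calling `InputNames.inputFileNames("in", 8)` yields the eight names "in.0" through "in.7" listed above.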
Defensive programming is an important concept in operating systems: an OS can't simply fail when it encounters an error; it must check all parameters before it trusts them. Therefore, your program must catch all possible exceptions and respond to all input in a reasonable manner. For this assignment, "reasonable" means that your program prints an understandable error message to the user in the following situations:
Make sure that you can handle input files that contain duplicate keys or
are entirely empty of keys! These are not error conditions.
Using Threads
One of the main purposes of this assignment is to become familiar with using
threads. Your primary class will create a new thread to read and sort each
of the input files. It will then wait until the threads have finished
before continuing its own execution. There are two ways to start threads in
Java. The first is to derive your class from the Thread class and then override its
run() method. The second is to
use the
Runnable interface. With
this approach, you create a class that implements Runnable and pass an instance of
this class into the constructor of a new Thread object. Although the first
approach seems simpler at first, it has some pitfalls, so we recommend
using the second approach.
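A minimal sketch of the Runnable approach, including the wait for all threads to finish, might look like the following. The names `RunnableSketch`, `SortTask`, and `sortAll` are hypothetical. To stay self-contained, each task sorts an in-memory array with `Arrays.sort` as a stand-in; in the assignment, each thread would instead read and sort one input file.

```java
import java.util.Arrays;

// Hypothetical illustration, not part of the assignment specification.
final class RunnableSketch {
    /** One unit of work: sort a single array of keys in place. */
    static final class SortTask implements Runnable {
        private final int[] keys;

        SortTask(int[] keys) {
            this.keys = keys;
        }

        @Override
        public void run() {
            Arrays.sort(keys); // stand-in for "read one file and sort its keys"
        }
    }

    /** Starts one thread per input array, then waits for all of them. */
    static void sortAll(int[][] inputs) {
        Thread[] workers = new Thread[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            workers[i] = new Thread(new SortTask(inputs[i]));
            workers[i].start(); // start() runs the task in a new thread; calling run() directly would not
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // blocks until this worker has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
    }
}
```

After `sortAll` returns, every array has been sorted, and the calling thread can safely read the results because `join()` guarantees that each worker's writes are visible to the thread that joined it.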
Exceptions
Java requires that any call to a method that may throw a checked exception
either appear within a try block or be in a method that declares the
exception in its throws clause. Following the try block is a catch clause (or catch clauses)
that will be used to catch any exceptions that have been thrown. See Java
for C++ Programmers for more information about exceptions. Your code
should deal with exceptions in an appropriate manner. For example,
attempting to open a file that does not exist should result in a message to
the user and termination of the program (using
System.exit()).
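One way to structure the missing-file case is sketched below. The class `OpenInput`, its method names, the wording of the messages, and the exit status 1 are all assumptions made for illustration; the assignment only asks for an understandable message followed by termination.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical illustration, not part of the assignment specification.
final class OpenInput {
    /**
     * Tries to open the file for reading and closes it again.
     * Returns null on success, or an error message suitable for the user.
     */
    static String checkReadable(String path) {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            return null; // opened successfully; try-with-resources closes it
        } catch (FileNotFoundException e) {
            return "Error: cannot open input file '" + path + "'";
        } catch (IOException e) {
            return "Error: I/O failure on '" + path + "': " + e.getMessage();
        }
    }

    /** Prints the message and terminates the program if the file is unusable. */
    static void requireReadable(String path) {
        String error = checkReadable(path);
        if (error != null) {
            System.err.println(error);
            System.exit(1); // nonzero status signals failure (an assumed convention here)
        }
    }
}
```

Keeping the check separate from the call to `System.exit()` makes the error path easy to exercise without killing the program that is testing it.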
Grading
For the remaining projects in this course, you will be working in
two-person teams, but for this project, you will work alone.
We have created a directory ~cs537-2/handin/NAME, where NAME is your login name. For example, if your login is lab, your handin directory is ~cs537-2/handin/lab. Your handin directory has five subdirectories: p1, p2, p3, p4, and p5. For this assignment, use directory p1. Copy all your .java source files and your README into your handin directory. Do not submit any .class files. After the deadline for this project, you will be prevented from making any changes in this directory. Remember: No late projects will be accepted!
Hand in your source code and a short README that documents your program. Your README should have two sections:
Even though you will not be heavily graded on style for this assignment,
you should still follow all the principles of software engineering you
learned in CS 302 and CS 367, such as top-down design, good indentation,
meaningful variable names, modularity, and helpful comments. After all,
you should be following these principles for your own benefit.
Suggestions
Note that additional hints and explanations may be found on the FAQ (Frequently Asked Questions) page for this
project. You should get in the habit of checking this page daily, as new
hints and clarifications may be added at any time.
This project is not meant to be difficult; your finished program will probably be under 200 lines, including comments. If you find that you are writing a lot of code, it probably means that you are doing something wrong and should take a break from hacking and instead think about what you are trying to do.