CS367 Introduction to Data Structures- Summer 2006

A4 Sorting!

Announcements and Clarifications

7/13 - Due data changed to Fri 7/14 @ 5pm

Brief Description

Your task in this assignment to is implement several different types of sorts and to experiment with sorting several different data sets.

Goals and Requirements

Implement a list ADT
Implement bubble sort
Implement insertion sort
Implement mergesort
Implement quicksort
Implement binary search

Description

Your first goal is to implement a list ADT, you may implement this data structure however you choose, but consider the running time of various operations that are necessary for sorting.

You should implement two versions of bubble sort, one that uses the list ADT you implemented and one that uses Java's ArrayList class. The reminder of your sort should all use the same data structure to store the list of things to sort. You must also implements two versions of quick sort: the first is the heuristic where you choose the first element as the pivot and the second should choose the median value as the pivot using the median finding algorithm. You also need to implement a mode that reads in the data to sort but does not perform any operations.

Your sorts should always order the elements in increasing lexicographic order.

Implementation

All of the user input will be as command line arguments:

java Sorts mode_number input_file output_file amount_of_data ...

The mode_number indicates which operation will be performed:

Bubble sort using ArrayList
Bubble sort using your list class
Insertion sort
Merge sort
Quick sort picking first element as pivot
Quick sort using the median method to choose pivot
Sort using the Arrays.sort() method
No-op, reads in data but does no sorting
Binary search

For all modes input_file indicates the file to read the data from. For the binary search mode, assume that the input file contains data already in sorted order.

For all modes output_file indicates the name of the file to output the results to. For the sorts, the only output to the file will be a list of the data in sorted order, one piece of data per line. For the binary search mode, the only output should be the pivot element in each step of the algorithm on its own line, if item is found the last line should be that item otherwise the last line should be "NOT FOUND".

For all modes amount_of_data indicates the number tokens to read from the data file. Use Scanner.next() to read in the tokens.

For the binary search mode there will be another command line argument that specifics the token to search for. Your implementation of the binary search should be recursive.

No-op mode will read in the data from input_file, place it in the ArrayList and then output the data to output_file in the one token per line format.

If there are any I/O problems with the input file or the output file indicate that error on standard error and exit the program gracefully. If there are fewer tokens in the input file than are requested to be read in, indicate an error and exit the program.

Data sets

We have several data sets for you to use. Consider all of the data tokens as Strings when sorting. Use the standard Scanner.next() to tokenize the data with the default separators.

Shakespeare's Hamlet (source) 31999 tokens
First 1 million digits of pi (source) 1000000 tokens
First 100k numbers increasing order - 100000 tokens
First 100k numbers decreasing order - 100000 tokens

The data sets are human readable, look through them before you start coding.

Questions to Answer

At the end of your README.txt file include answers to the following questions:

For each of the data sets run all modes except binary search using amount_of_data in increments of 1500 up to 15000 measuring the running time using the method described in assignment 3, the hints section in this assignment describes a way to automate this process. You may choose to measure with larger data sets, but be aware that for some of the algorithms this will take a long time. For each of the data sets plot the results and submit an electronic copy of the plot (.pdf, .ps or .xls). Analyze and comment on the plots. Include estimates of the what the experiment tells about the asymptotic running times of the sorts for each of the data sets.
Why did we ask you to implement a mode that reads in the file and outputs it but does not sort the data? (update your plots to account for this)
Often the known structure of data can help you improve sort performance. For the first two data sets name two distinct ways to improve sort performance based on what you know about the data.
Comment on the difference between the results in the third and fourth data sets.
Compare the performance of bubble sort on your list and ArrayList, were there differences? What do you think accounts for these differences?
Compare the two versions of quick sort on all test sets.
Which algorithm do you think Arrays.sort() uses to sort data? Why?
If you had used LinkedList instead of ArrayList how would your results have changed?

Commenting and Style

Your program should be written in a style that makes it easy to read and understand.
At the beginning of each .java file you should include a description of the class and how it interacts with the other parts of your program.
You are not required to use javadoc style comments.
If your code is doing something complex or non-standard please comment that portion heavily.

Handin

Please hand all necessary files into your handin directory in a subdirectory named Sorts. Your application class should be called Sorts.java. There should be exactly one class declared within each .java file. If your program does anything strange (bugs), awesome(extra features) or has a non-intuitive interface please include a file called README.txt which explain them. If there are bugs in your program but you do not describe them in your README you will lose more credit than if you had described them.

Hints

Develop incrementally.
Remember to design your list data structure so that operations important to sorting are fast: getting items from an index, swapping items, comparing items...
Test your sorts on small sets of your own data so that you verify correctness by hand.
The Linux command diff allows you to determine if two files differ. To use:
```
% diff file1 file2
```
Automating your measurements - It's a pain to run each of the test cases individually and record the results. There's a way to make it easier. First we need to use a more versatile time command. There is another time command that allows you to choose the format of the output:
```
 %/usr/bin/time -f "%U" java Sorts pi.txt pi_out.txt 15000 
```
This will change it so that only the value we are looking for (the time elapsed while running our program) is displayed. This is actually the first number that the time command we discussed in assignment 3 produces, however, it is more representative of tests which are shorter in duration. It does not match the wall clock time it takes for the test to run however.
We can make things even easier by writing something called a shell script, it will allow us to run a batch of programs in sequence. We provide a script to automate the tests you must run, you can find it here.
To use the script first copy it to your working directory, then run the command:
```
%chmod u+x test.sh
```
This will allows the shell to execute the our script (we only have to do this once). To run the script use:
```
%test.sh mode_number input_file
```
This will produce the timing for running your program with the specified mode on the specified input file for 0, 1500, 3000, ... , 15000 tokens. Open the script in an editor. Each line of the script is run in order as though you had typed it into the command line. The $1 and $2 are like variables that represent the command line arguments to the script. In our case $1 represents the mode number and $2 represents the name of the input file. The first line of the script tells the operating system what language our script is written in.
Try writing another shell script to run all the tests for one data set (first make a new file and use the chmod command on that file). There is one additional command that might help you. It is called echo, it works exactly like println in Java. To print Hello world to the console you would use echo "Hello world" in your script.
Running any of the sorts on any of the data sets should take no more than two minutes on a lab machine (in fact it should take much less time), if it takes longer you're probably doing something very inefficiently, look at your data structures and analyze the running time of your methods.

Introduction to Data Structures
CS367 - Summer 2006

A4 Sorting!

Announcements and Clarifications

Brief Description

Goals and Requirements

Description

Implementation

Data sets

Questions to Answer

Commenting and Style

Handin

Hints

Sections

Assignments

Links

Introduction to Data Structures CS367 - Summer 2006

A4 Sorting!

Announcements and Clarifications

Brief Description

Goals and Requirements

Description

Implementation

Data sets

Questions to Answer

Commenting and Style

Handin

Hints

Sections

Assignments

Links

Introduction to Data Structures
CS367 - Summer 2006