Project 1a: Sorting within a Range

You will write a simple sorting program. This program should be invoked as follows:

shell% ./rangesort -i inputfile -o outputfile -l lowvalue -h highvalue

The above line means the user typed in the name of the sorting program ./rangesort and gave it four inputs:

  • an input file to sort called inputfile
  • an output file called outputfile to put the sorted results that fall between lowvalue and highvalue, inclusive
  • the low value, below which keys and values are discarded
  • the high value, above which keys and values are discarded

Input files are generated by a program we give you called generate.c (good name, huh?).

After running generate, you will have a file that needs to be sorted. It will be filled with binary data, of the following form: a series of 100-byte records, the first four bytes of which are an unsigned integer key, and the remaining 96 bytes of which are integers that form the rest of the record. Something like this (where each letter represents two bytes):

kkRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
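One way to picture that layout in C is sketched below. The names record_t, RECORD_BYTES, and PAYLOAD_WORDS are made up for illustration, and the sketch assumes the 96 payload bytes are 4-byte integers; the header file sort.h has the real definition, so use that in your program.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; sort.h holds the real layout. */
#define RECORD_BYTES  100
#define PAYLOAD_WORDS 24   /* assumes 96 bytes / 4 bytes per integer */

typedef struct {
    uint32_t key;                     /* 4-byte unsigned sort key */
    uint32_t payload[PAYLOAD_WORDS];  /* remaining 96 bytes of the record */
} record_t;

/* Refuse to compile if the in-memory layout differs from the file format. */
_Static_assert(sizeof(record_t) == RECORD_BYTES, "record_t must be 100 bytes");
_Static_assert(offsetof(record_t, key) == 0, "key must be the first field");
```

Because every field is 4 bytes wide, the compiler has no reason to insert padding, which is why a struct like this can be read from and written to the file directly.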

Your goal: to build a sorting program called rangesort that takes in one of these generated files and sorts it based on the 4-byte key (the remainder of the record should of course be kept with the same key). Keys which fall outside of lowvalue and highvalue should be discarded (and not written to the output file). The output is written to the specified output file.

Some Details

Using generate is easy. First you compile it as follows:

shell% gcc -o generate generate.c -Wall -Werror

Note: you will also need the header file sort.h to compile this program.

Then you run it:

shell% ./generate -s 0 -n 100 -o /tmp/outfile

There are three flags to generate. The -s flag specifies a random number seed; this allows you to generate different files to test your sort on. The -n flag determines how many records to write to the output file, each of size 100 bytes. Finally, the -o flag determines the output file, which will be the input file for your sort.

The format of the file generated by the generate.c program is very simple: it is in binary form, and consists of those 100-byte records as described above. A common header file sort.h has the detailed description.

Another useful tool is dump.c. This program can be used to dump the contents of a file generated by generate or by your sorting program.

Hints

In your sorting program, you should just use open(), read(), write(), and close() to access files. See the code in generate or dump for examples.
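One thing to watch for: read() and write() may transfer fewer bytes than you asked for. A minimal sketch of wrappers that loop until everything is transferred follows; read_full and write_full are hypothetical helper names, not part of the provided code.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to n bytes, stopping early only at end-of-file or on error.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t n) {
    char *p = buf;
    size_t done = 0;
    while (done < n) {
        ssize_t r = read(fd, p + done, n - done);
        if (r < 0) {
            if (errno == EINTR) continue;  /* interrupted: retry */
            return -1;
        }
        if (r == 0) break;                 /* end of file */
        done += (size_t)r;
    }
    return (ssize_t)done;
}

/* Write exactly n bytes. Returns 0 on success, -1 on error. */
int write_full(int fd, const void *buf, size_t n) {
    const char *p = buf;
    size_t done = 0;
    while (done < n) {
        ssize_t w = write(fd, p + done, n - done);
        if (w < 0) {
            if (errno == EINTR) continue;  /* interrupted: retry */
            return -1;
        }
        done += (size_t)w;
    }
    return 0;
}
```

Reading or writing the whole data set in a few large calls like this is usually much faster than one call per record.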

If you want to figure out how big the input file is before reading it in, use the stat() or fstat() calls.
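For example, here is a small sketch that turns the file size into a record count. The helper name count_records is hypothetical, and treating a size that is not a multiple of 100 as an error is a choice made for this sketch, not something the assignment requires.

```c
#include <sys/types.h>
#include <sys/stat.h>

#define RECORD_BYTES 100

/* Returns the number of whole records in the open file, or -1 if fstat()
 * fails or the size is not a multiple of the record size (sketch's choice). */
long count_records(int fd) {
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size % RECORD_BYTES != 0)
        return -1;
    return (long)(st.st_size / RECORD_BYTES);
}
```

Knowing the count up front lets you allocate one buffer of the right size instead of growing it as you read.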

To sort the data, use any old sort that you'd like to use. An easy way to go is to use the library routine qsort().
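A minimal sketch of calling qsort() on an array of records follows. The record_t type and the function names are hypothetical stand-ins for whatever sort.h and your own code define. Note that the comparator compares the keys rather than subtracting them, since subtracting unsigned 32-bit keys can wrap around and give the wrong sign.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record type for illustration; see sort.h for the real one. */
typedef struct {
    uint32_t key;
    uint32_t payload[24];
} record_t;

/* qsort() comparator: orders records by ascending key.
 * Returns -1, 0, or 1 without any risk of overflow. */
static int compare_keys(const void *a, const void *b) {
    uint32_t ka = ((const record_t *)a)->key;
    uint32_t kb = ((const record_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Sort n records in place; each record moves as a unit with its key. */
void sort_records(record_t *recs, size_t n) {
    qsort(recs, n, sizeof recs[0], compare_keys);
}
```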

To exit, call exit() with a single argument. This argument to exit() is then available to the user to see if the program returned an error (i.e., returned 1 by calling exit(1)) or exited cleanly (i.e., returned 0 by calling exit(0)).

The routine malloc() is useful for memory allocation. Make sure to exit cleanly if malloc fails!

If you don't know how to use these functions, use the man pages. For example, typing man qsort at the command line will give you a lot of information on how to use the library sorting routine.

You will need to decide when you want to discard keys that fall outside of the specified range (i.e., discard a key if its value < lowvalue or its value > highvalue). You could either discard these keys one-by-one before you sort them, or you could discard two potentially large sequences of keys after you've sorted them. This choice is completely up to you and should only impact performance, not correctness.

Remember to keep keys that exactly match lowvalue or highvalue (i.e., the range for selected keys is inclusive). Also remember that the specified range might include zero, one, or all of the keys in the original input set. Be sure that you handle the case where lowvalue is equal to highvalue (i.e., the sort should output all keys, with their records, that are exactly equal to that value).
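As one illustration of the discard-before-sorting option, the sketch below compacts the array in place and keeps only the records whose keys fall in the inclusive range. The record_t type and the name filter_range are hypothetical; discarding after the sort is just as valid.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type for illustration; see sort.h for the real one. */
typedef struct {
    uint32_t key;
    uint32_t payload[24];
} record_t;

/* Keep only records with low <= key <= high, preserving their order.
 * Survivors are moved to the front of the array; returns how many remain. */
size_t filter_range(record_t *recs, size_t n, uint32_t low, uint32_t high) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (recs[i].key >= low && recs[i].key <= high) {
            if (kept != i)
                recs[kept] = recs[i];  /* copy the whole 100-byte record */
            kept++;
        }
    }
    return kept;
}
```

The same function handles the edge cases above: it returns 0 when no key is selected, n when every key is selected, and works unchanged when low equals high.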

Assumptions and Errors

32-bit integer range. You may assume that the keys are unsigned 32-bit integers.

File length: May be pretty long! However, there is no need to implement a fancy two-pass sort or anything like that; the data set will fit into memory.

Selected records: Depending upon the input set and the specified low and high values, the sort may end up sorting zero, one, a few, many, or all of the keys in the original input data set. Do not make any assumptions about the number of keys that might fall within the low and high values (inclusive).

Invalid files: If the user specifies an input or output file that you cannot open (for whatever reason), the sort should EXACTLY print: Error: Cannot open file foo (with no extra spaces, and assuming the file was named foo) and then exit.
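A small sketch of an open() wrapper that prints this message is shown below. The name open_or_report is hypothetical, and ending the message with a newline is an assumption based on the fprintf() example in this handout.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Open path with the given flags and mode. On failure, print the required
 * message to stderr and return -1 so the caller can call exit(1). */
int open_or_report(const char *path, int flags, mode_t mode) {
    int fd = open(path, flags, mode);
    if (fd < 0)
        fprintf(stderr, "Error: Cannot open file %s\n", path);
    return fd;
}
```

An input file would typically be opened with O_RDONLY, and an output file with something like O_WRONLY | O_CREAT | O_TRUNC plus a permission mode such as S_IRUSR | S_IWUSR.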

Invalid range: If the user specifies a high value that is less than the low value (or specifies values that are not unsigned 32-bit integers), the sort should EXACTLY print: Error: Invalid range value with no extra spaces and then exit.
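Checking the range values takes a little care, because strtoul() quietly accepts input such as "-1". The sketch below is one way to do it; parse_u32 and parse_range are hypothetical names, and rejecting a leading "+" or leading whitespace is a strictness choice made for this sketch.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse s as a decimal unsigned 32-bit integer.
 * Returns 0 and stores the value in *out on success, -1 otherwise. */
int parse_u32(const char *s, uint32_t *out) {
    if (s == NULL || !isdigit((unsigned char)s[0]))
        return -1;                 /* rejects "", "-1", "+5", " 7", "abc" */
    errno = 0;
    char *end;
    unsigned long v = strtoul(s, &end, 10);
    if (errno != 0 || *end != '\0' || v > UINT32_MAX)
        return -1;                 /* overflow or trailing junk like "12abc" */
    *out = (uint32_t)v;
    return 0;
}

/* Returns 0 when both strings parse and low <= high, -1 otherwise. */
int parse_range(const char *low_s, const char *high_s,
                uint32_t *low, uint32_t *high) {
    if (parse_u32(low_s, low) != 0 || parse_u32(high_s, high) != 0)
        return -1;
    return (*low <= *high) ? 0 : -1;
}
```

When parse_range() returns -1, the caller would print Error: Invalid range value to stderr and exit.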

Too few or many arguments passed to program: If the user runs rangesort without enough arguments, or in some other way passes incorrect flags and such to rangesort, print Usage: rangesort -i inputfile -o outputfile -l lowvalue -h highvalue and exit.

Important: On any error, you should print the error message using fprintf(), sending it to stderr (standard error) and not stdout (standard output). This is accomplished in your C code as follows:

fprintf(stderr, "whatever the error message is\n");

History and an Optional Performance Contest

This sorting assignment derives from a yearly competition to make the fastest disk-to-disk sort in the world. See the sort home page for details. If you look closely, you will see that your professor was once -- yes, wait for it -- the fastest sorter in the world.

To continue in this tradition, we will also be holding a sorting competition. The range sort that you need to implement for this assignment is a little trickier than the straight-forward sort, since you may need to discard keys (and need to decide if you should discard keys before or after the actual sort). For the contest, we will average together your program's runtime on three input sets:

  • 1,000,000 input keys: approximately 20% of input keys selected by the range
  • 1,000,000 input keys: approximately 50% of input keys selected by the range
  • 1,000,000 input keys: approximately 80% of input keys selected by the range

Whoever wins the performance contest will win a soft-cover copy of the OSTEP text book.

Read more about sorting, including perhaps the NOW-Sort paper, for some hints on how to make a sort run really fast. Or just use your common sense! Hint: you'll have to think a bit about hardware caches.

Note: Your grade will not be impacted by the performance of your sorting program. Your grade will depend only on the correctness of your program. We will only measure the performance of assignments that sort correctly for all cases!

General Advice

Start small, and get things working incrementally. For example, first get a program that simply reads in the input file, one record at a time, and prints out what it reads in. Then, slowly add features and test them as you go. Don't worry about performance until you have all of the functionality working correctly.

Testing is critical. Testing your code to make sure it works is crucial. Write tests to see if your code handles all the cases you think it should. Be as comprehensive as you can be. Of course, when grading your projects, we will be. Thus, it is better if you find your bugs first, before we do.

Keep old versions around. Keep copies of older versions of your program around, as you may introduce bugs and not be able to easily undo them. A simple way to do this is to explicitly make copies of the file at various points during development. For example, let's say you get a simple version of rangesort.c working (say, that just reads in the file); type cp rangesort.c rangesort.v1.c to make a copy into the file rangesort.v1.c. More sophisticated developers use version control systems like CVS (old days) or Mercurial or Git (modern times), but we'll not get into that here (though you can, and perhaps should!).

Keep your source code in a private directory. An easy way to do this is to log into your account and first change directories into private/ and then make a directory therein (say p1, by typing mkdir p1 after you've typed cd private/ to change into the private directory). You can always check who can read the contents of your AFS directory by using the fs command. For example, by typing in fs listacl . you will see who can access files in your current directory. If you see that system:anyuser can read (r) files, your directory contents are readable by anybody. To fix this, you would type fs setacl . system:anyuser "" in the directory you wish to make private. The dot "." referred to in both of these examples is just shorthand for the current working directory.