Database Management Systems

by Raghu Ramakrishnan and Johannes Gehrke

Assignment 1: Data-Oriented C++ Programming

(Due Date: Thursday 9/15/94)


Introduction

The purpose of this assignment is to get you acquainted with C++ and Unix in order to prepare you for the course project. Specifically, it will illustrate classes, inheritance, abstract base classes, (pure) virtual methods, pass-by-reference, (operator) overloading, the iostream library, new/delete and other C++ features. If you are not experienced in C++ coding, this assignment should get you acquainted with some useful techniques, and more importantly, give you some indication of how difficult (or easy!) it is going to be for you to learn C++. Note that I will not teach C++; you are expected to know it or to pick it up on your own.

What You Have to Do

The goal of this assignment is to write code for reading in sorted files of fixed-length records and merging them. To keep this assignment simple, we will assume that all records consist of one string (with a defined maximum length) and one integer. Input data files will be in binary format, and the string will occur first, followed by the integer. It should be noted that strings will have to fit in the given maximum length inclusive of the null terminating character.

Step 1:

Define a class "RecSpec" which represents the specifications of a type of record. Note that RecSpec does not represent an actual record. The constructor of RecSpec accepts two arguments: the maximum length for strings, and an indication of which field (integer or string) will be used for sorting criterias.

A (partial) class definition of RecSpec follows:


enum BOOL { FALSE, TRUE};

class RecSpec {
public:

    enum field_type { INT, STR };

    RecSpec(unsigned int maxstrlen, field_type sortfield);

    BOOL operator==(const RecSpec &rs) const;
    int compare(const void *rec1, const void *rec2) const;

    BOOL print_rec(const void *rec) const;
    BOOL get_from_file(ifstream &is, void *buf) const;

    unsigned int rec_size() const;

private:

    const unsigned int  max_str_len;
    const field_type    sort_field;
};

When an object of type RecSpec is declared, the private fields are initialized by the constructor in the obvious way. Again, note that the RecSpec object has no field to actually store a record; it simply keeps some descriptive information about (a collection of) records, and provides some useful methods to manipulate such records. field_type is used to specify (to the constructor) which field the records will be sorted on. operator== is used to verify that two RecSpec objects are the same, allowing their associated records to be merged (see step 2). Two RecSpec objects are defined to be equal if they both use the same field for sorting, and if they have the same maximum string length. compare compares the sort fields of the two given records. It returns a negative value is rec1 < rec2, 0 if rec1 == rec2, and a positive value if rec1 > rec2. print_rec "nice-prints" a record to the standard output (see cout). get_from_file reads a record from an input file into a buffer. It does not manipulate the data in any shape or form. You should use the fstream classes to do this (see section 3 for more details). rec_size returns the size of the records associated with this RecSpec (you might want to think about making this an inline method).

Step 2:

Define a class "Seq" (sequence) and two subclasses "FileSeq" and "MergeSeq". The parent class (Seq) is only used to provide an abstract interface to both its derived classes (FileSeq and MergeSeq). It cannot be instanciated (see below). Sequences are basically pipes through which records are passed up, one at a time. FileSeq reads records from a file and passes them up, sorted, to its user (You can assume that the records in a data file are already sorted). MergeSeq merge-sorts records from its two Seq inputs. Note that each input for one MergeSeq object can be either FileSeq or MergeSeq, and that a tree of Seqs can be built to merge an unlimited number of input files.

A (partial) class definition for Seq is shown below:


class Seq {
public:

    Seq(const RecSpec *recinfo);
    ~Seq();

    virtual BOOL get_next_rec(void *&recptr) = 0;

    inline const RecSpec *get_rec_spec() const { return rec_info; }


protected:

    const RecSpec   *rec_info;
};

The constructor for Seq should initialize rec_info. get_rec_spec should return a const pointer to the RecSpec. get_next_rec attaches the pointer argument to a buffer containing the next record. It returns TRUE on success and FALSE on error. If there are no more records to be passed, get_next_rec returns TRUE, but assigns the pointer argument to NULL. You should check recursively down a Seq tree that all RecSpecs are in fact equal before get_next_rec can be executed (see MergeSeq for more information on this).
Since Seq contains ``pure'' virtual functions, note that it is illegal to declare objects of class Seq. You can only declare objects of some subclass of Seq.

A (partial) class definition for FileSeq is shown below:


class FileSeq   : public Seq {
public:
    
    FileSeq(const char *name, const RecSpec *recinfo);
    ~FileSeq();

    virtual BOOL get_next_rec(void *&recptr);

private:
    
    ifstream        in_file;
    char            *buffer;
};

The constructor for FileSeq should open a file for reading and allocate a buffer large enough for the record sizes it will have to handle.

The destructor will close the file and free the buffer.

get_next_rec should read the next record into the buffer and return it as was discussed for Seq.

A (partial) class definition for MergeSeq is shown below:


class MergeSeq  : public Seq {
public:
    MergeSeq(Seq *s1, Seq *s2);
    ~MergeSeq();

    virtual BOOL get_next_rec(void *&recptr);

private:

    Seq     *seq1;
    Seq     *seq2;
};

The MergeSeq class constructor accepts two Seq objects as arguments. It should first check that the sequences are mergeable by comparing their two RecSpecs (using the == operator). If so, a MergeSeq object should be created with the RecSpec object returned from the get_rec_spec method on one of the input sequences passed as the argument to the Seq constructor.

The destructor should recursively delete its children. This way, a Seq tree can be deleted simply by calling delete on its root.

get_next_rec should compare the two current records, using the RecSpec::compare method, and return the lesser. Some sort of a flag will be needed to remember which input's current record has not yet been used (the greater record from the last call).

Empty sequences and end-of-sequence should be handled along the lines indicated in FileSeq.


Points to Note

It can be assumed that at any time, at most one FileSeq object refers to a given file, and that at most one MergeSeq object (directly) refers to a given FileSeq object.

Use the fstream classes for I/O operations. (To do so, you must #include both iostream.h and fstream.h in your header file.)

Both FileSeq and MergeSeq objects should operate in ``lazy manner'', e.g., the input file should be read one record at a time in response to successive calls to get_next_rec on a FileSeq object.

If you want to modify any of the class definitions shown above by adding/removing/modifying private and/or protected members, feel free to do so. But you must keep the public interfaces as they have been described above,

You should use C++'s memory management (new and delete). Do not use malloc and free! Be sure to study how to use new and delete for arrays.


A Word on Documentation, Error Checking and Style

As will be the case all semester in CS 564, good documentation is expected. We will expect to see both external and internal (i.e., inline) documentation. A good rule of thumb is to comment your programs as if you were going to hand them to another programmer to finish, someone who has absolutely no idea what you were doing when you wrote it (and cannot talk to you to find out).

You are expected to check for error conditions, and ensure that your code including all the public methods above terminates gracefully on all inputs, with meaningful error msgs. While some error conditions have been explicitly discussed, e.g., get_next_rec on an empty file, others have not, e.g., what happens when RecSpec::get_from_file is called on an empty file. So think carefully about potential errors, and write defensive code.


Program Grading Criteria

For this and future programming assignments, working code alone will not guarantee a good grade. Documentation, error checking, and programming style will also be taken into consideration when programs are graded. The relative importance of these categories is as follows:
  • Works as specified: 65%
  • Good documentation: 15%
  • Good coding style: 10%
  • Good error checking: 10%

What to Turn In

The TAs will put information on-line about what to do for the final testing and running of your code. You should turn in copies of your code together with copies of the output produced by running what they ask you to run.

Finding the Files

Files for CS 564 will be placed in the directory /u/course/cs564-1. Testing information, constant definitions, routine templates, and test data files will all be placed in a subdirectory of this directory called assign1. You should simply make yourself a copy of each of the files that you find there. In addition, remember to send mail to the TAs (to "cs564-1") so that they can add you to their CS 564 student mailing list. Otherwise, you are likely to miss out on useful information about this and future CS 564 programming assignments. Be sure to check your mail regularly.

Due Date

Assignment #1 is due on Thursday, September 15, at 5:00 PM. Note that the late policy for CS 564 implies that (i) lateness will lead to a loss of 5 percentage points per day and (ii) no late assignments will be accepted at all after 5:00 PM on Sunday, September 18. The TAs will make an on-line solution available the following Monday for those who wish to look at it.