CS564, Fall08 - Homework 5: Implementing Join Algorithms

CS564 Fall 2008
Project 5: Implementing Join Algorithms
Assigned: Friday November 14^thDue: Thursday November 25^th at midnight

Introduction

In this assignment, you are expected to implement two join algorithms and for extra credt evaluate their relative performance. You will carry out this assignment in teams of two with the same partner as for the previous assignments.

In this project, we are giving you more modules of the Minirel DBMS. You will have the following modules already implemented:

the default buffer manager (which for simplicity is not using love/hate method)
a simple disk space manager with heap files
b+ tree index implementation

Since this version of Minirel that we are releasing for you is more complete and complicated than previous versions, some of the interfaces that you dealt with in previous two assignemnts have changed. We suggest that you take a look and study the following interfaces before implementing the project:

disk space manager as implemented in the DB class (db.h)
heap file layer for the disk space manager implemented using HFPage, HeapFile and Scan classes (hfpage.h, heapfile.h, and scan.h)
buffer manager implemented using BufMgr, Replacer classes (buf.h)
b+ tree index implementation. you are not going to need this module for the most part, except the Sort class (sort.h) which you will need to implement sort-merge join operator

Background

In Minibase, a relation is implemented as a heap file, which is a collection of records. Records can be inserted or deleted from a heap file, and each record is uniquely identified by a record id. A scan is the interface used to access records in a heap file, one by one.

An index provides fast access to records in the heap file, and currently Minibase only supports B+-indices. Each entry in the index is a (key, record id) pair. Entries can be inserted or deleted from an index. An index search scan provides interface for accessing the records in the index. You will not need the index access method, unless you are implementing index nested loop join.

You should begin by reading the chapter on Implementation of Relational Operations, in particular, the section on Block Nested Loop Join and Sort-Merge Join

What You Have to Implement

You have to finish the implementation of BlockJoin and SMJoin classes to get the full credit. Also for extra credit you will need to generate a report (in doc or txt or pdf format) that evaluates the relative performance of the two join methods you implemented. In particular you should collect and report the times taken for each algorithm to run, the number of page misses, and the effect of the buffer pool size on the two algorithms by changing the buffer pool size (BUF_POOL_SIZE defined in the tester class).

In SMJoin and BlockJoin classes, the constructor joins two relations R and S, represented by the heapfiles filename1 and filename2, respectively, using the sort-merge/block nested loop join algorithm. Note that the columns for relation R(S) are numbered from 0 to len_in1 - 1 (len_in2 - 1). You are to concatenate each matching pair of records and write it into the heapfile filename3. The error layer for these classes is JOINS as defined in new_error.h

In order to implement the Sort-Merge join, you will need to use the following classes which are given to you: Sort, HeapFile, and Scan. You will call the Sort constructor to sort the input heapfiles (which means your primary responsibility will be to implement the merging phase of the algorithm). To compare the join columns of two tuples, you will call the function tupleCmp (declared in sort.h). Once a scan is opened on a heapfile, the scan cursor can be positioned to any record within the heapfile calling the Scan method position with an RID argument. The next call to the Scan method getNext will proceed from the new cursor position.

To implement the block nested loop join method, you need to study the algorithm for this method in your textbook. Since Minibase does not provide page-by-page access into a relation stored in a heap file, you will simulate a "block" with an array storing the records. The join function will take as a parameter, an integer, block_size, of the outer relation. To simulate a block of pages, you need to copy block_size number of tuples each time and save them in an array, then use that array as a block of tuples. You are not required to implement the double buffering techniques or use hash tables.

Statistics collection (for extra credit)

Augment the buffer manager class given to you to collect statistics. Specifically, you should be able to tell how many PinPage requests have been made and how many of those miss. It is suggested that you write two functions to reset and report the statistics.

You can use the clock/time functions to determine the running time of your join algorithm. Refer to the C++ help for more details on time. A simple example can be found in main.C on how to use clock function.

You should let your algorithm run for a few times (the longer the better) and report the average running time. Avoid running other CPU or I/O intensive processes (Email client, browser, etc.) while collecting statistics since you are reports actual time rather than CPU time. Print out average running times and the number of misses for every join algorithm you implement. You may need to change the test files and create larger relations with many tuples to be able to analyze the difference between the various join algorithms.

In your report, you need to determine whether it is beneficial to sort the file for the purpose of performing a single join operation by comparing the cost of this join with that of other join method(s).

Important: It is strongly recommended that you finish the implementation of this homework at least 3-4 days before the due date. It will take you a few days to be able to run good experiments and generate a good report. Obviously this is only for people who want to get extra credits by generating a good analysis report document.

Source Code

Copy all the files from ~cs564-1/public/joins and start by reading the interface of various modules that you'll need to use in your implementation!

Finishing

Hand in your source code (including all the source files and Makefile). We have created a directory ~cs564-1/handin/NAME/homework5, where NAME is your login name. For example, if your login is jimmyv, your handin directory is ~cs564-1/handin/jimmyv/homework4. Copy all of your .c and .h source files into the appropriate subdirectory. Do not submit any .o files. Make sure that your code runs correctly on the linux machines in the 13XX labs.

Grading criteria

Your assignments will be graded according to the following criteria:

Correctness (80%): You will get full marks for a correct implementation. Partial credit will be to partially correct answers.
Coding (20%): You are expected to write neat code and comment it properly.
Documentation (30% extra credit): Your documentation should contains results of your experiments in table and graph format. You should include a brief analysis, explaining the results that you have submitted. (You should not just reproduce the formulas in the textbook.) Please call your document hw5-report.doc or hw5-report.pdf.

CS564 Fall 2008 Project 5: Implementing Join Algorithms Assigned: Friday November 14th Due: Thursday November 25th at midnight