Homework Assignment #2
Due Sunday, 02/19, 11:59 PM
No submissions will be accepted after 02/24.
Part 1
For this part of the homework, you are to write a program that implements both Naive Bayes and TAN (tree-augmented Naive Bayes).
Input
Your program should read files that are in the ARFF format.
In this format,
- Each instance is described on a single line.
- The feature values are separated by commas, and the last value on each line is the
class label of the instance.
- Each ARFF file starts with a header section describing the features and the class labels.
- Lines starting with '%' are comments.
See the link above for a brief but more detailed description of the ARFF format.
Your program needs to handle only discrete attributes, and simple ARFF files (i.e. don't worry about sparse ARFF files and instance weights).
Example ARFF files are provided below.
Your program can assume that the class attribute is named 'class' and it is the last attribute listed in
the header section.
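For illustration only, here is a minimal sketch of a reader for such simple, dense ARFF files. It assumes Python, discrete (nominal) attributes, and a hypothetical function name load_arff; as noted in the submission instructions, you may instead use a third-party library such as WEKA's ARFF parser for this step.

    # Minimal sketch of a simple (dense, discrete-attribute) ARFF reader.
    # Illustrative only: load_arff is a hypothetical name, and no error
    # handling is shown.
    def load_arff(path):
        """Return (attributes, instances): attributes is a list of
        (name, list_of_values) pairs in header order; instances is a list
        of value lists, with the class label last in each list."""
        attributes, instances, in_data = [], [], False
        with open(path) as f:
            for raw in f:
                line = raw.strip()
                if not line or line.startswith('%'):   # blank lines and comments
                    continue
                low = line.lower()
                if low.startswith('@attribute'):
                    # e.g.  @attribute 'outlook' { sunny, overcast, rainy }
                    name = line.split(None, 2)[1].strip("'\"")
                    values = [v.strip().strip("'\"")
                              for v in line[line.index('{') + 1:line.rindex('}')].split(',')]
                    attributes.append((name, values))
                elif low.startswith('@data'):
                    in_data = True
                elif in_data:                          # one instance per line
                    instances.append([v.strip().strip("'\"") for v in line.split(',')])
        return attributes, instances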
Program specifications
Specifically, for this assignment, you should assume:
- Your code is intended for binary classification problems.
- All of the attributes are discrete valued.
- Your program should be able to handle a variable number of attributes with possibly different numbers of values for each attribute.
- You should use Laplace estimates (pseudocounts of 1) when estimating all probabilities (see the sketch after this list).
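For example, with pseudocounts of 1, the estimate of P(X = x | class = y) is (number of class-y training instances with X = x, plus 1) divided by (number of class-y training instances, plus the number of values X can take); class priors are smoothed the same way. A minimal sketch, with hypothetical argument names:

    # Laplace estimate with pseudocounts of 1 (illustrative sketch only;
    # the argument names are hypothetical).
    def laplace_estimate(matching_count, total_count, num_values):
        """(matching_count + 1) / (total_count + num_values)"""
        return (matching_count + 1.0) / (total_count + num_values)

    # Example: if 3 of 10 class-y training instances have X = x and X can
    # take 4 values, the estimate is (3 + 1) / (10 + 4) = 4/14 ~ 0.2857.
    print(laplace_estimate(3, 10, 4))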
For the TAN algorithm, your program should:
- Use Prim's algorithm to find a maximal spanning tree (but choose maximal weight edges instead of minimal weight ones); see the sketch after this list.
- Initialize this process by choosing the first attribute in the input file for Vnew.
- If there are ties in selecting maximum weight edges, use the following preference criteria:
- Prefer edges emanating from attributes listed earlier in the input file.
- If there are multiple maximal weight edges emanating from the first such attribute, prefer edges going to attributes listed earlier in the input file.
- To root the maximal weight spanning tree, pick the first attribute in the input file as the root.
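The sketch below illustrates one way to implement this maximum-weight variant of Prim's algorithm together with the tie-breaking and rooting rules above. It is only an illustration under assumptions: Python, a hypothetical function name find_max_spanning_tree, and a precomputed symmetric table weights[i][j] of edge weights between attributes i and j (indexed in the order the attributes appear in the input file); in TAN these weights are typically the conditional mutual information between attribute pairs given the class.

    # Sketch of Prim's algorithm for a MAXIMUM-weight spanning tree with the
    # tie-breaking rules above (illustrative only; hypothetical names).
    def find_max_spanning_tree(weights, n_attributes):
        """Return a list of directed (parent, child) edges over attribute
        indices 0..n_attributes-1, rooted at attribute 0."""
        in_tree = {0}                    # Vnew starts as the first attribute
        edges = []
        while len(in_tree) < n_attributes:
            best = None                  # (weight, from_index, to_index)
            for i in sorted(in_tree):            # earlier source attributes first
                for j in range(n_attributes):    # earlier target attributes first
                    if j in in_tree:
                        continue
                    # Strict '>' keeps the earliest (i, j) pair on ties, which
                    # matches the two preference criteria above.
                    if best is None or weights[i][j] > best[0]:
                        best = (weights[i][j], i, j)
            _, i, j = best
            edges.append((i, j))         # i is already connected to attribute 0,
            in_tree.add(j)               # so (i -> j) is directed away from the root
        return edges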
The program should be called bayes and should accept three command-line arguments as follows:
bayes <train-set-file> <test-set-file> <n|t>
where the last argument is a single character (either 'n' or 't') that indicates whether to use naive Bayes or TAN.
If you are using
a language that is not compiled to machine code (e.g. Java), then you
should make a small script called bayes
that accepts the
command-line arguments and invokes the appropriate source-code program
and interpreter. More instructions below!
Output
Your program should determine the network structure (in the case of TAN)
and estimate the model parameters using the given training set,
and then classify the instances in the test set.
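As an illustration of the classification step for naive Bayes: the posterior of each class is proportional to its prior times the product of the conditional probabilities of the instance's attribute values, normalized over the two classes. A minimal sketch under assumptions (Python; class_prior and cond_prob are hypothetical lookup tables of the Laplace-smoothed estimates described above); for TAN, the conditionals are additionally conditioned on each attribute's tree parent.

    # Sketch of naive Bayes prediction for one test instance (illustrative
    # only; class_prior[y] ~ P(class = y) and cond_prob[i][y][x] ~
    # P(X_i = x | class = y) are hypothetical tables of Laplace estimates).
    def classify_naive_bayes(instance, classes, class_prior, cond_prob):
        """Return (predicted_class, posterior_probability_of_that_class)."""
        joint = {}
        for y in classes:
            p = class_prior[y]
            for i, x in enumerate(instance):     # attribute values only, no label
                p *= cond_prob[i][y][x]
            joint[y] = p
        total = sum(joint.values())              # normalize over both classes
        predicted = max(classes, key=lambda y: joint[y])
        return predicted, joint[predicted] / total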
Your program should output the following:
- The structure of the Bayes net by listing one line per attribute in which you list (i) the name of the attribute, (ii) the names of its parents in the Bayes net (for naive Bayes, this will simply be the 'class' variable for each attribute) separated by whitespace.
- One line for each instance in the test set (in the same order as in the test-set file) indicating (i) the predicted class, (ii) the actual class, and (iii) the posterior probability of the predicted class.
- The number of test-set examples that were correctly classified.
You can test the correctness of your code using lymph_train.arff and lymph_test.arff.
This directory contains the sample outputs (with 16-digit precision) your code should produce for each data set, and some additional files showing the intermediate calculations for TAN.
However, we will be checking your outputs for precision ONLY up to 12 digits after the decimal point. Display the posterior probability rounded to 12 digits after the decimal point. The files Lymph_Naive Bayes_Rounded and Lymph_TAN_Rounded contain the output that your program should display for the lymph data set.
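To illustrate the formatting only (the exact layout should be checked against the sample output files above), here is a small sketch with made-up placeholder values:

    # Illustrative sketch only; all names and values below are made-up
    # placeholders, and the exact formatting should be checked against the
    # provided sample output files.
    structure = [("attr1", ["class"]), ("attr2", ["attr1", "class"])]
    predictions = [("positive", "positive", 0.987654321987654),
                   ("negative", "positive", 0.600000000000123)]

    # (i) one line per attribute: its name followed by the names of its parents
    for name, parents in structure:
        print(name + " " + " ".join(parents))

    # (ii) one line per test instance: predicted class, actual class, and the
    # posterior of the predicted class with 12 digits after the decimal point
    for predicted, actual, posterior in predictions:
        print("{} {} {:.12f}".format(predicted, actual, posterior))

    # (iii) the number of correctly classified test-set instances
    print(sum(1 for p, a, _ in predictions if p == a))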
Part 2
Plot a learning curve for both methods. You should plot points for
training set sizes of 25, 50, and 100 instances. For each
training-set size, randomly draw 4 different training sets and
evaluate each resulting model on the test set. For each training set
size, plot the average test-set accuracy. Be sure to label the axes
of your plot. Submit the learning curves in a PDF file.
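One possible way to generate these points is sketched below. It assumes Python with matplotlib and a hypothetical function run_bayes(train, test, mode) that trains on train, classifies test, and returns the test-set accuracy; any equivalent procedure and plotting tool is fine.

    # Illustrative learning-curve sketch (hypothetical run_bayes as described
    # in the text above).
    import random
    import matplotlib.pyplot as plt

    def plot_learning_curves(train_instances, test_instances, run_bayes,
                             sizes=(25, 50, 100), trials=4):
        for mode in ('n', 't'):                  # naive Bayes and TAN
            averages = []
            for size in sizes:
                accs = [run_bayes(random.sample(train_instances, size),
                                  test_instances, mode)
                        for _ in range(trials)]  # 4 random training sets per size
                averages.append(sum(accs) / len(accs))
            plt.plot(sizes, averages, marker='o',
                     label='naive Bayes' if mode == 'n' else 'TAN')
        plt.xlabel('Training set size (instances)')    # label the axes
        plt.ylabel('Average test-set accuracy')
        plt.legend()
        plt.savefig('learning_curves.pdf')             # the PDF to submit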
Submission Instructions
Create an executable that calls your program. Instructions to create the executable can be found here.
Create a directory named <yourwiscID_hw2>. This directory should contain:
- Your source files in a sub-directory named <src>.
- The executable shell script called 'bayes'.
- The PDF file containing the learning curves.
- Jar files or any other artifacts necessary to execute your code.
Compress this directory and submit the compressed zip file on Canvas.
Note:
- You need to ensure that your code will run, when called from the
command line as described above, on the CS department Linux
machines.
- You WILL be penalized if your program fails to meet any of the above specifications.
- Make sure to test your programs on CSL machines before you submit.
- You can use third party libraries such as WEKA/arff ONLY for parsing the ARFF files. Using any other machine learning libraries for your core program is strictly prohibited.