For the TAN algorithm. Your program should:
Your program should read files that are in the ARFF format. In this format, each instance is described on a single line. The feature values are separated by commas, and the last value on each line is the class label of the instance. Each ARFF file starts with a header section describing the features and the class labels. Lines starting with '%' are comments. See the link above for a brief, but more detailed description of the ARFF format. Your program needs to handle only discrete attributes, and simple ARFF files (i.e. don't worry about sparse ARFF files and instance weights). Example ARFF files are provided below. Your program can assume that the class attribute is named 'class' and it is the last attribute listed in the header section.
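As a rough illustration of the format described above, here is a minimal Python sketch of a reader for simple, dense ARFF files with discrete attributes only. The function name `parse_arff` and its return shape are hypothetical choices for this example, not part of the assignment; quoting is handled only in the simplest way (surrounding quotes are stripped).

```python
import re

def parse_arff(lines):
    """Parse a simple, dense ARFF file containing only discrete attributes.

    Returns (attributes, instances), where attributes is a list of
    (name, [possible values]) in header order and instances is a list of
    value lists. The last attribute is assumed to be the class.
    """
    attributes = []
    instances = []
    in_data = False
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and '%' comment lines.
        if not line or line.startswith('%'):
            continue
        if in_data:
            # One instance per line, values separated by commas.
            instances.append([v.strip().strip("'\"") for v in line.split(',')])
            continue
        lower = line.lower()
        if lower.startswith('@attribute'):
            match = re.match(
                r"@attribute\s+('[^']*'|\"[^\"]*\"|\S+)\s*\{(.*)\}",
                line,
                re.IGNORECASE,
            )
            if match is None:
                raise ValueError('only discrete attributes are supported: ' + line)
            name = match.group(1).strip("'\"")
            values = [v.strip().strip("'\"") for v in match.group(2).split(',')]
            attributes.append((name, values))
        elif lower.startswith('@data'):
            in_data = True
        # Other header lines (e.g. @relation) are ignored.
    return attributes, instances
```

A caller would typically pass an open file object, for example `parse_arff(open(path))`, and treat `attributes[-1]` as the class attribute.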
The program should be called bayes and should accept three command-line arguments as follows:
bayes <train-set-file> <test-set-file> <n|t>
where the last argument is a single character (either 'n' or 't') that indicates whether to use naive Bayes or TAN.
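One possible way to check these arguments in Python is sketched below. It follows the usage line shown above (two file names and a mode character after the program name); the helper name `parse_args` is hypothetical.

```python
import sys

USAGE = 'usage: bayes <train-set-file> <test-set-file> <n|t>'

def parse_args(argv):
    """Validate argv (including the program name at argv[0]).

    Returns (train_path, test_path, mode), where mode is 'n' for
    naive Bayes or 't' for TAN. Exits with a usage message otherwise.
    """
    if len(argv) != 4 or argv[3] not in ('n', 't'):
        raise SystemExit(USAGE)
    return argv[1], argv[2], argv[3]

if __name__ == '__main__' and len(sys.argv) > 1:
    train_path, test_path, mode = parse_args(sys.argv)
```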
If you are using a language that is not compiled to machine code (e.g. Java), then you should make a small script called bayes that accepts the command-line arguments and invokes the appropriate source-code program and interpreter. More instructions below!
Your program should determine the network structure (in the case of TAN) and estimate the model parameters using the given training set, and then classify the instances in the test set. Your program should output the following:
You can test the correctness of your code using lymph_train.arff and lymph_test.arff.
This directory contains the outputs your code should produce for each data set, and some additional files showing the intermediate calculations for TAN.
Edit:
We will be checking your outputs for precision up to 12 digits after the decimal point. Display the posterior probability rounded to 12 digits after the decimal point. The files Lymph_Naive Bayes_Rounded and Lymph_TAN_Rounded contain the output that your program should display for the lymph data set.
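In Python, for instance, fixed-point formatting is one way to print a probability with exactly 12 digits after the decimal point. The helper name `format_posterior` is hypothetical, and the rest of the output line should follow the reference output files.

```python
def format_posterior(probability):
    """Return the probability rounded to 12 digits after the decimal point."""
    return f"{probability:.12f}"

print(format_posterior(2 / 3))  # prints 0.666666666667
```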
Plot a learning curve for both methods. You should plot points for training set sizes of 25, 50, and 100 instances. For each training-set size, randomly draw 4 different training sets and evaluate each resulting model on the test set. For each training set size, plot the average test-set accuracy. Be sure to label the axes of your plot. Submit the learning curves in a PDF file.
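The sampling and averaging procedure described above could be sketched in Python as follows. This is only an illustration under stated assumptions: `learning_curve` and the `evaluate` callback (which should train a model on the given subset and return its test-set accuracy) are hypothetical names, subsets are drawn without replacement, and the seed is fixed only to make runs repeatable. Plotting is left out.

```python
import random

def learning_curve(train_set, evaluate, sizes=(25, 50, 100), repeats=4, seed=0):
    """Return [(size, average accuracy)] for each training-set size.

    evaluate(subset) is assumed to train on subset and return the
    resulting model's accuracy on the test set.
    """
    rng = random.Random(seed)
    curve = []
    for size in sizes:
        accuracies = []
        for _ in range(repeats):
            # Draw a random training subset without replacement.
            subset = rng.sample(train_set, size)
            accuracies.append(evaluate(subset))
        curve.append((size, sum(accuracies) / repeats))
    return curve
```

Note that if the full training set has exactly as many instances as the largest size, every draw at that size contains the same instances, so the repeated runs at that size will agree.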
<Wisc username>_HW2.zip.
You need to ensure that your code will run, when called from the command line as described above, on the CS department Linux machines.