Be sure to use identical datasets for each algorithm you apply to your "personal" dataset (on this HW, as well as later HWs). Specifically, you will use 10-fold cross validation. This means you should randomly divide your personal dataset into ten disjoint, stratified (approximately equal class ratio) subsets of approximately equal size. Ten times you will train on nine of these ten disjoint subsets and then test on the one left out, each time leaving out a different one.
You should create the 10 training sets and the 10 testing sets via a separate program, and then save these 20 files to disk. (Notice that it is fine for each algorithm to make its own tuning sets.) You should also be sure to cleanly separate your code for reading the *.names and *.data files into data structures from the code specifically for your HW1 algorithms, since you'll want to reuse the file-reading code for the next couple of HWs.
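For instance, the stratified split could be built along these lines (a minimal sketch, not a required design; the Example class and the omitted file-writing step are placeholders for your own code):

import java.util.*;

// Minimal sketch: build ten stratified folds from a labeled dataset.
// The Example class below is a placeholder; writing the 20 train/test files to disk is omitted.
public class FoldBuilder {

    public static List<List<Example>> stratifiedFolds(List<Example> examples, int k, long seed) {
        // Group the examples by class label so every fold gets roughly the same class ratio.
        Map<String, List<Example>> byLabel = new HashMap<>();
        for (Example ex : examples) {
            byLabel.computeIfAbsent(ex.getLabel(), lbl -> new ArrayList<>()).add(ex);
        }
        List<List<Example>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) folds.add(new ArrayList<>());
        Random rng = new Random(seed);
        int next = 0;
        for (List<Example> group : byLabel.values()) {
            Collections.shuffle(group, rng);           // randomize within each class
            for (Example ex : group) {
                folds.get(next % k).add(ex);           // deal each class round-robin across folds
                next++;
            }
        }
        // Fold i is test set i; the union of the other k-1 folds is training set i.
        return folds;
    }

    // Placeholder for a labeled example; your real class will also hold the feature values.
    public static class Example {
        private final String label;
        public Example(String label) { this.label = label; }
        public String getLabel() { return label; }
    }
}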
You may wish to create a separate file, say cv.java, that runs a complete 10-fold cross-validation when its main method is invoked. It is legal in Java to have multiple classes with main methods, as long as they are in separate files.
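A bare-bones cv.java might look like this (the fold file-name scheme is illustrative; the calls into your k-NN and ID3 code are left as comments):

// Hypothetical driver (cv.java): loops over the ten saved fold files.
public class cv {
    public static void main(String[] args) {
        for (int fold = 0; fold < 10; fold++) {
            String trainFile = "fold" + fold + "_train.data";
            String testFile  = "fold" + fold + "_test.data";
            System.out.println("Fold " + fold + ": train=" + trainFile + " test=" + testFile);
            // ... read the .names file and these .data files with your shared file-reading code,
            // ... then run k-NN and the decision-tree learner and record their test-set accuracies.
        }
    }
}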
You will use leave-one-out testing within the training set to judge which value of K is best for the nearest-neighbors algorithm. Consider at least these values for K: 1, 3, 5, 9. For each example in a training set, collect its nearest neighbors (excluding itself) and see if the correct category is predicted. Notice that it will be much more efficient to "simultaneously" compute the predicted category for each value of K, rather than looping through all the examples for each K.
This approach is called "leave one out" since we essentially create N different tuning sets, each of size one (where N is the size of the training set and, hence, N-1 is the size of each leave-one-out "train" set).
For each "tuning" set example, collect the K nearest neighbors, and then take the most common category among these K neighbors. You may deal with breaking ties however you wish, but be sure to describe and justify your approach in your report.
Choose the K that does best in your tuning-set experiment (break any ties by choosing the largest K). After "tuning the K parameter," categorize the corresponding TESTSET, using the complete TRAINSET as the set of possible neighbors.
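One possible shape for the leave-one-out tuning loop is sketched below; it sorts each held-out example's neighbors once and scores all candidate K values from that one sorted list. The Example class, Euclidean distance, and tie-breaking rule shown here are illustrative assumptions, not requirements:

import java.util.*;

// Sketch only: choose K by leave-one-out tuning on the training set.
public class KnnTuning {
    static final int[] CANDIDATE_KS = {1, 3, 5, 9};   // assumes the training set has more than 9 examples

    public static int chooseK(List<Example> train) {
        int[] correct = new int[CANDIDATE_KS.length];
        for (Example heldOut : train) {
            // Sort all OTHER training examples by distance to the held-out example once...
            List<Example> others = new ArrayList<>(train);
            others.remove(heldOut);
            others.sort(Comparator.comparingDouble(o -> distance(heldOut, o)));
            // ...then score every candidate K from the same sorted list ("simultaneously").
            for (int i = 0; i < CANDIDATE_KS.length; i++) {
                String predicted = majorityLabel(others.subList(0, CANDIDATE_KS[i]));
                if (predicted.equals(heldOut.getLabel())) correct[i]++;
            }
        }
        int best = 0;
        for (int i = 1; i < CANDIDATE_KS.length; i++) {
            if (correct[i] >= correct[best]) best = i;   // ties go to the LARGER K
        }
        return CANDIDATE_KS[best];
    }

    static String majorityLabel(List<Example> neighbors) {
        Map<String, Integer> votes = new HashMap<>();
        for (Example n : neighbors) votes.merge(n.getLabel(), 1, Integer::sum);
        // Tie-breaking here (whichever maximal entry Collections.max returns) is one arbitrary
        // policy; whatever you choose, describe and justify it in your report.
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static double distance(Example a, Example b) {     // plain Euclidean distance, as an example
        double sum = 0.0;
        for (int i = 0; i < a.features.length; i++) {
            double d = a.features[i] - b.features[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static class Example {
        final String label;
        final double[] features;
        public Example(String label, double[] features) { this.label = label; this.features = features; }
        public String getLabel() { return label; }
    }
}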
You should implement a function called RUN_ID3(String namesFile, String trainsetFile, String splittingFunction), whose arguments are, in order, the file describing the examples, the trainset to use, and the splitting function to use (one of: info_gain, gain_ratio).
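For reference, a hedged sketch of the two splitting criteria, computed from per-branch class counts (the array layout is an assumption about how you might store the counts):

// Sketch of the two splitting criteria over class-count vectors.
public class SplitCriteria {
    // Entropy of a class-count vector, in bits.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // branchCounts[v] holds the class counts of the examples that take branch v of the feature.
    static double infoGain(int[] parentCounts, int[][] branchCounts) {
        int total = 0;
        for (int c : parentCounts) total += c;
        double remainder = 0.0;
        for (int[] branch : branchCounts) {
            int n = 0;
            for (int c : branch) n += c;
            remainder += ((double) n / total) * entropy(branch);
        }
        return entropy(parentCounts) - remainder;
    }

    static double gainRatio(int[] parentCounts, int[][] branchCounts) {
        int total = 0;
        for (int c : parentCounts) total += c;
        double splitInfo = 0.0;
        for (int[] branch : branchCounts) {
            int n = 0;
            for (int c : branch) n += c;
            if (n == 0) continue;
            double p = (double) n / total;
            splitInfo -= p * (Math.log(p) / Math.log(2));
        }
        return splitInfo == 0.0 ? 0.0 : infoGain(parentCounts, branchCounts) / splitInfo;
    }
}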
You should also write a function DUMP_TREE() that prints, in some reasonable fashion, the most recently learned decision tree, and a function REPORT_TREE_SIZE() that reports the number of interior and leaf nodes (and their sum) in the most recently learned decision tree. To prevent wasting paper, limit the maximum depth of printed trees to THREE interior nodes (including the root); wherever the tree is deeper than that, simply print something like "there are X interior nodes below this one; P positive and N negative training examples reached this node."
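A sketch of how the depth-limited printing and size reporting might look (the Node fields and message wording are assumptions about your own tree representation):

import java.util.ArrayList;
import java.util.List;

// Sketch of depth-limited tree printing and size reporting.
public class TreePrinter {
    static final int MAX_PRINT_DEPTH = 3;   // interior nodes to print, counting the root

    public static class Node {
        String splitFeature;                 // null for a leaf
        String label;                        // prediction stored at a leaf
        int posReached, negReached;          // training examples that reached this node
        List<Node> children = new ArrayList<>();
        boolean isLeaf() { return splitFeature == null; }
    }

    public static void dumpTree(Node node, int depth) {
        String indent = "  ".repeat(depth);
        if (node.isLeaf()) {
            System.out.println(indent + "predict " + node.label);
        } else if (depth >= MAX_PRINT_DEPTH) {
            // Stop printing splits here and summarize the unprinted subtree instead.
            System.out.println(indent + "there are " + countInterior(node)
                    + " interior nodes in this unprinted subtree; " + node.posReached
                    + " positive and " + node.negReached
                    + " negative training examples reached this node");
        } else {
            System.out.println(indent + "split on " + node.splitFeature);
            for (Node child : node.children) dumpTree(child, depth + 1);
        }
    }

    public static void reportTreeSize(Node root) {
        int interior = countInterior(root);
        int leaves = countLeaves(root);
        System.out.println(interior + " interior + " + leaves + " leaf nodes = "
                + (interior + leaves) + " total");
    }

    static int countInterior(Node n) {
        if (n.isLeaf()) return 0;
        int count = 1;
        for (Node child : n.children) count += countInterior(child);
        return count;
    }

    static int countLeaves(Node n) {
        if (n.isLeaf()) return 1;
        int count = 0;
        for (Node child : n.children) count += countLeaves(child);
        return count;
    }
}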
A decision tree that correctly classifies all the training examples might be too complicated; some subtree might generalize better to future examples. In this homework, you will implement a stochastic search for a simpler tree, according to the following (call this variant pruned):
Option: if you prefer, you may instead implement one of the other approaches described in class for pruning your decision tree.
Report the resulting test-set accuracies, as well as the mean and standard deviation, for both K-NN and the tree learner. Also report, for each fold, the chosen value of K.
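A small sketch of the statistics to report over the ten folds (this uses the sample standard deviation, i.e., n-1 in the denominator):

public class Stats {
    // Mean and sample standard deviation over the ten per-fold accuracies.
    public static double[] meanAndStdDev(double[] acc) {
        double mean = 0.0;
        for (double a : acc) mean += a;
        mean /= acc.length;
        double ss = 0.0;
        for (double a : acc) ss += (a - mean) * (a - mean);
        return new double[]{mean, Math.sqrt(ss / (acc.length - 1))};   // n-1: sample std dev
    }
}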
Consider the following question: are the differences you observe between the two algorithms statistically significant?
We'll use a two-tailed, paired t-test to compute the probability that any differences are due to chance (i.e., our null hypothesis is that the two approaches under consideration work equally well). Since we have ten test sets, the number of degrees of freedom is 9, but it is acceptable for the purposes of this homework to use the v=10 row in Table 5.6 of Mitchell. You may do the test by hand or use a software package such as Excel.
Discuss the statistical significance of the differences you encountered in your experiments. The general rule of thumb is that a measured difference is considered significant if the probability it happened by chance is at most 5%.
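If you compute the test yourself, the paired t statistic over the ten per-fold accuracy differences can be computed along these lines (a sketch; compare |t| against the two-tailed critical value, e.g., from Table 5.6):

public class PairedTTest {
    // Paired t statistic over per-fold accuracy differences; degrees of freedom = n - 1.
    public static double pairedT(double[] accA, double[] accB) {
        int n = accA.length;                       // here, n = 10 folds
        double meanDiff = 0.0;
        for (int i = 0; i < n; i++) meanDiff += accA[i] - accB[i];
        meanDiff /= n;
        double ss = 0.0;
        for (int i = 0; i < n; i++) {
            double d = (accA[i] - accB[i]) - meanDiff;
            ss += d * d;
        }
        double stdErr = Math.sqrt(ss / (n - 1)) / Math.sqrt(n);
        return meanDiff / stdErr;
    }
}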
using your actual CS login name in place of {your login name}. Including a README file is encouraged. Your solution will be partially automatically graded. Our semi-automatic grading program will look for HW1.class, so your code should contain a main function that takes three arguments:
HW1 task.names train_examples.data test_examples.data
Your program should read in the data, then run the two algorithms of this HW using the examples in the train set, printing (via System.out.println) each algorithm's accuracy on the examples in the test set.
Also print the selected value for K and the chosen feature subset.
(Be aware that we will use our own datasets during grading.)
We'll only be calling HW1 on one train/test fold during grading.
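A hedged skeleton of the graded entry point; everything other than the class name, the main signature, and the use of System.out.println is a placeholder for your own code:

// Skeleton of the graded entry point; the numbered comments mark where your code goes.
public class HW1 {
    public static void main(String[] args) {
        if (args.length != 3) {
            System.out.println("usage: HW1 task.names train_examples.data test_examples.data");
            return;
        }
        String namesFile = args[0], trainFile = args[1], testFile = args[2];
        // 1. Parse the .names and .data files into your shared data structures.
        // 2. Tune K by leave-one-out on the train set, then classify the test set with k-NN.
        // 3. Learn the decision tree on the train set, then classify the test set with it.
        // The lines below are illustrative; print the chosen K, any chosen feature subset,
        // and each algorithm's test-set accuracy.
        System.out.println("k-NN: chose K = ...; test-set accuracy = ...");
        System.out.println("ID3:  test-set accuracy = ...");
    }
}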