CS 760: Homework 1
Experimental Methodology, Decision Trees and k-NN


Overview

The purpose of this homework is to implement k-nearest neighbor and a decision tree learner, as well as to investigate some typical experimental methodologies used in machine learning.

Part 1: Cross Validating the Expected Future Performance

First set aside your test sets for estimating future accuracy. It is important that you never use this data until you are done tuning your learners.

Be sure you use identical data splits for every algorithm that you apply to your `personal' dataset (on this HW, as well as later HWs). Specifically, you will use 10-fold cross validation: randomly divide your `personal' dataset into ten disjoint, stratified (approximately equal class ratio) subsets of approximately equal size. Ten times you will train on nine of these ten disjoint subsets and then test on the one left out, each time leaving out a different one.

You should create the 10 training sets and the 10 testing sets via a separate program, and then save these 20 files to disk. (Notice that it is fine for each algorithm to make its own tuning sets.) You should also be sure to cleanly separate your code for reading the *.names and *.data files into data structures from the code specifically for your HW1 algorithms, since you'll want to reuse the file-reading code for the next couple of HWs.
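For illustration, here is a minimal sketch of the stratified splitting step, assuming a generic example type T and a hypothetical labelOf accessor supplied by your file-reading code (the class and method names are illustrative, not required):

    import java.util.*;
    import java.util.function.Function;

    // Sketch: stratified fold assignment. T is whatever example type your *.data
    // reader produces; labelOf extracts an example's class label (hypothetical).
    public class FoldSplitter {
        public static <T> List<List<T>> stratifiedFolds(List<T> data, Function<T, String> labelOf,
                                                        int numFolds, long seed) {
            // Group examples by class so every fold ends up with roughly the same class ratio.
            Map<String, List<T>> byLabel = new HashMap<>();
            for (T ex : data)
                byLabel.computeIfAbsent(labelOf.apply(ex), k -> new ArrayList<>()).add(ex);

            List<List<T>> folds = new ArrayList<>();
            for (int i = 0; i < numFolds; i++) folds.add(new ArrayList<>());

            Random rng = new Random(seed);
            int next = 0;
            for (List<T> group : byLabel.values()) {
                Collections.shuffle(group, rng);    // randomize order within each class
                for (T ex : group)                  // deal round-robin so fold sizes stay balanced
                    folds.get(next++ % numFolds).add(ex);
            }
            return folds;
        }
    }

Fold i's test set is folds.get(i); its training set is the other nine folds concatenated, and these are the 20 files you write to disk (in the same format your *.data reader expects).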

You may wish to create a separate file, say cv.java, that runs a complete 10-fold cross-validation when its main method is invoked. It is legal in Java for multiple classes to each define a main method, as long as they are in separate files.

Part 2: Implementing the K Nearest-Neighbor Algorithm

Implement the K nearest-neighbor algorithm (K-NN). Design and justify (in your report) a distance function for K-NN that works with both discrete and continuous features (you don't need to design your distance function for hierarchical features).
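One common starting point (you still need to design and justify your own choice) is a HEOM-style per-feature distance: a 0/1 mismatch penalty for discrete features and a range-normalized absolute difference for continuous ones. In the sketch below, the Object[] representation and the precomputed featureRange array are assumptions about how you might store examples, not requirements:

    // Sketch of one possible mixed distance: exact-match for discrete features,
    // range-scaled absolute difference for continuous ones. Each example is assumed
    // to be an Object[] holding String values for discrete features and Double values
    // for continuous ones; featureRange[i] is (max - min) of feature i over the
    // training set, precomputed elsewhere (hypothetical).
    public static double distance(Object[] a, Object[] b,
                                  boolean[] isContinuous, double[] featureRange) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d;
            if (isContinuous[i]) {
                double range = featureRange[i] > 0 ? featureRange[i] : 1.0;  // guard against /0
                d = Math.abs((Double) a[i] - (Double) b[i]) / range;         // scaled to [0, 1]
            } else {
                d = a[i].equals(b[i]) ? 0.0 : 1.0;                           // mismatch penalty
            }
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

Scaling continuous differences by the feature's range keeps any single wide-ranged feature from dominating the discrete features' 0/1 contributions; whatever you choose, justify it in your report.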

You will use leave-one-out testing within the training set to judge which value of K is best for the nearest-neighbors algorithm. Consider at least these values for K: 1, 3, 5, 9. For each example in a training set, collect its nearest neighbors (excluding itself) and see if the correct category is predicted. Notice that it will be much more efficient to `simultaneously' compute the predicted category for each value of K, rather than looping through all the examples for each K.
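A sketch of that bookkeeping is below; train, dist(i, j), and label(i) are hypothetical stand-ins for your training fold, distance function, and label accessor. The key point is that each held-out example's neighbor list is sorted once and then reused for every candidate K:

    // Sketch: leave-one-out tuning that scores all candidate K values from a single
    // sorted neighbor list per held-out example.
    int[] candidateKs = {1, 3, 5, 9};
    int[] numCorrect = new int[candidateKs.length];

    for (int i = 0; i < train.size(); i++) {
        final int held = i;
        // Sort the indices of all *other* training examples by distance to example i.
        List<Integer> others = new ArrayList<>();
        for (int j = 0; j < train.size(); j++) if (j != held) others.add(j);
        others.sort(Comparator.comparingDouble(j -> dist(held, j)));

        for (int c = 0; c < candidateKs.length; c++) {
            // Majority vote among the first K entries of the same sorted list.
            Map<String, Integer> votes = new HashMap<>();
            for (int n = 0; n < candidateKs[c]; n++)
                votes.merge(label(others.get(n)), 1, Integer::sum);
            String predicted = Collections.max(votes.entrySet(),
                                               Map.Entry.comparingByValue()).getKey();
            if (predicted.equals(label(held))) numCorrect[c]++;
        }
    }

Note that Collections.max breaks vote ties arbitrarily; replace that line with whatever tie-breaking rule you adopt and describe in your report.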

This approach is called `leave one out' since we essentially create N different tuning sets, each of size one (where N is the size of the training set and, hence, N-1 is the size of the train' set).

For each `tuning' set example, collect the K nearest neighbors, and then take the most common category among these K neighbors. You may deal with breaking ties however you wish, but be sure to describe and justify your approach in your report.

Choose the K that does best in your tuning-set experiment (break any ties by choosing the largest K). After `tuning the K parameter,' categorize the corresponding TESTSET, using the complete TRAINSET as the set of possible neighbors.
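Continuing the previous sketch, picking K and then scoring the test fold might look like this (classify, testSet, labelOf, and Example are again hypothetical placeholders); because candidateKs is listed in increasing order, the >= comparison breaks ties toward the larger K, as required above:

    // Sketch: choose K on the tuning results, then classify the test set using the
    // entire training fold as the set of possible neighbors.
    int bestK = candidateKs[0], bestCorrect = -1;
    for (int c = 0; c < candidateKs.length; c++) {
        if (numCorrect[c] >= bestCorrect) {       // >= lets a later (larger) K win ties
            bestCorrect = numCorrect[c];
            bestK = candidateKs[c];
        }
    }

    int testCorrect = 0;
    for (Example ex : testSet)
        if (classify(ex, train, bestK).equals(labelOf(ex))) testCorrect++;
    double testSetAccuracy = (double) testCorrect / testSet.size();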

Part 3: The Decision Tree Learner

Implement the algorithm in Mitchell's Table 3.1, augmenting it as explained below. You should also implement the method discussed in Section 3.7.2 for handling continuous features (some additional comments on this); you need not deal with hierarchical features on this homework.
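As a reminder of how Section 3.7.2's candidate thresholds are generated, here is a small sketch; values and labels are parallel arrays for one continuous feature over the current node's training examples (hypothetical inputs), and each returned threshold t defines a boolean test `feature <= t' that is scored like any other split:

    import java.util.*;

    // Sketch: sort the examples by the continuous feature's value and propose a
    // threshold midway between adjacent values whose examples differ in class.
    public static List<Double> candidateThresholds(double[] values, String[] labels) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        List<Double> thresholds = new ArrayList<>();
        for (int k = 1; k < order.length; k++) {
            int prev = order[k - 1], cur = order[k];
            // Only boundaries where the class changes (and the value actually changes) matter.
            if (!labels[prev].equals(labels[cur]) && values[prev] != values[cur])
                thresholds.add((values[prev] + values[cur]) / 2.0);
        }
        return thresholds;
    }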

You should implement a function called RUN_ID3(String namesFile, String trainsetFile, String splittingFunction), whose arguments are, in order, the file describing the examples, the trainset to use, and the splitting function to use (one of: info_gain, gain_ratio).
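For reference, a sketch of the two scoring functions is below (assuming java.util imports); each class-count map records how many examples of each class reach a node, with one map for the parent node and one per branch of the candidate split. These helper signatures are illustrative, not required:

    // Sketch: entropy, information gain, and gain ratio over class-count maps.
    static double entropy(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) total += c;
        double h = 0.0;
        for (int c : counts.values()) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));          // log base 2
        }
        return h;
    }

    static int size(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) total += c;
        return total;
    }

    static double infoGain(Map<String, Integer> parent, List<Map<String, Integer>> children) {
        double remainder = 0.0;
        for (Map<String, Integer> child : children)
            remainder += ((double) size(child) / size(parent)) * entropy(child);
        return entropy(parent) - remainder;
    }

    static double gainRatio(Map<String, Integer> parent, List<Map<String, Integer>> children) {
        double splitInfo = 0.0;
        for (Map<String, Integer> child : children) {
            if (size(child) == 0) continue;
            double p = (double) size(child) / size(parent);
            splitInfo -= p * (Math.log(p) / Math.log(2));
        }
        return splitInfo == 0.0 ? 0.0 : infoGain(parent, children) / splitInfo;   // guard against /0
    }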

You should also write a function DUMP_TREE() that prints, in some reasonable fashion, the most recently learned decision tree and the function REPORT_TREE_SIZE() which reports the number of interior and leaf nodes (and their sum) in the most recently learned decision tree. To prevent wasting paper, limit the maximum depth of printed trees to THREE interior nodes (including the root); wherever the tree is deeper than that, simply print something like "there are N interior nodes below this one; P positive and N negative training examples reached this node."
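A sketch of how the size report and the depth-limited dump might fit together is below; the Node fields are illustrative, and the summary wording can be whatever `reasonable fashion' you prefer:

    // Sketch: a simple node representation plus REPORT_TREE_SIZE / DUMP_TREE logic.
    class Node {
        boolean isLeaf;
        String label;                    // prediction at a leaf
        String splitFeature;             // test at an interior node
        List<Node> children = new ArrayList<>();
        int posReached, negReached;      // training examples that reached this node
    }

    static int[] countNodes(Node n) {    // returns {interiorCount, leafCount}
        if (n.isLeaf) return new int[]{0, 1};
        int interior = 1, leaves = 0;
        for (Node child : n.children) {
            int[] c = countNodes(child);
            interior += c[0];
            leaves += c[1];
        }
        return new int[]{interior, leaves};
    }

    static void reportTreeSize(Node root) {
        int[] c = countNodes(root);
        System.out.println("interior = " + c[0] + ", leaf = " + c[1] + ", total = " + (c[0] + c[1]));
    }

    static void dumpTree(Node n, int interiorDepth, String indent) {
        if (n.isLeaf) { System.out.println(indent + "predict " + n.label); return; }
        if (interiorDepth >= 3) {        // interior levels 0, 1, 2 (root included) get printed
            System.out.println(indent + countNodes(n)[0] + " interior nodes here and below; "
                + n.posReached + " positive and " + n.negReached
                + " negative training examples reached this node.");
            return;
        }
        System.out.println(indent + "split on " + n.splitFeature);
        for (Node child : n.children) dumpTree(child, interiorDepth + 1, indent + "  ");
    }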

A decision tree that correctly classifies all the training examples might be too complicated; some subtree might generalize better to future examples. In this homework, you will implement a stochastic search for a simpler tree, according to the following (call this variant pruned):

  1. Given a training set, create a train' set and a tuning (or "pruning") set; place 20% of the examples in the tuning set.
  2. Create a tree that fully fits the train' set; use info_gain as the scoring function for features. Call it Tree. Set scoreOfBestTreeFound to the score of Tree on the tuning set.
  3. Number all the interior nodes from 1 to N (eg, using a "preorder" traversal; however the precise method for numbering nodes doesn't matter, as long as each interior node is counted once and only once).
  4. L times do the following
  5. Return the best tree found (on the tune set).
Let L=100 and K=5 (arbitrarily chosen values; if your code runs too slowly, feel free to adjust L, but be sure to document this in your HW writeup; also feel free to use higher values for L).
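To be explicit about what is and isn't pinned down here: the skeleton below shows only the grounded parts of the procedure (the 80/20 split, the fully grown info_gain tree, node numbering, L scored candidates, and keeping the best tuning-set tree). The body of step 4 is hidden behind the hypothetical makeCandidate helper, which you must fill in according to step 4 above; the other helper names (splitTrainTune, buildTree, numberInteriorNodes, tuningSetAccuracy) are likewise placeholders for your own code.

    // Sketch of the `pruned' variant's outer loop. makeCandidate is a placeholder
    // for the handout's step 4 -- generate a candidate (pruned) tree however that
    // step specifies; everything else just tracks the best tuning-set score.
    Node prunedTree(List<Example> trainFold, int L, int K, Random rng) {
        List<List<Example>> split = splitTrainTune(trainFold, 0.20, rng);  // 20% to the tuning set
        List<Example> trainPrime = split.get(0), tune = split.get(1);

        Node tree = buildTree(trainPrime, "info_gain");      // step 2: fully fit train'
        int n = numberInteriorNodes(tree);                   // step 3: interior nodes numbered 1..N

        Node bestTree = tree;
        double scoreOfBestTreeFound = tuningSetAccuracy(tree, tune);

        for (int i = 0; i < L; i++) {                        // step 4, repeated L times
            Node candidate = makeCandidate(tree, K, n, rng); // placeholder for step 4's body
            double score = tuningSetAccuracy(candidate, tune);
            if (score > scoreOfBestTreeFound) {              // keep the best tree found so far
                scoreOfBestTreeFound = score;
                bestTree = candidate;
            }
        }
        return bestTree;                                     // step 5: best tree on the tune set
    }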

Option: if you prefer, you may instead implement one of the other approaches described in class for pruning your decision tree.

Part 4: Running Your Algorithms

Run each learning algorithm (K-NN and your tree learner) on your training folds, and test each result on the corresponding testing set. For K-NN and the tree learner, use all ten folds. Remember that when `tuning parameters' (e.g., K), you may only use the training data (e.g., via leave-one-out within the training set); the held-out test sets must never be used for tuning.

Report the resulting test-set accuracies, as well as the mean and standard deviation for K-NN and the tree learner. Also report, for each fold, the chosen value of K.

Part 5: Judging the Statistical Significance of the Differences in Generalization Performance

You should now have two sets of ten testset accuracies. The mean accuracy for each of these sets of ten numbers provides a good indication of the expected future accuracy of each approach on your dataset. However, we'd also like to know whether the differences in generalization performance are statistically significant.

Consider the following question: is the difference in generalization performance between K-NN and your decision tree learner on your dataset statistically significant?

We'll use a two-tailed, paired t-test to compute the probability that any differences are due to chance (i.e., our null hypothesis is that the two approaches under consideration work equally well). Since we have ten test sets, there are 9 degrees of freedom, but for the purposes of this homework it is acceptable to use the v=10 row in Table 5.6 of Mitchell. You may do the test by hand or use a software package such as Excel.
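If you compute it in your own code, the statistic is just the mean per-fold accuracy difference divided by its standard error; a small sketch (the helper name is hypothetical, not required) is:

    // Sketch: paired t statistic over the ten per-fold accuracies.
    // accA and accB are the two learners' test-set accuracies, paired by fold.
    static double pairedT(double[] accA, double[] accB) {
        int n = accA.length;                      // ten folds => n - 1 = 9 degrees of freedom
        double mean = 0.0;
        double[] d = new double[n];
        for (int i = 0; i < n; i++) { d[i] = accA[i] - accB[i]; mean += d[i]; }
        mean /= n;

        double sumSq = 0.0;
        for (double di : d) sumSq += (di - mean) * (di - mean);
        double sd = Math.sqrt(sumSq / (n - 1));   // sample std. deviation of the differences

        return mean / (sd / Math.sqrt(n));        // compare |t| to the table's critical value
    }

Compare |t| against the two-tailed critical value from the table; if |t| exceeds the value at the 95% confidence level, the measured difference is significant by the rule of thumb below.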

Discuss the statistical significance of the differences you encountered in your experiments. The general rule of thumb is that a measured difference is considered significant if the probability it happened by chance is at most 5%.

Summary

What to Turn In

  1. Prepare a short report on your implementations and experiments. Be sure to provide all the information and discussions requested above. You need only briefly summarize your experimental methodology, since it is documented in this HW writeup.

  2. Hand in a hard-copy of all of the (commented) code you write.

  3. Copy your train and test sets, HW1.java, and any other files needed to create and run your program as well as your Makefile (if you have one) into the following directory:

    ~cs760-1/handin/{your login name}/HW1

    using your actual CS login name in place of {your login name}. Including a README file is encouraged. Your solution will be partially automatically graded. Our semi-automatic grading program will look for HW1.class, so your code should contain a main method that takes three arguments:

    HW1 task.names train_examples.data test_examples.data
    Your program should read in the data, then run the two algorithms of this HW using the examples in the train set, printing (via System.out.println) each algorithm's accuracy on the examples in the test set. Also print the selected value for K and the chosen feature subset. (Be aware that we will use our own datasets during grading.) We'll only be calling HW1 on one train/test fold during grading.
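    A minimal sketch of that entry point, with Names, Example, readNames, readData, tuneK, knnAccuracy, and treeAccuracy as hypothetical stand-ins for your own code, might look like:

    // Sketch of the HW1 entry point the grading script will invoke.
    public class HW1 {
        public static void main(String[] args) {
            if (args.length != 3) {
                System.out.println("usage: HW1 task.names train_examples.data test_examples.data");
                return;
            }
            Names names = readNames(args[0]);                 // feature/label descriptions
            List<Example> train = readData(args[1], names);
            List<Example> test = readData(args[2], names);

            int bestK = tuneK(train);                         // leave-one-out within the train set
            System.out.println("chosen K = " + bestK);
            System.out.println("k-NN test-set accuracy = " + knnAccuracy(train, test, bestK));
            System.out.println("decision-tree test-set accuracy = " + treeAccuracy(train, test));
        }
    }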