
CS 760: Homework 1
Experimental Methodology, Feature Selection, k-NN and Naive Bayes


Overview

The primary purpose of this homework is to investigate some typical experimental methodologies used in machine learning. For simplicity, you will use nearest neighbor and Naive Bayes as the learning algorithms. You will learn how to use different cross validation techniques to estimate parameters for learning algorithms as well as to determine how well the learning algorithms will perform on new datasets. You will also learn how you might be able to improve the performance of the learning algorithms by doing feature selection. Finally, you will use a statistical method to judge the statistical significance of the difference in performance between two learning algorithms.

Part 1: Cross Validating the Expected Future Performance

First set aside your test sets for estimating future accuracy. It is important that you never look at this data until you are done tuning your learners.

Be sure you use identical datasets for each algorithm that you apply to your `personal' dataset (on this HW, as well as later HW's). Specifically, you will use 10-fold cross validation. This means you should randomly divide your `personal dataset' into ten disjoint subsets of (approximately) equal size. Ten times you will train on nine of these ten disjoint subsets and then test on the one left out, each time leaving out a different one.

You should create the 10 training sets and the 10 testing sets via a separate program, and then save these 20 files to disk. (Notice that it is fine for each algorithm to make its own tuning sets.) You should also be sure to cleanly separate your code for reading the *.names and *.data files into data structures from the code specifically for your HW1 algorithms, since you'll want to reuse the file-reading code for the next couple of HWs.

You may wish to create a separate file, say cv.java, that runs a complete 10-fold cross-validation when its main method is invoked. It is legal in Java for multiple classes to have main methods, as long as they are in separate files.
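
For concreteness, here is one way the fold-splitting step might be organized. This is only a sketch; the names used (FoldMaker, makeFolds, trainSetFor) are illustrative rather than required, and the file writing is omitted.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Sketch: shuffle the examples once, deal them round-robin into 10 folds,
    // then write out train_i / test_i files for i = 1..10 (file writing omitted).
    public class FoldMaker {
        public static <E> List<List<E>> makeFolds(List<E> examples, int numFolds, long seed) {
            List<E> shuffled = new ArrayList<>(examples);
            Collections.shuffle(shuffled, new Random(seed));
            List<List<E>> folds = new ArrayList<>();
            for (int i = 0; i < numFolds; i++) folds.add(new ArrayList<>());
            for (int i = 0; i < shuffled.size(); i++) {
                folds.get(i % numFolds).add(shuffled.get(i));  // round-robin keeps fold sizes nearly equal
            }
            return folds;
        }

        // The training set for fold i is everything not in fold i; the test set is fold i itself.
        public static <E> List<E> trainSetFor(List<List<E>> folds, int heldOut) {
            List<E> train = new ArrayList<>();
            for (int i = 0; i < folds.size(); i++) {
                if (i != heldOut) train.addAll(folds.get(i));
            }
            return train;
        }
    }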

Part 2: Implementing the k Nearest-Neighbor and Naive Bayes Approaches

Implement the k nearest-neighbor algorithm (k-NN) and the Naive Bayes algorithm.

Design and justify (in your report) a distance function for k-NN that works with both discrete and continuous features, as well as hierarchical features (though you don't need to implement your distance function for hierarchical features).
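
For example, one common starting point (not the only acceptable design) is a 0/1 mismatch penalty for discrete features and a range-normalized difference for continuous ones. The feature representation assumed below (an Object[] holding Strings and Doubles) is purely illustrative.

    // Sketch of a mixed-feature distance. Assumes each example is an Object[] whose
    // discrete features are Strings and whose continuous features are Doubles; range[f]
    // holds (max - min) of continuous feature f over the training set.
    public static double distance(Object[] a, Object[] b, boolean[] isContinuous, double[] range) {
        double sum = 0.0;
        for (int f = 0; f < a.length; f++) {
            if (isContinuous[f]) {
                double av = (Double) a[f];
                double bv = (Double) b[f];
                double diff = (range[f] > 0) ? (av - bv) / range[f] : (av - bv);  // scale to roughly [0,1]
                sum += diff * diff;
            } else if (!a[f].equals(b[f])) {
                sum += 1.0;  // simple 0/1 mismatch for discrete features
            }
        }
        return Math.sqrt(sum);
    }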

For Naive Bayes, use Equation 6.22 in Mitchell's textbook, with p set as he describes and m = 30 (an arbitrarily chosen value). Design and justify (in your report)

prob(feature_i = value_j | category)

functions that work with discrete, continuous, and hierarchical features (again, no need to implement code for the hierarchical case).
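
For a discrete feature, the probability can be computed from simple counts. The sketch below assumes Equation 6.22 is Mitchell's m-estimate, (n_c + m*p) / (n + m), with p uniform over the feature's possible values; check the textbook for the exact form and for his suggested choice of p.

    // Sketch of an m-estimate for prob(feature_i = value_j | category), assuming the
    // formula (nC + m*p) / (n + m) with m = 30 and p uniform over the feature's values.
    //   n         = number of training examples in the category
    //   nC        = number of those examples with feature_i == value_j
    //   numValues = number of distinct values feature_i can take
    public static double mEstimate(int nC, int n, int numValues) {
        final double m = 30.0;             // fixed by this assignment
        final double p = 1.0 / numValues;  // uniform prior over the feature's values
        return (nC + m * p) / (n + m);
    }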

Part 3: Creating a Learning Curve for Naive Bayes

Using train/test fold #1, create for Naive Bayes what is called a learning curve. This is a graph where the X-axis is the number of training examples and the Y-axis is the accuracy on the test set (i.e., the estimated future accuracy as a function of the amount of training data). To create this graph, randomize the order of your training examples (you only need to do this once; the results you get will depend on the order of the examples, and it would be better to repeat this process multiple times to get 'error bars'; however, we will not do so in this homework). Create a model using the first 100 training examples, measure the resulting accuracy on the test set, then repeat using the first 200, 300, ..., 900 training examples (if you have fewer than 1000 examples, scale these numbers proportionally).
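
The loop itself is short. The sketch below uses placeholder Example and NaiveBayes classes in place of whatever your own implementation provides.

    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Sketch of the learning-curve loop for fold #1; Example and NaiveBayes are placeholders.
    static void learningCurve(List<Example> trainExamples, List<Example> testExamples) {
        Collections.shuffle(trainExamples, new Random(0));     // randomize the order once
        for (int n = 100; n <= 900; n += 100) {                // scale these sizes if you have fewer examples
            List<Example> prefix = trainExamples.subList(0, Math.min(n, trainExamples.size()));
            NaiveBayes model = NaiveBayes.train(prefix);       // placeholder training call
            double acc = model.accuracy(testExamples);         // accuracy on the fold-1 test set
            System.out.println(n + "\t" + acc);                // one (x, y) point on the curve
        }
    }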

Neatly present and discuss your learning curve.

Part 4: Selecting a Special k

You will use leave-one-out testing within the training set to judge which value of k is best for the nearest-neighbors algorithm. Consider at least these values for k: 1, 3, 5, 15, 25, and 51. For each example in a training set, collect its nearest neighbors (excluding itself) and see if the correct category is predicted. Notice that it will be much more efficient to `simultaneously' compute the predicted category for each value of K, rather than looping through all the examples for each K.

This approach is called `leave one out' since we essentially create N different tuning sets, each of size one (where N is the size of the training set and, hence, N-1 is the size of the `train' set).

For each `tuning' set example, collect the K nearest neighbors, and then take the most common category among these k neighbors. You may deal with breaking ties however you wish, but be sure to describe and justify your approach in your report.

Choose the k that does best in your tuning-set experiment (break any ties by choosing the largest k). After `tuning the K parameter,' categorize the corresponding TESTSET, using the complete TRAINSET as the set of possible neighbors.
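
One way to organize the leave-one-out computation so that every candidate k is scored in a single pass over the training set is sketched below. Example (with a label field), distance, and majorityLabel are placeholders for your own code, and candidateKs is assumed to be sorted in increasing order so that ties go to the larger k.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Sketch: for each training example, sort the remaining examples by distance once,
    // then score every candidate k against that single sorted list.
    static int tuneK(List<Example> train, int[] candidateKs) {     // candidateKs sorted ascending
        int[] correct = new int[candidateKs.length];
        for (Example held : train) {
            List<Example> others = new ArrayList<>(train);
            others.remove(held);                                   // leave this example out
            others.sort(Comparator.comparingDouble((Example e) -> distance(held, e)));
            for (int i = 0; i < candidateKs.length; i++) {
                String predicted = majorityLabel(others.subList(0, candidateKs[i]));  // placeholder vote helper
                if (predicted.equals(held.label)) correct[i]++;
            }
        }
        int best = 0;
        for (int i = 1; i < candidateKs.length; i++) {
            if (correct[i] >= correct[best]) best = i;             // ">=" breaks ties toward the larger k
        }
        return candidateKs[best];
    }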

Part 5: Using Backward Selection to Select Good Features for k-NN

Implement the `backward selection' greedy (i.e., hill-climbing) search algorithm described in Lecture 4 for choosing a good subset of features, and `wrap' this algorithm around your k-NN system. To keep the amount of computation needed low, only use folds #1-3 and the best value of K for each fold that you found in Part 4 above. Set aside (just once) a randomly chosen 20% of your training examples for each fold and use this `tuning' set to score the candidate feature subsets. (I.e., using the candidate set of features and the remaining 80% of trainset 1, compute the accuracy on the tuning set; this is the heuristic function used in the hill-climbing search through the huge space of possible feature subsets.) Notice that you are using a different method for creating the tuning set than you used in Part 4; we are doing this to save CPU cycles as well as to see a different way of creating tuning sets. Depending on the candidate feature set being scored, you'll only `look at' some of the features when performing the distance calculation.
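
A sketch of the greedy backward pass is below. Here tuneSetAccuracy is a placeholder for your own routine that trains k-NN on the 80% split with the given feature subset and measures accuracy on the 20% tuning set, and the stopping rule shown (stop when no single removal improves the tuning-set score) is one reasonable choice; follow whatever variant Lecture 4 describes.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of backward selection: start with all features and repeatedly drop the single
    // feature whose removal most improves tuning-set accuracy, stopping when no removal helps.
    static Set<Integer> backwardSelect(int numFeatures) {
        Set<Integer> current = new HashSet<>();
        for (int f = 0; f < numFeatures; f++) current.add(f);
        double bestScore = tuneSetAccuracy(current);   // placeholder: k-NN accuracy on the 20% tuning set
        boolean improved = true;
        while (improved && current.size() > 1) {
            improved = false;
            Integer bestDrop = null;
            for (Integer f : current) {
                Set<Integer> candidate = new HashSet<>(current);
                candidate.remove(f);
                double score = tuneSetAccuracy(candidate);
                if (score > bestScore) { bestScore = score; bestDrop = f; }
            }
            if (bestDrop != null) { current.remove(bestDrop); improved = true; }
        }
        return current;   // indices of the selected features
    }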

Note that it makes more sense to perform feature selection with instance-based learning than with Naive Bayes, since Naive Bayes already does a `soft' form of feature selection when it computes the ratio

prob(f1=value | pos example) / prob(f1=value | neg example)

For irrelevant features this ratio will be about 1, which has little impact in a product.

Part 6: Running Your Algorithms

Run each learning algorithm (k-NN, Naive Bayes, and k-NN with feature selection) on your training folds, and test each result on the corresponding testing set. For k-NN and Naive Bayes, use all ten folds. As mentioned, for the feature-selection runs, you only need to use the first three folds. Remember that when `tuning parameters' (e.g., k or the features to use), you separately set the parameters for each fold. (Notice that we could have jointly selected k and the features to use, but for CPU-efficiency reasons we chose the setting for one parameter first and then `froze' that choice when selecting a good setting for the other parameter.)

Report the resulting test-set accuracies, as well as the mean and standard deviation for k-NN and Naive Bayes. Also report, for each fold, the chosen value of k and (for folds #1-3) the chosen feature subset.

Part 7: Judging the Statistical Significance of the Differences in Generalization Performance

You should now have two sets of ten generalization (i.e., accuracy on a testset) numbers. The mean accuracy for each of these sets of ten numbers provides a good indication of the expected future accuracy of each approach on your dataset. However, we'd also like to know if the differences in generalization performance are statistically significant.

Consider the following question: could the observed difference in accuracy between the two approaches have arisen by chance?

We'll use a paired t-test to compute the probability that any differences are due to chance (i.e., our null hypothesis is that the two approaches under consideration work equally well). You can use Table 5.6 in Mitchell. Since we have ten test sets, the degrees of freedom is 9, but it is acceptable for the purposes of this homework to use the v=10 row in Table 5.6. Show and explain your work for this calculation.
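
The arithmetic itself is small; a sketch is below (the critical values still come from Table 5.6, not from this code).

    // Sketch of the paired-t statistic over per-fold accuracy differences.
    // accA[i] and accB[i] are the test-set accuracies of the two learners on fold i.
    static double pairedT(double[] accA, double[] accB) {
        int n = accA.length;                            // ten folds here
        double mean = 0.0;
        double[] d = new double[n];
        for (int i = 0; i < n; i++) { d[i] = accA[i] - accB[i]; mean += d[i]; }
        mean /= n;
        double sumSq = 0.0;
        for (int i = 0; i < n; i++) sumSq += (d[i] - mean) * (d[i] - mean);
        double sd = Math.sqrt(sumSq / (n - 1));         // sample standard deviation of the differences
        return mean / (sd / Math.sqrt(n));              // compare |t| to the critical value from Table 5.6
    }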

Discuss the statistical significance of the differences you encountered in your experiments. The general rule of thumb is that a measured difference is considered significant if the probability it happened by chance is at most 5%.

Summary

What to Turn In

  1. Prepare a short report on your implementations and experiments. Be sure to provide all the information requested above and discuss all your experiments. You need only briefly summarize your experimental methodology, since it is documented in this HW writeup.

  2. Hand in a hard-copy of all of the (commented) code you write.

  3. Copy your train and test sets, HW1.java, and any other files needed to create and run your program as well as your Makefile (if you have one) into the following directory:

    ~cs760-1/handin/{your login name}/HW1

    using your actual CS login name in place of {your login name}. Including a README file is encouraged. Your solution will be partially automatically graded. Our semi-automatic grading program will look for HW1.class so your code should contain a main function that takes three arguments:

    HW1 task.names train_examples.data test_examples.data
    Your program should read in the data, then run the three algorithms of this HW using the examples in the train set, printing (via System.out.println) each algorithm's accuracy on the examples in the test set. Also print the selected value for K and the chosen feature subset. (Be aware that we will use our own datasets during grading.) We'll only be calling HW1 on one train/test fold during grading.
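
For reference, one possible shape for that entry point is sketched below. DataSet and the three algorithm classes are placeholders for whatever your own code provides.

    // Sketch of the required HW1 entry point. DataSet, KNearestNeighbor, NaiveBayes,
    // and FeatureSelectedKnn are placeholders for your own classes.
    public class HW1 {
        public static void main(String[] args) {
            String namesFile = args[0];   // task.names
            String trainFile = args[1];   // train_examples.data
            String testFile  = args[2];   // test_examples.data

            DataSet train = DataSet.read(namesFile, trainFile);
            DataSet test  = DataSet.read(namesFile, testFile);

            System.out.println("k-NN accuracy: " + KNearestNeighbor.trainAndTest(train, test));
            System.out.println("Naive Bayes accuracy: " + NaiveBayes.trainAndTest(train, test));
            System.out.println("k-NN + feature selection accuracy: " + FeatureSelectedKnn.trainAndTest(train, test));
            // Also print the selected value of k and the chosen feature subset here.
        }
    }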