CS 760: Homework 1
Experimental Methodology, Feature Selection, k-NN and Naive Bayes
- Assigned: Monday, February 1, 2010.
- Due: Wednesday, February 17, 2010.
- Worth: 150 points
Overview
The primary purpose of this homework is to investigate some typical experimental
methodologies used in machine learning. For simplicity, you will use
nearest neighbor and Naive Bayes as the learning algorithms.
You will learn how to use different cross validation techniques
to estimate parameters for learning algorithms as well as to determine
how well the learning algorithms will perform on new datasets. You will also
learn how you might be able to improve the performance of the learning
algorithms by doing feature selection. Finally, you will use a
statistical test to judge whether the difference in performance
between two learning algorithms is statistically significant.
Part 1: Cross Validating the Expected Future Performance
First set aside your test sets for estimating future accuracy.
It is important that you never look at this data until you are done
tuning your learners.
Be sure you use identical datasets for each algorithm that
you apply to your `personal' dataset (on this HW, as well as later HW's).
Specifically, you will use 10-fold cross validation. This means
you should randomly divide your `personal dataset' into ten disjoint
subsets of (approximately) equal size.
Ten times you will train on nine of these ten disjoint datasets and
then test on the one left out, each time leaving out a different one.
You should create the 10 training sets and the 10 testing sets
via a separate program, and then save these 20 files to disk.
(Notice that it is fine for each algorithm to make its own
tuning sets.) You should also be sure to cleanly separate
your code for reading the *.names and *.data files into data structures
from the code specifically for your HW1 algorithms,
since you'll want to reuse the file-reading
code for the next couple of HWs.
You may wish to create a separate file, say cv.java,
that runs a complete 10-fold cross-validation experiment when its
main method is invoked. It is legal in Java to
have multiple classes with main methods, as long as they are in
separate files.
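The 10-fold partition described above can be sketched as follows. This is only one possible approach, not a required design: the class name FoldSplitter, its method names, and the fixed random seed are all assumptions made for illustration. Examples are represented by their indices; writing the 20 files to disk is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper (not part of the assignment): splits example indices into folds.
class FoldSplitter {
    /** Shuffles indices 0..n-1 and deals them round-robin into numFolds disjoint test folds. */
    static List<List<Integer>> makeFolds(int n, int numFolds, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) folds.add(new ArrayList<>());
        // Round-robin dealing keeps fold sizes within one example of each other.
        for (int i = 0; i < n; i++) folds.get(i % numFolds).add(indices.get(i));
        return folds;
    }

    /** The training set for a fold is every index that is not in that fold's test set. */
    static List<Integer> trainIndices(List<List<Integer>> folds, int testFold) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != testFold) train.addAll(folds.get(f));
        }
        return train;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = makeFolds(103, 10, 760L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("fold " + (f + 1) + ": test=" + folds.get(f).size()
                    + " train=" + trainIndices(folds, f).size());
        }
    }
}
```

Fixing the seed makes the split reproducible, which helps when the same folds must be reused by every algorithm.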
Part 2: Implementing the k Nearest-Neighbor and Naive Bayes Approaches
Implement the k nearest-neighbor algorithm (k-NN)
and the Naive Bayes algorithm.
Design and justify (in your report) a distance function for k-NN
that works with both discrete and continuous features, as well as hierarchical
features (though you don't need to implement your distance function
for hierarchical features).
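As an illustration of what a mixed-feature distance might look like, here is a minimal sketch. It is one candidate design among many, not the expected answer: it assumes discrete values are stored as numeric category codes, uses a 0/1 mismatch for discrete features, and scales continuous differences by the feature's range in the training set. The names MixedDistance, isDiscrete, and range are hypothetical.

```java
// Hypothetical sketch of a distance over mixed discrete and continuous features.
class MixedDistance {
    /**
     * isDiscrete[i] marks discrete features (values are category codes);
     * range[i] is (max - min) of continuous feature i over the training set.
     */
    static double distance(double[] a, double[] b, boolean[] isDiscrete, double[] range) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d;
            if (isDiscrete[i]) {
                d = (a[i] == b[i]) ? 0.0 : 1.0;   // simple match / mismatch
            } else if (range[i] > 0.0) {
                // Scale to [0, 1]; clamp in case a test value falls outside the training range.
                d = Math.min(1.0, Math.abs(a[i] - b[i]) / range[i]);
            } else {
                d = 0.0;                           // constant feature carries no information
            }
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] a = {0, 2.0};
        double[] b = {1, 4.0};
        System.out.println(distance(a, b, new boolean[] {true, false}, new double[] {0.0, 4.0}));
    }
}
```

Scaling every per-feature term to the interval [0, 1] keeps a single wide-ranging continuous feature from dominating the discrete ones; your report should justify whatever choice you make.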
For Naive Bayes,
use Equation 6.22 in Mitchell's textbook, with p set as he
describes and let m=30 (to pick an arbitrary value).
Design and justify (in your report)
prob(feature_i = value_j | category)
functions that work with discrete, continuous, and hierarchical features
(again, no need to implement code for the hierarchical case).
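For the discrete case, Equation 6.22 is the m-estimate of probability, (n_c + m*p) / (n + m), where n is the number of training examples in the category and n_c is the number of those with feature_i = value_j. A minimal sketch follows; the class and method names are hypothetical, and the uniform prior p = 1/(number of values of the feature) reflects our reading of Mitchell's suggested default, so check it against the text.

```java
// Hypothetical sketch of the m-estimate for a discrete feature value.
class MEstimate {
    /** Returns (nc + m * p) / (n + m). */
    static double mEstimate(int nc, int n, double m, double p) {
        return (nc + m * p) / (n + m);
    }

    public static void main(String[] args) {
        int numValues = 4;            // assumed: the feature has 4 possible values
        double p = 1.0 / numValues;   // assumed uniform prior
        double m = 30.0;              // value given in the assignment
        // 3 of 10 examples in this category have the value of interest.
        System.out.println(mEstimate(3, 10, m, p));
    }
}
```

With n = 0 the estimate falls back to the prior p, so a value never seen with a category does not force the whole product to zero.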
Part 3: Creating a Learning Curve for Naive Bayes
Using train/test fold #1, create for Naive Bayes what is called a learning curve.
This is a graph where the X-axis is the number
of training examples and the Y-axis is the accuracy on the test set
(i.e., the estimated future accuracy as a function of the amount of training data).
To create this graph, randomize the order of your training examples (you
only need to do this once; the results you get will depend
on the order of the examples and it would be better to repeat this process
multiple times to get 'error bars' - however, we will not do so
in this homework). Create a model using the first 100 training examples,
measure the resulting accuracy on the test set, then repeat using the first 200, 300, ..., 900
training examples (if you have fewer than 1000 examples, scale these numbers proportionally).
Neatly present and discuss your learning curve.
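The learning-curve loop can be sketched as below. The names LearningCurve, trainingSizes, and curve are hypothetical, and trainAndScore stands in for your own code that trains Naive Bayes on a prefix of the shuffled training examples and returns test-set accuracy. The proportional scaling rule (multiply each size by totalExamples/1000) is one interpretation of the instructions, not the only one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of the learning-curve procedure.
class LearningCurve {
    /** Sizes 100, 200, ..., 900, scaled down when the dataset has fewer than 1000 examples. */
    static int[] trainingSizes(int totalExamples) {
        double scale = Math.min(1.0, totalExamples / 1000.0);
        int[] sizes = new int[9];
        for (int i = 1; i <= 9; i++) sizes[i - 1] = (int) Math.round(i * 100 * scale);
        return sizes;
    }

    /** Trains on the first sizes[i] examples of the already shuffled list and records the score. */
    static <E> double[] curve(List<E> shuffledTrain, int[] sizes,
                              ToDoubleFunction<List<E>> trainAndScore) {
        double[] accuracy = new double[sizes.length];
        for (int i = 0; i < sizes.length; i++) {
            int n = Math.min(sizes[i], shuffledTrain.size());
            accuracy[i] = trainAndScore.applyAsDouble(shuffledTrain.subList(0, n));
        }
        return accuracy;
    }

    public static void main(String[] args) {
        List<Integer> train = new ArrayList<>();
        for (int i = 0; i < 450; i++) train.add(i);
        // Placeholder scorer: reports the prefix size instead of a real accuracy.
        double[] result = curve(train, trainingSizes(500), prefix -> prefix.size());
        for (double r : result) System.out.println(r);
    }
}
```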
Part 4: Selecting a Special k
You will use leave-one-out testing within the training set
to judge which value of
k is best for the nearest-neighbors algorithm.
Consider at least these values for k: 1, 3, 5, 15, 25, and 51.
For each example in a training set, collect its nearest neighbors
(excluding itself) and see if the correct category is predicted.
Notice that it will be much more efficient to `simultaneously' compute
the predicted category for each value of K, rather than
looping through all the examples for each K.
This approach is called `leave one out' since we essentially
create N different tuning sets, each of size one
(where N is the size of the training set
and, hence, N-1 is the size of the train' set).
For each `tuning' set example, collect the K nearest neighbors,
and then take
the most common category among these k neighbors.
You may deal with breaking ties however you wish, but be sure
to describe and justify your approach in your report.
Choose the k that does best in your tuning-set experiment
(break any ties by choosing the largest k). After
`tuning the K parameter,' categorize the corresponding
TESTSET, using the complete TRAINSET as the set of possible neighbors.
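The 'simultaneous' leave-one-out evaluation described above can be sketched as follows. This is an illustration under several assumptions, not a reference solution: features are all numeric with plain Euclidean distance, class labels are integers 0..numClasses-1, the candidate k values are strictly ascending, and voting ties go to the lowest class index (an arbitrary choice you would need to justify or replace). All names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: score every candidate k in one pass per left-out example.
class LeaveOneOutK {
    /** correct[c] = number of training examples classified correctly with k = ks[c]. */
    static int[] looCorrect(double[][] x, int[] y, int numClasses, int[] ks) {
        int n = x.length;
        int[] correct = new int[ks.length];
        for (int i = 0; i < n; i++) {
            double[] dist = new double[n];
            Integer[] order = new Integer[n - 1];
            int pos = 0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;               // leave example i out
                dist[j] = euclidean(x[i], x[j]);
                order[pos++] = j;
            }
            Arrays.sort(order, (a, b) -> Double.compare(dist[a], dist[b]));
            // Walk the sorted neighbors once; record a prediction each time a candidate k is reached.
            int[] votes = new int[numClasses];
            int next = 0;
            for (int r = 0; r < order.length && next < ks.length; r++) {
                votes[y[order[r]]]++;
                if (r + 1 == ks[next]) {
                    if (argmax(votes) == y[i]) correct[next]++;
                    next++;
                }
            }
        }
        return correct;
    }

    /** Picks the k with the most correct predictions; ties go to the largest k. */
    static int bestK(int[] ks, int[] correct) {
        int best = 0;
        for (int c = 1; c < ks.length; c++) {
            if (correct[c] >= correct[best]) best = c;
        }
        return ks[best];
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    static int argmax(int[] v) {
        int best = 0;
        for (int i = 1; i < v.length; i++) if (v[i] > v[best]) best = i;
        return best;
    }
}
```

Any candidate k that is larger than N-1 is never scored in this sketch and keeps a count of zero.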
Part 5: Using Backward Selection to Select Good Features for k-NN
Implement the `backward selection' greedy (i.e., hill-climbing) search algorithm described in Lecture 4
for choosing a good subset of features, and `wrap' this algorithm
around your k-NN system. To keep the amount of computation needed low,
only use folds #1-3
and the best values for K for each fold that you found in Part 4 above.
Set aside (just once) a randomly chosen 20%
of your training examples for each fold and use
this `tuning' set to score the candidate feature subsets. (I.e., using the candidate
set of features and the remaining 80% of trainset 1, compute the accuracy on the tuning set;
this is the heuristic function used in the hill-climbing search
through the huge space of possible feature subsets.) Notice that
you are using a different method for creating the tuning set
than you used in Part 4; we are doing this to save cpu cycles as well
as to see a different way of creating tuning sets.
Depending on the candidate
feature set being scored, you'll only `look at' some of the
features when performing the distance calculation.
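A greedy backward-selection loop might look like the sketch below. The details of the Lecture 4 version are not reproduced here, so treat the stopping rule as an assumption: this sketch removes the single feature whose removal gives the best tuning-set score, accepts a removal that ties the current score, and stops when every removal would lower the score or only one feature is left. The score argument is a hypothetical hook for your own code that runs k-NN on the 80% split with the given feature mask and returns tuning-set accuracy.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of greedy backward feature selection.
class BackwardSelection {
    /** Returns a mask with true for each feature that is kept. */
    static boolean[] select(int numFeatures, ToDoubleFunction<boolean[]> score) {
        boolean[] current = new boolean[numFeatures];
        Arrays.fill(current, true);                 // start with all features
        double currentScore = score.applyAsDouble(current);
        int active = numFeatures;
        while (active > 1) {
            int bestDrop = -1;
            double bestScore = currentScore;
            for (int f = 0; f < numFeatures; f++) {
                if (!current[f]) continue;
                current[f] = false;                 // try removing feature f
                double s = score.applyAsDouble(current);
                current[f] = true;                  // undo the trial removal
                if (s >= bestScore) {               // assumed rule: ties favor fewer features
                    bestScore = s;
                    bestDrop = f;
                }
            }
            if (bestDrop < 0) break;                // every removal hurts: local optimum
            current[bestDrop] = false;
            currentScore = bestScore;
            active--;
        }
        return current;
    }
}
```

Because the mask is passed to the scorer, the distance calculation only needs to skip features whose entry is false.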
Note that it makes more sense to perform feature selection
with instance-based learning than for Naive Bayes, since Naive Bayes does
a `soft' form of feature selection already when it computes the ratio
prob(f1=value | pos example) / prob(f1=value | neg example)
For irrelevant features this ratio will be about 1, which has little impact in a product.
Part 6: Running Your Algorithms
Run each learning algorithm (k-NN, Naive Bayes, and k-NN with
feature selection)
on your training folds, and test each result on the corresponding
testing set. For k-NN and Naive Bayes, use all ten folds. As mentioned, for
the feature-selection runs, you only need to use the first three folds.
Remember that when `tuning parameters' (e.g., k or the features to use),
you separately set the parameters for each fold.
(Notice that we could have jointly selected k and the features to use,
but for cpu-efficiency reasons we chose the setting for one parameter first and then
`froze' that choice when selecting a good setting for the other parameter.)
Report the resulting test-set accuracies, as well as
the mean and standard deviation for k-NN and Naive Bayes.
Also report, for each fold, the chosen value of k
and (for folds #1-3) the chosen feature subset.
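The mean and standard deviation of the per-fold accuracies can be computed as in this small sketch. The assignment does not say which form of the standard deviation to use; the sample form (dividing by n-1) is assumed here. The class and method names are hypothetical.

```java
// Hypothetical helper for summarizing per-fold test-set accuracies.
class FoldStats {
    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) sum += d;
        return sum / v.length;
    }

    /** Sample standard deviation (divides by n - 1); requires at least two values. */
    static double sampleStdDev(double[] v) {
        double m = mean(v);
        double ss = 0.0;
        for (double d : v) ss += (d - m) * (d - m);
        return Math.sqrt(ss / (v.length - 1));
    }

    public static void main(String[] args) {
        double[] accuracies = {1, 2, 3, 4, 5};   // placeholder numbers, not real results
        System.out.println(mean(accuracies) + " " + sampleStdDev(accuracies));
    }
}
```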
Part 7: Judging the Statistical Significance of the Differences in Generalization Performance
You should now have two sets of ten generalization (i.e.,
accuracy on a testset) numbers. The mean accuracy for each of these
sets of ten numbers provides a good indication of the expected future
accuracy of each approach on your dataset. However, we'd also like to
know if the differences in generalization performance are
statistically significant.
Consider the following question:
- Does k-NN work better than Naive Bayes on your `personal dataset?'
We'll use a paired t-test to compute the
probability that any differences are due to chance (i.e., our null
hypothesis is that the two approaches under consideration work
equally well). You can use Table 5.6 in Mitchell. Since we have ten
test sets, the degrees of freedom is 9, but it is acceptable
for the purposes of this homework to use the v=10 row in
Table 5.6. Show and explain your work for this calculation.
Discuss the statistical significance of the differences you encountered in
your experiments. The general rule of thumb is that a measured difference
is considered significant if the probability it happened by chance is
at most 5%.
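The paired t statistic can be computed as in the following sketch, which uses the standard formula t = mean(d) / sqrt( sum((d_i - mean(d))^2) / (n(n-1)) ), where d_i is the per-fold accuracy difference. The class and method names are hypothetical. The statistic is then compared against the critical value from the table; in standard t tables the two-tailed 5% value is about 2.262 for 9 degrees of freedom and about 2.228 for 10, but confirm against Table 5.6.

```java
// Hypothetical sketch of the paired t statistic over per-fold accuracies.
class PairedTTest {
    /** a[i] and b[i] are the two algorithms' accuracies on the same test fold i. */
    static double tStatistic(double[] a, double[] b) {
        int n = a.length;
        double meanDiff = 0.0;
        for (int i = 0; i < n; i++) meanDiff += a[i] - b[i];
        meanDiff /= n;
        double ss = 0.0;
        for (int i = 0; i < n; i++) {
            double dev = (a[i] - b[i]) - meanDiff;
            ss += dev * dev;
        }
        // Standard error of the mean difference.
        double stdErr = Math.sqrt(ss / (n * (n - 1.0)));
        return meanDiff / stdErr;   // infinite or NaN if all differences are identical
    }

    public static void main(String[] args) {
        double[] a = {2, 4, 6, 8, 10};   // placeholder numbers, not real accuracies
        double[] b = {1, 2, 3, 4, 5};
        System.out.println(tStatistic(a, b));
    }
}
```

The difference is judged significant at the 5% level when the absolute value of t exceeds the critical value for the chosen row.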
Summary
- Divide your dataset into train and test sets for 10-fold cross validation.
- Implement k-NN with leave-one-out testing within the training set to judge which value of k is best for the nearest-neighbors algorithm.
- Implement Naive Bayes.
- Create a learning curve for Naive Bayes.
- Implement k-NN with feature selection.
- Judge statistical significance between k-NN performance and Naive
Bayes performance using a two-tailed, paired t-test.
What to Turn In
- Prepare a short report on your implementations and experiments.
Be sure to provide all the information requested above and discuss all your experiments.
You need only briefly summarize your experimental methodology, since it is
documented in this HW writeup.
- Hand in a hard-copy of all of the (commented) code you write.
- Copy your train and test sets, HW1.java, and any
other files needed to create and run your program
as well as your Makefile (if you have one) into the following directory:
~cs760-1/handin/{your login name}/HW1
using your actual CS login name in place of {your login name}.
Including a README file is encouraged.
Your solution will be graded semi-automatically. Our
grading program will look for HW1.class, so
your code should contain a main method that
takes three arguments:
HW1 task.names train_examples.data test_examples.data
Your program should read in the data, then run the three
algorithms of this HW using the examples in the train set,
printing (via System.out.println)
each algorithm's accuracy on the examples in the test set.
Also print the selected value for K and the chosen feature subset.
(Be aware that we will use our own datasets during grading.)
We'll only be calling HW1 on
one train/test fold during grading.