Note: this is for Prof. Shavlik's CS 540 section.
Often in machine learning, one randomly divides the training data into three subsets called the training, tuning, and testing sets. (Sometimes the tuning set is called a validation set or, especially in decision-tree induction, a pruning set.) One first uses the training set to learn an initial model, then the tuning set to address overfitting, and finally the testing set to estimate how accurately the 'tuned' result will work in the future. Usually this entire process is repeated multiple times in order to get a statistically sound estimate of future accuracy. However, in this problem we will only address training and testing, and we will only go through this process once. (Plus, we're only using an unrealistically small sample in order to keep this simple. So in Part 1 focus on the algorithm rather than the intelligence of the results. In Part 3, we'll be using a "real world" dataset.)
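A random train/tune/test split like the one described above can be sketched in Java as follows. This is only an illustrative sketch: the class name, the 60/20/20 ratios, and the fixed random seed are our own choices, not something the assignment specifies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: the 60/20/20 ratios, class name, and fixed
// seed are our own choices, not something the assignment specifies.
class DataSplitter {
    // Shuffle the examples, then cut them into train / tune / test parts.
    static List<List<String>> split(List<String> examples, long seed) {
        List<String> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));
        int trainEnd = (int) (0.6 * shuffled.size()); // 60% for training
        int tuneEnd  = (int) (0.8 * shuffled.size()); // next 20% for tuning
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainEnd)));
        parts.add(new ArrayList<>(shuffled.subList(trainEnd, tuneEnd)));
        parts.add(new ArrayList<>(shuffled.subList(tuneEnd, shuffled.size())));
        return parts;
    }
}
```

Shuffling before cutting is what makes the three subsets (approximately) identically distributed; a fixed seed keeps the split reproducible across runs.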
In lecture we will cover using a tuning set to choose a good pruned tree, in order to reduce the odds of overfitting. This is a topic that might be on the midterm or final exam, so we recommend you create a tuning set for this task and decide which pruned tree would be chosen. Do not turn in anything about tree pruning for grading, though.
Assume you are using the following features to represent examples:
COLOR possible values: Red, Green, Blue
AGE possible values: Young, Old
WEIGHT possible values: Light, Medium, Heavy

(Since each feature value starts with a different letter, for shorthand we'll just use that initial letter, e.g., 'R' for Red.)
Our task will be binary valued, and we'll use '+' and '-' as our category labels.
Here is our TRAIN set:
COLOR = R  AGE = O  WEIGHT = H  CATEGORY = +
COLOR = B  AGE = Y  WEIGHT = L  CATEGORY = +
COLOR = G  AGE = Y  WEIGHT = L  CATEGORY = +
COLOR = R  AGE = Y  WEIGHT = H  CATEGORY = +
COLOR = G  AGE = O  WEIGHT = L  CATEGORY = -
COLOR = G  AGE = Y  WEIGHT = L  CATEGORY = -
COLOR = B  AGE = O  WEIGHT = H  CATEGORY = -
And our TEST set:
COLOR = B  AGE = Y  WEIGHT = H  CATEGORY = +
COLOR = G  AGE = Y  WEIGHT = L  CATEGORY = +
COLOR = B  AGE = O  WEIGHT = L  CATEGORY = -
COLOR = R  AGE = Y  WEIGHT = M  CATEGORY = -
COLOR = R  AGE = O  WEIGHT = L  CATEGORY = -
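For debugging, it can help to hard-code the TRAIN set above in a simple data structure and check intermediate results against your hand computations. One possible encoding is sketched below; it is hypothetical and not required by (or taken from) the provided code.

```java
// Hypothetical encoding of the examples above for debugging; the provided
// BuildAndTestDecisionTree.java may use different data structures.
class Example {
    final char color, age, weight; // single-letter shorthand values
    final boolean positive;        // true for '+', false for '-'

    Example(char color, char age, char weight, boolean positive) {
        this.color = color;
        this.age = age;
        this.weight = weight;
        this.positive = positive;
    }

    // The seven TRAIN-set examples listed above.
    static Example[] trainSet() {
        return new Example[] {
            new Example('R', 'O', 'H', true),
            new Example('B', 'Y', 'L', true),
            new Example('G', 'Y', 'L', true),
            new Example('R', 'Y', 'H', true),
            new Example('G', 'O', 'L', false),
            new Example('G', 'Y', 'L', false),
            new Example('B', 'O', 'H', false),
        };
    }
}
```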
When multiple features tie as being the best one, choose the one whose name appears earliest in alphabetical order (e.g., AGE before COLOR before WEIGHT). When there is a tie in computing MajorityValue, choose '-'.
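The two tie-breaking rules above can be made concrete with a short sketch; the class and method names here are ours, and the feature scores are assumed to come from whatever splitting criterion (e.g., information gain) your learner uses.

```java
import java.util.List;

// Sketch of the two tie-breaking rules above (class and method names are ours).
class TieBreaking {
    // MajorityValue over '+'/'-' labels; a tie goes to '-'.
    static char majorityValue(List<Character> labels) {
        long pos = labels.stream().filter(c -> c == '+').count();
        long neg = labels.size() - pos;
        return pos > neg ? '+' : '-'; // on a tie, pos > neg is false, so '-'
    }

    // Given feature names and their scores, pick the highest-scoring feature,
    // breaking score ties in favor of the alphabetically earliest name.
    static String bestFeature(String[] names, double[] scores) {
        int best = 0;
        for (int i = 1; i < names.length; i++) {
            if (scores[i] > scores[best]
                    || (scores[i] == scores[best]
                        && names[i].compareTo(names[best]) < 0)) {
                best = i;
            }
        }
        return names[best];
    }
}
```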
We will not be using a tuning set in this part of the homework. In HW2 we will investigate an alternate method for reducing overfitting in decision trees.
Recall the 'real world' dataset about wine from HW0. It involves predicting whether or not a given wine is highly rated. We have divided it into a training set and a testing set.
A second sample dataset, one used in early machine-learning research, is also available. We have divided it into a training set and a testing set. You should not turn in anything related to this dataset, but you might want to use it for debugging or to get more experience with decision-tree induction. (Ditto for the Titanic dataset provided for HW0.) The voting dataset is based on actual votes in the US House of Representatives in the 1980s. The task is: given a representative's voting record on the 16 chosen bills, was the representative a Democrat or a Republican? More information is available via the UC-Irvine archive of machine-learning datasets. The voting dataset is one of the few that only involves binary-valued features. Also, hopefully everyone has some intuitions about the task domain.
We also recommend you create two or three simple, 'meaning-free' datasets for debugging, ones where you can compute the correct answer by hand (you might even want to use your code to check your answer to Problem 1 of this homework!). By 'meaning-free' we mean simply calling the features F1, F2, etc. Also consider looking at some old CS 540 midterm exams for simple datasets.
We have provided some code (to be released 9/15/16) that reads the data files into some Java data structures. PLEASE DO NOT LOOK AT THIS FILE UNTIL YOU HAVE TURNED IN YOUR HW0. See BuildAndTestDecisionTree.java. You're welcome to use any or all of this provided code. The only requirement is that you create a BuildAndTestDecisionTree class, whose calling convention is as follows:
java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename>

Note that you can provide the SAME file name for BOTH training and testing to see how well your code 'fit' the training data (it should get them all correct except for the 'extreme noise' case that was discussed in class). Accuracy on the training set is not of much interest, but it can help during debugging.
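A minimal skeleton matching this calling convention might look like the following. The class is renamed here (with a Skeleton suffix) so it does not clash with the provided file, and the helper methods sketched in comments (readExamples, buildTree, reportAccuracy) are hypothetical placeholders for code you must write yourself.

```java
// Minimal skeleton matching the required calling convention. The helper
// methods sketched in comments (readExamples, buildTree, reportAccuracy)
// are hypothetical placeholders, not part of the provided code.
class BuildAndTestDecisionTreeSkeleton {
    // The convention requires exactly two arguments: train file, test file.
    static boolean validArgs(String[] args) {
        return args.length == 2;
    }

    public static void main(String[] args) {
        if (!validArgs(args)) {
            System.err.println(
                "Usage: java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename>");
            return;
        }
        String trainFile = args[0];
        String testFile  = args[1];
        // Example[] train = readExamples(trainFile);    // placeholder
        // Node tree = buildTree(train);                 // placeholder
        // reportAccuracy(tree, readExamples(testFile)); // placeholder
    }
}
```

Passing the same file name for both arguments, as suggested above, is a quick sanity check that your tree fits its own training data.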
See BuildAndTestDecisionTree.java for more details. Do be aware that we will test your solutions on additional datasets beyond those we provided.
Here is what you need to do:
Place your written answers to Problems 1 and 2 in HW1_P1_P2.pdf. Also turn in your commented Java code in BuildAndTestDecisionTree.java and a neatly written lab report (in HW1_P2.pdf) that includes the material requested above - e.g., the printed decision tree for the wine testbed, as well as the test-set accuracies and the names of the misclassified test-set examples. Please aim to make it easy for the TA to find the requested information in what your code prints out when run. (You might want to use a "debug" flag in your code that defaults to false but, when manually set to true, causes more information to be printed.)
Be sure to briefly discuss in your lab report for Problem 3 the decision tree learned on the wine dataset. Does it make sense? I.e., did it seem to learn something general, or did it only 'memorize' the training examples? Is it reasonable that it made the test-set errors it did (if any)? Do not discuss EVERY error if there are more than two; in that case, just discuss two test-set errors: one 'false positive' (a negative example called positive by the decision tree) and one 'false negative' (a positive example called negative by the learned decision tree). Most likely you will have more than one test-set error of each type. If not, discuss any two errors.