Note: this is for Prof. Shavlik's CS 540 section.
Often in machine learning, one randomly divides his or her training data into three subsets called the training, tuning, and testing sets. (Sometimes the tuning set is called a validation set or, especially in decision-tree induction, a pruning set.) One first uses the training set to initially learn, then the tuning set to address overfitting, and finally the testing set to estimate how accurately the 'tuned' result will work in the future. Usually, this entire process is repeated multiple times in order to get a statistically sound estimate of future accuracy. However, in this problem we'll only go through this process once. (Plus, we're only using an unrealistically small sample in order to keep this simple. So in Part 1, focus on the algorithm rather than the intelligence of the results. In Part 2, we'll be using a "real world" dataset.)
Assume you are using the following features to represent examples:
SHAPE possible values: Circle, Ellipse, Square, Triangle
COLOR possible values: Red, Blue
SIZE possible values: Medium, Large, Huge
(Since each feature value starts with a different letter, for shorthand
we'll just use that initial letter, eg 'C' for Circle.)
Our task will be binary valued, and we'll use '+' and '-' as our category labels.
Here is our TRAIN set:
SHAPE = S COLOR = R SIZE = L CATEGORY = +
SHAPE = C COLOR = R SIZE = H CATEGORY = +
SHAPE = C COLOR = B SIZE = H CATEGORY = +
SHAPE = T COLOR = R SIZE = L CATEGORY = +
SHAPE = S COLOR = B SIZE = M CATEGORY = -
SHAPE = E COLOR = B SIZE = L CATEGORY = -
SHAPE = C COLOR = R SIZE = M CATEGORY = -
Our TUNE set:
SHAPE = C COLOR = B SIZE = L CATEGORY = +
SHAPE = C COLOR = B SIZE = H CATEGORY = +
SHAPE = E COLOR = R SIZE = L CATEGORY = -
SHAPE = S COLOR = R SIZE = H CATEGORY = -
SHAPE = S COLOR = R SIZE = M CATEGORY = -
And our TEST set:
SHAPE = C COLOR = B SIZE = H CATEGORY = +
SHAPE = C COLOR = R SIZE = L CATEGORY = +
SHAPE = C COLOR = B SIZE = L CATEGORY = +
SHAPE = E COLOR = B SIZE = M CATEGORY = -
SHAPE = S COLOR = R SIZE = L CATEGORY = -
When multiple features tie as being the best one, choose the one whose name appears earliest in alphabetical order (eg, COLOR before SHAPE before SIZE). When there is a tie in computing MajorityValue, choose '-'.
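To make the two tie-breaking rules concrete, here is a minimal Java sketch of the entropy and information-gain calculations that ID3 relies on, applied to the TRAIN set above. The class name InfoGainSketch, the column layout, and the EPS tolerance for deciding that two gains are tied are our own assumptions for illustration; none of them come from the provided code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class InfoGainSketch {
    // Column layout assumed for this sketch: SHAPE, COLOR, SIZE, CATEGORY.
    static final String[] FEATURES = {"SHAPE", "COLOR", "SIZE"};
    static final int CATEGORY = 3;
    static final double EPS = 1e-9; // assumed tolerance for treating two gains as tied

    // The TRAIN set from the text above.
    static final String[][] TRAIN = {
        {"S", "R", "L", "+"}, {"C", "R", "H", "+"}, {"C", "B", "H", "+"},
        {"T", "R", "L", "+"}, {"S", "B", "M", "-"}, {"E", "B", "L", "-"},
        {"C", "R", "M", "-"}
    };

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy (in bits) of a set with pos '+' examples and neg '-' examples.
    static double entropy(int pos, int neg) {
        if (pos == 0 || neg == 0) return 0.0;
        double p = pos / (double) (pos + neg);
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // Information gain of splitting the examples on feature column f.
    static double gain(String[][] examples, int f) {
        Map<String, int[]> counts = new LinkedHashMap<>(); // value -> {pos, neg}
        int pos = 0, neg = 0;
        for (String[] ex : examples) {
            int[] c = counts.computeIfAbsent(ex[f], k -> new int[2]);
            if (ex[CATEGORY].equals("+")) { c[0]++; pos++; } else { c[1]++; neg++; }
        }
        double remainder = 0.0;
        for (int[] c : counts.values()) {
            remainder += (c[0] + c[1]) / (double) examples.length * entropy(c[0], c[1]);
        }
        return entropy(pos, neg) - remainder;
    }

    // Index of the highest-gain feature; ties go to the alphabetically earliest name.
    static int bestFeature(String[][] examples) {
        int best = -1;
        double bestGain = Double.NEGATIVE_INFINITY;
        for (int f = 0; f < FEATURES.length; f++) {
            double g = gain(examples, f);
            boolean better = g > bestGain + EPS;
            boolean tie = !better && Math.abs(g - bestGain) <= EPS;
            if (better || (tie && FEATURES[f].compareTo(FEATURES[best]) < 0)) {
                best = f;
                bestGain = g;
            }
        }
        return best;
    }

    // MajorityValue: returns '+' only on a strict majority, so ties go to '-'.
    static String majorityValue(String[][] examples) {
        int pos = 0;
        for (String[] ex : examples) if (ex[CATEGORY].equals("+")) pos++;
        return pos > examples.length - pos ? "+" : "-";
    }

    public static void main(String[] args) {
        for (int f = 0; f < FEATURES.length; f++) {
            System.out.println(FEATURES[f] + " gain = " + gain(TRAIN, f));
        }
        System.out.println("split on " + FEATURES[bestFeature(TRAIN)]);
    }
}
```

Comparing gains with a small tolerance rather than with == is a design choice: two features that are mathematically tied can differ in the last floating-point bit, and the alphabetical rule should still apply in that case.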
Let bestTree = the tree produced by ID3 on the TRAINING set
Let bestAccuracy = the accuracy of bestTree on the TUNING set
Let progressMade = true
while (progressMade) // Continue as long as improvement on TUNING SET
{
    Set progressMade = false
    Let currentTree = bestTree
    For each interiorNode N (including the root) in currentTree
    {   // Consider various pruned versions of the current tree
        // and see if any are better than the best tree found so far
        Let prunedTree be a copy of currentTree,
            except replace N by a leaf node
            whose label equals the majority class among TRAINING set
            examples that reached node N (break ties in favor of '-')
        Let newAccuracy = accuracy of prunedTree on the TUNING set
        // Is this pruned tree an improvement, based on the TUNE set?
        // When a tie, go with the smaller tree (Occam's Razor).
        If (newAccuracy >= bestAccuracy)
        {
            bestAccuracy = newAccuracy
            bestTree = prunedTree
            progressMade = true
        }
    }
}
return bestTree
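The pseudocode above could be sketched in Java roughly as follows. This is only one possible rendering. The Node class, the helper names, and the choice to fall back to a node's majority label when an example has a feature value with no matching branch are all our assumptions. The demo tree is hand-built and hypothetical (it is deliberately NOT the tree ID3 produces), so it does not answer Part 1; the TUNE examples are the ones listed earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PruneSketch {
    static final int SHAPE = 0, COLOR = 1, SIZE = 2, CATEGORY = 3;

    static class Node {
        final int feature;     // column tested at this node; -1 marks a leaf
        final String majority; // majority TRAIN label here; also a leaf's prediction
        final Map<String, Node> children = new LinkedHashMap<>();
        Node(int feature, String majority) { this.feature = feature; this.majority = majority; }
        boolean isLeaf() { return feature < 0; }
        Node with(String value, Node child) { children.put(value, child); return this; }
    }

    static Node leaf(String label) { return new Node(-1, label); }

    static String classify(Node n, String[] ex) {
        while (!n.isLeaf()) {
            Node next = n.children.get(ex[n.feature]);
            if (next == null) return n.majority; // assumption: unseen value -> majority label
            n = next;
        }
        return n.majority;
    }

    static double accuracy(Node tree, String[][] examples) {
        int correct = 0;
        for (String[] ex : examples) if (classify(tree, ex).equals(ex[CATEGORY])) correct++;
        return (double) correct / examples.length;
    }

    // Collects interior nodes (including the root) in preorder.
    static void interiorNodes(Node n, List<Node> out) {
        if (n.isLeaf()) return;
        out.add(n);
        for (Node c : n.children.values()) interiorNodes(c, out);
    }

    // Deep copy of the tree, except that target becomes a leaf with its majority label.
    static Node copyReplacing(Node n, Node target) {
        if (n == target) return leaf(n.majority);
        Node copy = new Node(n.feature, n.majority);
        for (Map.Entry<String, Node> e : n.children.entrySet()) {
            copy.children.put(e.getKey(), copyReplacing(e.getValue(), target));
        }
        return copy;
    }

    static Node prune(Node tree, String[][] tune) {
        Node bestTree = tree;
        double bestAccuracy = accuracy(bestTree, tune);
        boolean progressMade = true;
        while (progressMade) {
            progressMade = false;
            Node currentTree = bestTree;
            List<Node> candidates = new ArrayList<>();
            interiorNodes(currentTree, candidates);
            for (Node n : candidates) {
                Node prunedTree = copyReplacing(currentTree, n);
                double newAccuracy = accuracy(prunedTree, tune);
                if (newAccuracy >= bestAccuracy) { // ties favor the smaller tree
                    bestAccuracy = newAccuracy;
                    bestTree = prunedTree;
                    progressMade = true;
                }
            }
        }
        return bestTree;
    }

    // The TUNE set from the text above.
    static final String[][] TUNE = {
        {"C", "B", "L", "+"}, {"C", "B", "H", "+"}, {"E", "R", "L", "-"},
        {"S", "R", "H", "-"}, {"S", "R", "M", "-"}
    };

    // Hypothetical hand-built tree, used only to exercise the loop.
    static Node demoTree() {
        Node sNode = new Node(COLOR, "-").with("R", leaf("+")).with("B", leaf("-"));
        return new Node(SHAPE, "+").with("C", leaf("+")).with("E", leaf("-"))
                                   .with("S", sNode).with("T", leaf("+"));
    }

    public static void main(String[] args) {
        Node tree = demoTree();
        System.out.println("tuning accuracy before pruning: " + accuracy(tree, TUNE));
        System.out.println("tuning accuracy after pruning:  " + accuracy(prune(tree, TUNE), TUNE));
    }
}
```

The loop terminates because every accepted candidate has strictly fewer interior nodes than currentTree, so the tree can only shrink until no pruned version matches or beats the best tuning-set accuracy.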
Apply the above pruning algorithm to the tree you produced in Part 1a.
Show your work (eg, show the intermediate trees considered and their
tuning-set accuracies).
We will not be using a tuning set in this part of the homework.
We are providing a 'real world' dataset for your use. It involves the disease hepatitis, and the task is to predict whether or not a given patient survived. We've divided it into a training set and a testing set. Our dataset is derived from this one. We converted all features to be binary-valued and we replaced missing feature values in a simplistic manner.
Use Wikipedia or an on-line search engine to find out the meaning of those features whose names are unfamiliar to you. This page might help. Do image searches at your own risk :-) Do not turn in feature definitions, but knowing a little about them will help you interpret the decision trees learned. (The cs540 exams will not ask any questions about hepatitis.)
A second sample dataset, one used in early machine-learning research, is also available. We've divided it into a training set and a testing set. You should not turn in anything related to this dataset, but you might want to use it for debugging or to get more experience with decision-tree induction. (Ditto for the Titanic dataset provided for HW0.) The voting dataset is based on actual votes in the US House of Representatives in the 1980s. The task is: given a representative's voting record on the 16 chosen bills, was the representative a Democrat or a Republican? More information is available via the UC-Irvine archive of machine-learning datasets. The voting dataset is one of the few that only involves binary-valued features. Also, hopefully everyone has some intuitions about the task domain.
THE FOLLOWING JAVA FILE IS CLOSELY RELATED TO HW0 AND SO WILL NOT BE PROVIDED UNTIL AFTER CLASS FEB 6; IE, SOON AFTER THE LATEST TIME (10:50AM 2/6/13) THAT HW0 CAN BE TURNED IN.
We have provided some code that reads the data files into some Java data structures. See BuildAndTestDecisionTree.java. You're welcome to use any or all of this provided code. The only requirement is that you create a BuildAndTestDecisionTree class, whose calling convention is as follows:
java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename>
Note that you can provide the SAME file name for BOTH training and testing to see how well your code 'fit' the training data. Accuracy on the training set is not of much interest, but it can help during debugging.
See BuildAndTestDecisionTree.java for more details. Do be aware that we will test your solutions on additional datasets beyond the two we provided.
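As a small illustration of the calling convention, the argument check might look something like the sketch below. The class name CallingConventionSketch and the parseArgs helper are hypothetical names of ours; the provided BuildAndTestDecisionTree.java may handle its arguments differently, so treat this only as a reminder that exactly two file names are expected.

```java
class CallingConventionSketch {
    // Returns {trainsetFilename, testsetFilename}, or throws if the count is wrong.
    static String[] parseArgs(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                "Usage: java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename>");
        }
        return new String[] { args[0], args[1] };
    }

    public static void main(String[] args) {
        try {
            String[] files = parseArgs(args);
            // Passing the same file twice is allowed (useful when debugging).
            System.out.println("train = " + files[0] + ", test = " + files[1]);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
            System.exit(1);
        }
    }
}
```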
Here is what you need to do:
Place your written answer to Problem 1 in HW1_P1.pdf. Also turn in your commented Java code in BuildAndTestDecisionTree.java and a neatly written lab report (in HW1_P2.pdf) that includes the material requested above - eg, printed decision trees for the hepatitis testbed, as well as the test-set accuracies and names of misclassified test-set examples. Please aim to make it easy for the TA to find the requested information in what your code prints out when run. (You might want to use a "debug" flag in your code that is set to false by default but when manually set to true more information is printed.)
Be sure to briefly discuss in your lab report for Problem 2 the decision tree learned. Does it make sense? I.e., did it seem to learn something general or did it only 'memorize' the training examples? Is it reasonable that it made the test-set errors it did (if any)? Do not discuss EVERY error if there are more than two; in that case, just discuss two test-set errors. (Most likely you will have something like a half dozen test-set errors.)
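One way to produce the requested test-set accuracy and the names of misclassified examples, with a "debug" flag that is off by default, is sketched below. The class name ReportSketch, the example names, and the '+'/'-' label strings are placeholders of ours; they are not taken from the hepatitis files or from the provided code.

```java
import java.util.ArrayList;
import java.util.List;

class ReportSketch {
    static boolean DEBUG = false; // set to true by hand to print every prediction

    // Names of the examples whose predicted label differs from the actual label.
    static List<String> misclassified(String[] names, String[] actual, String[] predicted) {
        List<String> wrong = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            if (DEBUG) {
                System.out.println(names[i] + ": actual=" + actual[i]
                                   + " predicted=" + predicted[i]);
            }
            if (!actual[i].equals(predicted[i])) wrong.add(names[i]);
        }
        return wrong;
    }

    // Fraction of examples classified correctly.
    static double accuracy(int total, int wrong) {
        return (total - wrong) / (double) total;
    }

    public static void main(String[] args) {
        // Placeholder data, for illustration only.
        String[] names     = {"testEx1", "testEx2", "testEx3", "testEx4"};
        String[] actual    = {"+", "-", "+", "-"};
        String[] predicted = {"+", "+", "+", "-"};
        List<String> wrong = misclassified(names, actual, predicted);
        System.out.println("Test-set accuracy: " + accuracy(names.length, wrong.size()));
        System.out.println("Misclassified test-set examples: " + wrong);
    }
}
```

Keeping the summary lines (accuracy and misclassified names) outside the DEBUG branch means the information the TA needs is always printed, while the per-example trace appears only when you ask for it.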