Note: this is for Prof. Shavlik's CS 540 section.
Assume you are using the following features to represent examples:
SHAPE possible values: Circle, Ellipse, Square, Triangle
COLOR possible values: Red, Blue
SIZE possible values: Medium, Large, Huge
(Since each feature value starts with a different letter, for shorthand
we'll just use that initial letter, eg 'C' for Circle.)
Our task will be binary valued, and we'll use '+' and '-' as our category labels.
Here is our TRAIN set:
SHAPE = S COLOR = R SIZE = L CATEGORY = +
SHAPE = S COLOR = R SIZE = M CATEGORY = -
SHAPE = C COLOR = B SIZE = H CATEGORY = +
SHAPE = E COLOR = R SIZE = H CATEGORY = -
SHAPE = S COLOR = B SIZE = M CATEGORY = -
SHAPE = E COLOR = B SIZE = L CATEGORY = -
SHAPE = C COLOR = R SIZE = H CATEGORY = +
Our TUNE set:
SHAPE = E COLOR = R SIZE = M CATEGORY = -
SHAPE = S COLOR = B SIZE = H CATEGORY = +
SHAPE = E COLOR = R SIZE = H CATEGORY = +
SHAPE = C COLOR = R SIZE = M CATEGORY = -
SHAPE = S COLOR = B SIZE = L CATEGORY = +
And our TEST set:
SHAPE = S COLOR = B SIZE = H CATEGORY = +
SHAPE = C COLOR = R SIZE = L CATEGORY = +
SHAPE = T COLOR = R SIZE = M CATEGORY = -
SHAPE = E COLOR = B SIZE = M CATEGORY = -
SHAPE = E COLOR = B SIZE = H CATEGORY = -
When multiple features tie as being the best one, choose the one whose name appears earliest in alphabetical order (eg, COLOR before SHAPE before SIZE). When there is a tie in computing MajorityValue, choose '-'.
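The two tie-breaking rules above can be sketched in Java as follows. This is only an illustrative sketch, not the provided code: the class name `InfoGainSketch`, its method names, and the row layout {SHAPE, COLOR, SIZE, CATEGORY} are assumptions, and the four rows in `main` are a made-up toy set (not the homework's TRAIN set), chosen so that COLOR and SIZE tie.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; none of these names come from the provided code.
class InfoGainSketch {
    static final double EPS = 1e-9; // tolerance for treating two gains as tied

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy (in bits) of a set with pos '+' examples and neg '-' examples.
    static double entropy(int pos, int neg) {
        int total = pos + neg;
        if (pos == 0 || neg == 0) return 0.0;
        double p = (double) pos / total, q = (double) neg / total;
        return -(p * log2(p) + q * log2(q));
    }

    // Information gain of splitting on column featureIdx.
    static double infoGain(String[][] examples, int featureIdx, int labelIdx) {
        int pos = 0;
        Map<String, int[]> counts = new TreeMap<>(); // value -> {#plus, #minus}
        for (String[] ex : examples) {
            boolean isPos = ex[labelIdx].equals("+");
            if (isPos) pos++;
            counts.computeIfAbsent(ex[featureIdx], k -> new int[2])[isPos ? 0 : 1]++;
        }
        double remainder = 0.0;
        for (int[] c : counts.values()) {
            remainder += ((double) (c[0] + c[1]) / examples.length) * entropy(c[0], c[1]);
        }
        return entropy(pos, examples.length - pos) - remainder;
    }

    // Highest-gain feature; ties go to the alphabetically earliest name.
    // Assumes featureNames[i] names column i of each example.
    static String bestFeature(String[][] examples, String[] featureNames, int labelIdx) {
        String best = null;
        double bestGain = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < featureNames.length; i++) {
            double g = infoGain(examples, i, labelIdx);
            boolean better = g > bestGain + EPS;
            boolean tie = Math.abs(g - bestGain) <= EPS;
            if (best == null || better || (tie && featureNames[i].compareTo(best) < 0)) {
                best = featureNames[i];
                bestGain = g;
            }
        }
        return best;
    }

    // Majority label; a tie goes to '-'.
    static String majorityValue(String[][] examples, int labelIdx) {
        int pos = 0;
        for (String[] ex : examples) if (ex[labelIdx].equals("+")) pos++;
        return pos > examples.length - pos ? "+" : "-";
    }

    public static void main(String[] args) {
        String[][] toy = {                      // made-up rows, for illustration only
            {"C", "R", "L", "+"}, {"C", "B", "M", "-"},
            {"S", "R", "L", "+"}, {"S", "B", "M", "-"}
        };
        String[] names = {"SHAPE", "COLOR", "SIZE"};
        System.out.println(bestFeature(toy, names, 3)); // prints "COLOR"
        System.out.println(majorityValue(toy, 3));      // prints "-"
    }
}
```

On the toy rows, COLOR and SIZE both separate the classes perfectly, so the alphabetical rule picks COLOR; the labels split 2 to 2, so MajorityValue is '-'.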
Let bestTree = the tree produced by ID3 on the TRAINING set
Let bestAccuracy = the accuracy of bestTree on the TUNING set
Let progressMade = true
while (progressMade) // Continue as long as improvement on TUNING SET
{
    Set progressMade = false
    Let currentTree = bestTree
    For each interiorNode N (including the root) in currentTree
    {   // Consider various pruned versions of the current tree
        // and see if any are better than the best tree found so far
        Let prunedTree be a copy of currentTree,
            except replace N by a leaf node
            whose label equals the majority class among TRAINING set
            examples that reached node N (break ties in favor of '-')
        Let newAccuracy = accuracy of prunedTree on the TUNING set
        // Is this pruned tree an improvement, based on the TUNE set?
        // When a tie, go with the smaller tree (Occam's Razor).
        If (newAccuracy >= bestAccuracy)
        {
            bestAccuracy = newAccuracy
            bestTree = prunedTree
            progressMade = true
        }
    }
}
return bestTree
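For those who find code easier to follow, here is one way the pruning loop above might look in Java. It is a sketch under several assumptions, not the required implementation: the `PruneSketch` and `Node` names are hypothetical, each node is assumed to have stored the majority TRAINING label (ties already broken in favor of '-') when the tree was built, interior nodes are visited in preorder, and an example whose feature value has no matching branch is given the node's majority label. The tiny tree and tune set in `main` are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; class, field, and method names are not from the provided code.
class PruneSketch {
    static class Node {
        int featureIdx = -1;   // column tested at this node; -1 marks a leaf
        String majority;       // majority TRAINING label here (a leaf's prediction)
        Map<String, Node> children = new LinkedHashMap<>();
        boolean isLeaf() { return featureIdx < 0; }
    }

    static Node leaf(String label) { Node n = new Node(); n.majority = label; return n; }

    static String classify(Node n, String[] ex) {
        while (!n.isLeaf()) {
            Node child = n.children.get(ex[n.featureIdx]);
            if (child == null) return n.majority; // unseen value: an assumption
            n = child;
        }
        return n.majority;
    }

    static double accuracy(Node tree, String[][] examples, int labelIdx) {
        int correct = 0;
        for (String[] ex : examples) if (classify(tree, ex).equals(ex[labelIdx])) correct++;
        return (double) correct / examples.length;
    }

    static int countInterior(Node n) {
        if (n.isLeaf()) return 0;
        int c = 1;
        for (Node ch : n.children.values()) c += countInterior(ch);
        return c;
    }

    // Deep copy of n in which the interior node with preorder index `target`
    // is replaced by a leaf. counter[0] holds the next preorder index.
    static Node copyPruned(Node n, int target, int[] counter) {
        Node copy = leaf(n.majority);
        if (n.isLeaf()) return copy;
        if (counter[0]++ == target) return copy; // prune here
        copy.featureIdx = n.featureIdx;
        for (Map.Entry<String, Node> e : n.children.entrySet())
            copy.children.put(e.getKey(), copyPruned(e.getValue(), target, counter));
        return copy;
    }

    static Node prune(Node tree, String[][] tune, int labelIdx) {
        Node bestTree = tree;
        double bestAccuracy = accuracy(bestTree, tune, labelIdx);
        boolean progressMade = true;
        while (progressMade) {
            progressMade = false;
            Node currentTree = bestTree;
            int interior = countInterior(currentTree);
            for (int i = 0; i < interior; i++) {
                Node prunedTree = copyPruned(currentTree, i, new int[] {0});
                double newAccuracy = accuracy(prunedTree, tune, labelIdx);
                if (newAccuracy >= bestAccuracy) { // ties favor the pruned tree
                    bestAccuracy = newAccuracy;
                    bestTree = prunedTree;
                    progressMade = true;
                }
            }
        }
        return bestTree;
    }

    public static void main(String[] args) {
        Node inner = new Node(); inner.featureIdx = 1; inner.majority = "+";
        inner.children.put("x", leaf("+")); inner.children.put("y", leaf("-"));
        Node root = new Node(); root.featureIdx = 0; root.majority = "-";
        root.children.put("a", inner); root.children.put("b", leaf("-"));
        String[][] tune = {{"a", "x", "+"}, {"a", "y", "+"}, {"b", "x", "-"}};
        Node pruned = prune(root, tune, 2);
        System.out.println(countInterior(pruned) + " " + accuracy(pruned, tune, 2)); // prints "1 1.0"
    }
}
```

In the toy run, replacing the inner node by a '+' leaf raises tuning-set accuracy from 2/3 to 3/3, while replacing the root by a '-' leaf would lower it, so only the inner node is pruned. Copying the tree for each candidate keeps currentTree unchanged while the alternatives are scored.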
Apply the above pruning algorithm to the tree you produced in Part 1a.
Show your work (eg, show the intermediate trees considered and their
tuning-set accuracies).
We will not be using a tuning set in this part of the homework.
We are providing two 'real world' datasets for your use. One involves predicting who survived the sinking of the Titanic and has only a few features. We've divided it into a training set and a testing set.
A second sample dataset, one used in early machine-learning research, is also available. We've divided it into a training set and a testing set. This dataset is based on actual votes in the US House of Representatives in the 1980s. The task is: given a representative's voting record on the 16 chosen bills, was the representative a Democrat or a Republican? More information is available via the UC-Irvine archive of machine-learning datasets. The voting dataset is rather old, but it is one of the few that involves only binary-valued features. Also, hopefully everyone has some intuitions about the task domain.
We have provided some code that reads the data files into some Java data structures. See BuildAndTestDecisionTree.java. You're welcome to use any or all of this provided code. The only requirement is that you create a BuildAndTestDecisionTree class, whose calling convention is as follows:
java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename>
See BuildAndTestDecisionTree.java for more details. Do be aware that we will test your solutions on additional datasets.
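As a rough illustration of the calling convention, the argument handling might look like the sketch below. The class name `CallingConventionSketch` and the `parseArgs` helper are hypothetical and are not taken from the provided BuildAndTestDecisionTree.java; in your solution this logic would belong in the main method of your BuildAndTestDecisionTree class.

```java
// Hypothetical sketch of the two-argument calling convention.
class CallingConventionSketch {
    // Returns {trainsetFilename, testsetFilename}, or throws if the
    // number of command-line arguments is not exactly two.
    static String[] parseArgs(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                "Usage: java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename>");
        }
        return new String[] { args[0], args[1] };
    }

    public static void main(String[] args) {
        try {
            String[] files = parseArgs(args);
            System.out.println("Train set: " + files[0]);
            System.out.println("Test set:  " + files[1]);
            // Reading the files, building the tree, and testing it would follow here.
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Checking the argument count up front gives a clear usage message instead of an array-index exception when a filename is missing.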
Here is what you need to do:
Run the following command:
handin -c cs540-2 -a <assignment_name> -d <directory>
Note: <assignment_name> is the name of each assignment (for this homework it is HW1)
and <directory> is the path to the directory where all your files are located.
Put your solution in one of your directories, and run the above command to turn in your program. To make sure you did it successfully, go to "~cs540-2/handin/{yourName}/{assignment}" and check that all of your files have been copied there.
Be sure to save elsewhere a copy of the code you place in this directory, in case something gets messed up during grading. Also, once the due date has passed, do not alter your code in your handin directory or you may be charged late points.
Finally, be sure to carefully match the required names (including the use of upper- and lowercase). This will greatly simplify the TA's job, and he can better spend his allotted hours on more important aspects of grading. After HW1, we are likely to subtract 2-3 points for misnamed files and directories.
Be sure to briefly discuss the two decision trees learned. Do they make sense? I.e., did they seem to learn something general or did they only 'memorize' the training examples? Is it reasonable that they made the test-set errors they made (if any)? No need to discuss EVERY error; discuss at most two test-set errors per dataset.