Assume you are using the following features to represent examples:
SHAPE possible values: Circle, Ellipse, Square, Triangle
AGE possible values: Young, Old
WORTH possible values: Low, High
(Since each feature value starts with a different letter, for shorthand
we'll just use that initial letter, eg 'C' for Circle.)
Our task will be binary valued, and we'll use '+' and '-' as our category labels.
Here is our TRAIN set:
SHAPE = C AGE = Y WORTH = L CATEGORY = -
SHAPE = E AGE = O WORTH = L CATEGORY = -
SHAPE = C AGE = Y WORTH = H CATEGORY = -
SHAPE = C AGE = O WORTH = H CATEGORY = -
SHAPE = S AGE = O WORTH = H CATEGORY = +
SHAPE = E AGE = Y WORTH = L CATEGORY = +
SHAPE = E AGE = Y WORTH = H CATEGORY = +
Our TUNE set:
SHAPE = C AGE = O WORTH = L CATEGORY = -
SHAPE = S AGE = O WORTH = H CATEGORY = +
SHAPE = E AGE = O WORTH = H CATEGORY = +
SHAPE = E AGE = Y WORTH = H CATEGORY = -
SHAPE = C AGE = Y WORTH = L CATEGORY = -
And our TEST set:
SHAPE = C AGE = O WORTH = H CATEGORY = -
SHAPE = C AGE = Y WORTH = L CATEGORY = -
SHAPE = T AGE = Y WORTH = H CATEGORY = +
SHAPE = E AGE = Y WORTH = L CATEGORY = +
SHAPE = E AGE = O WORTH = H CATEGORY = +
When multiple features tie as being the best one, choose the one whose name appears earliest in alphabetical order (eg, AGE before SHAPE before WORTH). When there is a tie in computing MajorityValue, choose '-'.
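The two tie-breaking rules above can be sketched in Java as follows. This is only an illustrative sketch, not the provided course code: the class name `TieBreakSketch`, its method names, and the representation of an example as a `String[]` with the label in the last position are all assumptions. It picks the feature with the lowest remaining entropy (equivalent to the highest information gain), visiting features in alphabetical order so that a tie keeps the earlier name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names and data layout are assumptions.
class TieBreakSketch {
    static final double EPS = 1e-9; // tolerance for treating two scores as tied

    // Entropy (in bits) of a set with pos '+' examples and neg '-' examples.
    static double entropy(int pos, int neg) {
        if (pos == 0 || neg == 0) return 0.0;
        double p = (double) pos / (pos + neg);
        double q = (double) neg / (pos + neg);
        return -(p * Math.log(p) + q * Math.log(q)) / Math.log(2);
    }

    // Majority label; a tie (or an empty set) yields '-'.
    static String majorityValue(List<String[]> examples) {
        int pos = 0;
        for (String[] ex : examples) {
            if (ex[ex.length - 1].equals("+")) pos++;
        }
        int neg = examples.size() - pos;
        return pos > neg ? "+" : "-";
    }

    // Weighted entropy left after splitting on feature column f.
    static double remainder(List<String[]> examples, int f) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (String[] ex : examples) {
            int[] c = counts.computeIfAbsent(ex[f], k -> new int[2]);
            if (ex[ex.length - 1].equals("+")) c[0]++; else c[1]++;
        }
        double rem = 0.0;
        for (int[] c : counts.values()) {
            rem += (double) (c[0] + c[1]) / examples.size() * entropy(c[0], c[1]);
        }
        return rem;
    }

    // names[i] is the name of feature column i. Features are visited in
    // alphabetical order and replaced only on a strict improvement, so ties
    // go to the alphabetically earliest name.
    static String bestFeature(List<String[]> examples, String[] names) {
        Integer[] order = new Integer[names.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> names[a].compareTo(names[b]));
        String best = null;
        double bestRem = Double.MAX_VALUE;
        for (int i : order) {
            double rem = remainder(examples, i);
            if (rem < bestRem - EPS) {
                bestRem = rem;
                best = names[i];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data (not from the assignment): both columns split equally well.
        List<String[]> toy = Arrays.asList(
            new String[] {"x", "x", "+"},
            new String[] {"y", "y", "-"});
        System.out.println(bestFeature(toy, new String[] {"SHAPE", "AGE"}));
    }
}
```

Comparing with a small tolerance rather than exact equality is a design choice here: two features whose gains are mathematically equal can differ in the last floating-point digit, which would otherwise defeat the alphabetical tie-break.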
Let bestTree = the tree produced by ID3 on the TRAINING set
Let bestAccuracy = the accuracy of bestTree on the TUNING set
Let progressMade = true
while (progressMade) // Continue as long as improvement on TUNING SET
{
  Set progressMade = false
  Let currentTree = bestTree
  For each interiorNode N (including the root) in currentTree
  { // Consider various pruned versions of the current tree
    // and see if any are better than the best tree found so far
    Let prunedTree be a copy of currentTree,
      except replace N by a leaf node
      whose label equals the majority class among TRAINING set
      examples that reached node N (break ties in favor of '-')
    Let newAccuracy = accuracy of prunedTree on the TUNING set
    // Is this pruned tree an improvement, based on the TUNE set?
    // When a tie, go with the smaller tree (Occam's Razor).
    If (newAccuracy >= bestAccuracy)
    {
      bestAccuracy = newAccuracy
      bestTree = prunedTree
      progressMade = true
    }
  }
}
return bestTree
Apply the above pruning algorithm to the tree you produced in Part 1a.
Show your work (eg, show the intermediate trees considered and their
tuning-set accuracies).
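For checking hand-worked answers, the pruning loop can be sketched in Java roughly as below. Everything here is an assumption made for illustration: the `PruneSketch` class, its `Node` type, the `String[]` example layout (label last), the pre-order visiting of interior nodes, and the choice to fall back to a node's training-set majority when an example has a feature value with no matching branch. It counts correct tuning-set predictions instead of computing a fraction, which gives the same comparisons because the tuning set never changes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the pruning pseudocode; not the provided course code.
class PruneSketch {
    static class Node {
        int featureIndex = -1;   // column tested at an interior node
        String label;            // class at a leaf, null at an interior node
        String majority;         // majority class of TRAINING examples reaching this node
        Map<String, Node> children = new LinkedHashMap<>();
        boolean isLeaf() { return label != null; }
    }

    static Node leaf(String label) {
        Node n = new Node();
        n.label = label;
        n.majority = label;
        return n;
    }

    static String classify(Node n, String[] ex) {
        while (!n.isLeaf()) {
            Node child = n.children.get(ex[n.featureIndex]);
            if (child == null) return n.majority; // assumption: unseen value
            n = child;
        }
        return n.label;
    }

    // Number of examples whose last field matches the tree's prediction.
    static int correct(Node tree, List<String[]> examples) {
        int c = 0;
        for (String[] ex : examples) {
            if (classify(tree, ex).equals(ex[ex.length - 1])) c++;
        }
        return c;
    }

    static void collectInterior(Node n, List<Node> out) {
        if (n.isLeaf()) return;
        out.add(n);
        for (Node c : n.children.values()) collectInterior(c, out);
    }

    // Deep copy of the tree, with target replaced by a majority-class leaf.
    static Node copyReplacing(Node n, Node target) {
        if (n == target) return leaf(n.majority);
        Node copy = new Node();
        copy.featureIndex = n.featureIndex;
        copy.label = n.label;
        copy.majority = n.majority;
        for (Map.Entry<String, Node> e : n.children.entrySet()) {
            copy.children.put(e.getKey(), copyReplacing(e.getValue(), target));
        }
        return copy;
    }

    static Node prune(Node tree, List<String[]> tune) {
        Node best = tree;
        int bestCorrect = correct(best, tune);
        boolean progressMade = true;
        while (progressMade) {
            progressMade = false;
            Node current = best;
            List<Node> interior = new ArrayList<>();
            collectInterior(current, interior);
            for (Node n : interior) {
                Node pruned = copyReplacing(current, n);
                int c = correct(pruned, tune);
                if (c >= bestCorrect) {  // ties favor the smaller tree
                    bestCorrect = c;
                    best = pruned;
                    progressMade = true;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy tree and tuning set (not from the assignment).
        Node root = new Node();
        root.featureIndex = 0;
        root.majority = "-";
        root.children.put("a", leaf("+"));
        root.children.put("b", leaf("-"));
        List<String[]> tune = Arrays.asList(
            new String[] {"a", "-"}, new String[] {"b", "-"});
        System.out.println(prune(root, tune).isLeaf());
    }
}
```

The loop terminates because every accepted candidate has at least one fewer interior node than the tree it was derived from, and a single leaf offers no further candidates.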
A sample dataset used in early machine-learning research is available. We've divided it into a training set and a testing set. We will not be using a tuning set in this part of the homework.
This dataset is based on actual votes in the US House of Representatives in the 1980s. The task is: given a representative's voting record on the 16 chosen bills, was the representative a Democrat or a Republican? More information is available via the UC-Irvine archive of machine-learning datasets. The voting dataset is rather old, but it is one of the few that involves only binary-valued features. Also, hopefully everyone has some intuitions about the task domain.
We have provided some code that reads the data files into some Java data structures. See BuildAndTestDecisionTree.java. You're welcome to use any or all of this provided code. The only requirement is that you create a BuildAndTestDecisionTree class, whose calling convention is as follows:
java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename>
See BuildAndTestDecisionTree.java for more details. Do be aware that we will test your solutions on additional datasets.
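A minimal sketch of an entry point that honors this calling convention is shown below. It is not the provided BuildAndTestDecisionTree.java, whose contents may differ; the `checkArgs` helper and the messages printed are assumptions, and the file-reading and tree-building steps are deliberately left as comments.

```java
// Hypothetical skeleton; the provided file may be organized differently.
class BuildAndTestDecisionTree {
    // Returns {trainsetFilename, testsetFilename}, or throws if the
    // argument count does not match the required calling convention.
    static String[] checkArgs(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                "usage: java BuildAndTestDecisionTree "
                + "<trainsetFilename> <testsetFilename>");
        }
        return new String[] {args[0], args[1]};
    }

    public static void main(String[] args) {
        String[] files;
        try {
            files = checkArgs(args);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
            System.exit(1);
            return;
        }
        // Reading the files, building the tree, and reporting test-set
        // accuracy would go here.
        System.out.println("Training set: " + files[0]);
        System.out.println("Testing set:  " + files[1]);
    }
}
```

Taking both filenames from the command line, rather than hard-coding them, matters because solutions will be run on additional datasets.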
Here is what you need to do:
Run the following command:
handin -c cs540-1 -a <assignment_name> -d <directory>
Note: <assignment_name> is the name of the assignment and
<directory> is the path to the directory where all your files are
located.
Put your solution in one of your directories, and run the above command to turn in your program. To make sure you did it successfully, go to "~cs540-1/handin/{yourName}/{assignment}" and check that all of your files were copied there.
Be sure to save elsewhere a copy of the code you place in this directory, in case something gets messed up during grading. Also, once the due date has passed, do not alter your code in your handin directory or you may be charged late points.
Finally, be sure to carefully match the required names (including the use of upper and lowercase). This will greatly simplify the TA's job, so she can better spend her allotted hours on more important aspects of grading. After HW1, we are likely to subtract 2-3 points for misnamed files and directories.