Handling Continuous Features in HW 2
Here's an inefficient but easy way to handle continuous features in ID3:
You really should redo the threshold search below on each recursive call
of ID3, but if that is too complicated, doing the "brute force" method
below is ok (I think both methods will pick the same thresholds, possibly
modulo ties).
Before starting the ID3 calculations for the root node,
SORT each continuous feature in the current training set
and make a list of the BOUNDARY VALUES (i.e., the values halfway
between two adjacent examples of different classes).
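The boundary-value step can be sketched roughly as follows (a minimal illustration; the function name and representation as (value, label) pairs are my own, not from the assignment):

```python
# Hypothetical sketch: find candidate thresholds for ONE continuous
# feature by sorting the (value, label) pairs and taking the midpoint
# between adjacent examples whose classes differ.
def boundary_thresholds(values, labels):
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        # A boundary exists only where the class changes AND the two
        # values are distinct (equal values give no usable midpoint).
        if c1 != c2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)
    return thresholds
```

For example, values [1, 2, 3, 4] with labels [0, 0, 1, 1] yield the single candidate threshold 2.5, halfway between the last class-0 example and the first class-1 example.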
At each node of the decision tree, score
each of these boundary values. (Thresholds used
higher up in the path to the current node don't
need to be rechecked, nor do those that fall between examples
no longer in the current set of training examples, but this
inefficiency is ok.)
For the "random" splitting function, treat a continuous feature
as a single feature in terms of its chances of being picked. If it is
picked, be sure to select a threshold that has examples on both sides
(to prevent infinite loops). Give all such thresholds (i.e.,
those that split the difference between the values of adjacent training-set
examples, regardless of the examples' classes, since this is supposed to be
random) an equal chance of being selected.
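One way to sketch the random splitter's threshold choice (assuming, as my own reading of the above, that "all such thresholds" means the midpoints between adjacent distinct sorted values, class ignored):

```python
import random

# Hedged sketch: collect every midpoint between adjacent DISTINCT
# sorted values (ignoring class, since the choice should be random).
# Any such midpoint is guaranteed to have training examples on both
# sides, which prevents infinite loops; pick one uniformly at random.
def random_threshold(values, rng=random):
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not midpoints:
        return None  # feature is constant here; no valid split exists
    return rng.choice(midpoints)
```

Deduplicating the values first matters: with values [1, 2, 2, 3], the only valid thresholds are 1.5 and 2.5, each chosen with equal probability; a midpoint between the two equal 2's would put zero examples on one side.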