CS 760 - Machine Learning
Homework 2
Due Monday, March 15, 2009
125 points
Inducing Decision Trees - ID3
In this homework you will investigate Quinlan's ID3 learning algorithm
(described in Mitchell's Chapter 3). Your assignment has four parts.
First, you will implement ID3, including a (possibly stochastic) pruning algorithm.
Second, you will run some experiments with ID3.
Third, you will implement and evaluate an ensemble method called random forests.
Fourth, you will plot results in ROC and recall-precision curves.
It is acceptable to look at the Java code available from
WEKA.
However, it is not acceptable to share code
with current or former cs760 students (be sure
that you have read the Academic Misconduct policy on the class homepage).
Part 1: Implementing ID3
Implement the algorithm in Mitchell's Table 3.1, augmenting it as explained below.
You should also implement the method discussed in Section 3.7.2 for
handling continuous features
(additional comments on handling continuous features are provided separately);
you need not deal with
hierarchical features on this homework.
If your HW0 dataset has features with lots of possible values (say more than 10),
your code should make N Boolean-valued features out of such features
(see the course notes). The datasets we will use for grading will
not have more than 10 possible values.
Part 1a: Handling Additional Splitting Functions
The crucial step in the ID3 algorithm involves choosing which feature to use
as the next node in the decision tree. Besides using Quinlan's info_gain
measure (Equation 3.4), implement one or more alternatives for use as experimental
controls:
- random: (uniformly) randomly choose one of the remaining features.
- accuracy: choose the feature that produces the fewest errors on the
current set of examples, where "impure" leaf nodes use the majority
category to "model" their data. For example, if 40 positives and 10
negatives follow the left branch and 15 positives and 35 negatives
follow the right branch, then "guessing the majority" produces 25 total
errors: 10 on the left branch and 15 on the right. (This splitting
function is optional; you might want to experiment with it, but note
that no extra credit will be given for doing so. A small sketch of this
error count appears after this list.)
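For concreteness, here is a minimal sketch (in Java, since WEKA's Java code is the suggested reference) of how the accuracy-based score for a binary split might be computed; the class and method names are our own and are not part of the required interface:

// A sketch only: score a candidate binary split by counting the training
// errors made when each branch predicts its majority class.
public final class SplitScoring {

    // Errors made by one branch if it predicts its majority class:
    // the minority examples are the ones misclassified.
    private static int branchErrors(int positives, int negatives) {
        return Math.min(positives, negatives);
    }

    // Total errors for a binary split; smaller is better.
    public static int splitErrors(int leftPos, int leftNeg, int rightPos, int rightNeg) {
        return branchErrors(leftPos, leftNeg) + branchErrors(rightPos, rightNeg);
    }

    public static void main(String[] args) {
        // The example from the text: 40+/10- go left, 15+/35- go right.
        System.out.println(splitErrors(40, 10, 15, 35));   // prints 25
    }
}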
You should implement a function called
RUN_ID3(String namesFile, String trainsetFile, String splittingFunction),
whose arguments are, in order, the file describing the examples,
the trainset to use, and the splitting function to use
(one of: info_gain, random, accuracy; Part 1b below adds a fourth option, pruned).
You should also write a function DUMP_TREE() that prints, in
some reasonable fashion, the most recently learned decision tree, and
a function REPORT_TREE_SIZE() that reports the number of
interior and leaf nodes (and their sum) in the most recently learned
decision tree. To prevent wasting paper, limit the maximum depth
of printed trees to THREE interior nodes (including the root);
wherever the tree is deeper than that, simply print something
like "there are K interior nodes below this one; P positive and N negative training
examples reached this node."
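Here is a minimal sketch, in Java, of how the depth-limited printing in DUMP_TREE might be organized; the Node class, its fields (isLeaf, label, feature, children, numPos, numNeg), and countInterior() are hypothetical placeholders for whatever tree representation you choose:

// Called as printTree(root, 1, ""); prints interior nodes only down to
// depth 3 and summarizes any deeper subtree instead of printing it.
void printTree(Node node, int depth, String indent) {
    if (node.isLeaf) {
        System.out.println(indent + "predict " + node.label);
    } else if (depth > 3) {
        // Too deep: summarize the whole unprinted subtree rooted here.
        System.out.println(indent + "there are " + node.countInterior()
            + " interior nodes in this unprinted subtree; " + node.numPos
            + " positive and " + node.numNeg
            + " negative training examples reached this node");
    } else {
        System.out.println(indent + "split on " + node.feature);
        for (Node child : node.children) {
            printTree(child, depth + 1, indent + "  ");
        }
    }
}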
Part 1b: Avoiding Overfitting
A decision tree that correctly classifies all the training examples might
be too complicated; some subtree might generalize better to future examples.
In this homework, you will implement a stochastic search for
a simpler tree, according to the following (call this variant pruned):
- Given a training set, create a train' set and a tuning (or "pruning") set;
place 20% of the examples in the tuning set.
- Create a tree that fully fits the train' set; use info_gain as the scoring function
for features. Call it Tree.
Set scoreOfBestTreeFound to the score of Tree on the tuning set.
- Number all the interior nodes from 1 to N (e.g., using a
"preorder" traversal;
however the precise method for numbering nodes doesn't
matter, as long as each interior node is counted once and only once).
- L times do the following
- Make a copy of Tree. Call it CopiedTree.
- Uniformly pick a random number, R, between 1 and K.
- R times, uniformly pick a random number, D, between 1 and N.
Mark node D in CopiedTree as "to be deleted."
- Make a new copy of CopiedTree, this time pruning the tree
whenever you encounter a "to be deleted" node.
Replace each deleted node with a leaf that predicts the majority
category for the subtree rooted at that node. CopiedAndPrunedTree
should point to this pruned tree.
- At this point, you will have a randomly pruned version of Tree.
Score it on the tuning set, and keep track of the best tree found so far.
- Return the best tree found (on the tuning set).
Let L=100 and K=5 (arbitrarily chosen values; if your code runs
too slowly, feel free to adjust L, but be sure to document this in your HW
write-up; you are also free to use higher values for L).
A sketch of this pruning loop appears below.
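As a concrete illustration (not required code), here is one way the loop above might look in Java; DecisionTree, Example, deepCopy, numberInteriorNodes, markForDeletion, pruneMarkedNodes, and tuningSetAccuracy are all hypothetical names for pieces you would write yourself:

// A sketch only: returns the best randomly pruned tree found on the tuning
// set. Assumes java.util.List and java.util.Random are imported.
DecisionTree stochasticPrune(DecisionTree tree, List<Example> tuneSet, int L, int K) {
    DecisionTree bestTree = tree;
    double bestScore = tuningSetAccuracy(tree, tuneSet);
    int n = numberInteriorNodes(tree);               // assigns ids 1..N, returns N
    Random rng = new Random();
    for (int trial = 0; trial < L; trial++) {        // L = 100 by default
        DecisionTree copy = deepCopy(tree);
        int r = 1 + rng.nextInt(K);                  // K = 5: how many nodes to mark
        for (int j = 0; j < r; j++) {
            markForDeletion(copy, 1 + rng.nextInt(n));   // mark a random interior node
        }
        DecisionTree pruned = pruneMarkedNodes(copy);    // marked nodes become majority-class leaves
        double score = tuningSetAccuracy(pruned, tuneSet);
        if (score > bestScore) {                     // keep the best tuning-set tree seen
            bestScore = score;
            bestTree = pruned;
        }
    }
    return bestTree;
}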
Option: if you prefer, you may instead implement the greedy algorithm presented in class for finding a good decision tree
based on tuning-set accuracy.
Part 1c: Making Decisions
You are to also create the function CATEGORIZE(testsetFile). This
function takes as input a string indicating which testset to use. For each
example in this list, the code should traverse the most recently
learned decision tree, using the feature values of the instance, until
a leaf is reached. The decision at the leaf node should be
compared to the correct answer and errors counted.
CATEGORIZE should print out the names (e.g., TEST_POS1, TEST_NEG1, etc)
of the miscategorized examples and report the overall error rate.
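A minimal sketch of the per-example traversal, assuming hypothetical Node and Example classes with the accessors shown (these names are not part of the required interface):

// Walk from the root to a leaf by repeatedly following the branch that
// matches the example's value for the current node's feature.
String classify(Node node, Example example) {
    while (!node.isLeaf) {
        node = node.childFor(example.valueOf(node.feature));
    }
    return node.label;   // the leaf's category, e.g. "positive" or "negative"
}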
During "auto-grading" we will test RUN_ID3, DUMP_TREE,
REPORT_TREE_SIZE, and CATEGORIZE.
When called via
HW2 task.names train_examples.data test_examples.data
have your main function do the following sequence of training runs:
RUN_ID3(task.names, train_examples.data, "info_gain");
DUMP_TREE();
REPORT_TREE_SIZE();
CATEGORIZE(test_examples.data);
RUN_ID3(task.names, train_examples.data, "pruned");
DUMP_TREE();
REPORT_TREE_SIZE();
CATEGORIZE(test_examples.data);
// If you implement this option, it is OK (but NOT necessary) to include this call.
RUN_ID3(task.names, train_examples.data, "accuracy");
DUMP_TREE();
REPORT_TREE_SIZE();
CATEGORIZE(test_examples.data);
RUN_ID3(task.names, train_examples.data, "random");
DUMP_TREE();
REPORT_TREE_SIZE();
CATEGORIZE(test_examples.data);
RUN_ID3(task.names, train_examples.data, "random");
DUMP_TREE();
REPORT_TREE_SIZE();
CATEGORIZE(test_examples.data);
Part 2: Experimenting with the Different Splitting Functions
For these experiments, use the examples of your personal dataset.
Part 2a: Using a Permutation Test to See if Using Information Gain is Better Than Randomly Choosing Features
Run your ID3 on your first train/test set split of HW1, using the
random splitting function. Do this at least 100 times and
create a probability distribution of decision-tree sizes (create about 20 equal-width bins
of tree sizes, e.g., [1, 10], [11, 20], [21, 30], etc.; the appropriate width will depend on your 'personal' concept). Next,
using the same train/test set split, apply the "info_gain" and "pruned"
functions you wrote and mark the tree sizes they produce on the same
figure as used for random. Be sure to dump the learned trees.
Discuss the trees and the relative performance of all
of the splitting techniques. Be sure to discuss, for each splitting method,
how likely it is that a random tree is no larger than the tree produced by that splitting function.
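One simple way to estimate that probability is directly from your empirical distribution of random-tree sizes; a sketch (the array and method names are hypothetical):

// The empirical probability that a randomly built tree is no larger than
// the tree produced by some splitting function. randomTreeSizes holds the
// sizes from your 100+ runs with the "random" splitter.
double empiricalPValue(int[] randomTreeSizes, int observedSize) {
    int count = 0;
    for (int size : randomTreeSizes) {
        if (size <= observedSize) {
            count++;             // a random tree at least as small as the observed one
        }
    }
    return (double) count / randomTreeSizes.length;
}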
Part 2b: Measuring Generalization
Investigate how well the ID3 algorithm works on new (i.e., "testset")
instances as a function of the splitting technique used.
Provide a single figure, with error bars (the 95% confidence intervals), that reports testset error rates as a
function of the splitting strategy used. For the random function,
average over at least 30 runs.
Compute these results using only your first train/test fold (for simplicity).
Discuss your results. Be sure to discuss, for each splitting method,
how likely it is that a random tree has a testset error no worse than that of that splitting function.
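For the error bars, one standard choice (see Mitchell, Chapter 5) is the normal approximation to the binomial confidence interval for an error rate measured on n test examples; a sketch (the method name is ours, not part of the required interface):

// Returns {lower, upper} for the approximate 95% confidence interval of an
// error rate estimated from n independently drawn test examples.
double[] confidenceInterval95(double errorRate, int n) {
    double halfWidth = 1.96 * Math.sqrt(errorRate * (1.0 - errorRate) / n);
    return new double[] { Math.max(0.0, errorRate - halfWidth),
                          Math.min(1.0, errorRate + halfWidth) };
}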
Next, run info_gain and pruned on all ten
train/test folds of HW1. Use the two-tailed, paired t-test
to pairwise compare info_gain, pruned, and
the best results you obtained on HW1. Report,
in the upper triangle of a 3-by-3 table,
your judgements of whether or not the pairwise
differences in error rates are statistically significant.
(If accuracy performs better than info_gain for you,
feel free to use it instead of info_gain
in your t-test analysis.)
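As a reminder (and only as a sketch; any statistics package is fine), the paired t statistic on the per-fold error-rate differences can be computed as below; with ten folds there are 9 degrees of freedom, and the two-tailed critical value at the 95% level is about 2.262. The method name is hypothetical:

// Paired t statistic for two methods' per-fold error rates (same folds,
// same order). Degrees of freedom = n - 1.
double pairedT(double[] errorsA, double[] errorsB) {
    int n = errorsA.length;                       // number of paired folds
    double mean = 0.0;
    for (int i = 0; i < n; i++) mean += errorsA[i] - errorsB[i];
    mean /= n;
    double sumSq = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (errorsA[i] - errorsB[i]) - mean;
        sumSq += d * d;
    }
    double stdErr = Math.sqrt(sumSq / (n - 1)) / Math.sqrt(n);
    return mean / stdErr;                         // compare |t| to the critical value
}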
Part 2c: Investigating the Correlation between Tree Size and Tree Accuracy
Draw a single figure whose X-axis is the size of the tree learned on the
training data and whose Y-axis is the error rate of the tree on the
testing data. Plot all the results for the experiments you ran in Part
2a on your first train/test fold, clearly indicating the splitting
function that produced each point (note that for random there will be at least 100 points).
Discuss your experimental results. How well do they support the "Occam's
Razor" hypothesis?
Part 3: Experimenting with Random Forests
Implement the Random Forest algorithm presented in class. Let i=5 (an arbitrary choice,
but for this HW you need not tune this parameter; if you have fewer than 10 features,
then let i be one half the number of features you have).
Each tree in the forest, which should not be pruned, gives an unweighted vote
when you need to classify a test example.
Since we learn K decision trees to produce a random forest,
we need to decide how to use this forest to classify a new 'test' example.
For simplicity, assume we label the test example as 'positive' iff a majority
of trees in the forest do so. (If there is a tie, then output the most common category
in your training set, and if that is also a tie, output 'positive.')
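A minimal sketch of this voting rule, assuming a hypothetical per-tree classify method (as in Part 1c) and a trainMajorityLabel argument that already encodes the tie-breaking rule (i.e., it is "positive" if the training set itself is tied):

// Classify one example with the first k trees of the forest, using an
// unweighted majority vote. Assumes java.util.List is imported.
String forestClassify(List<Node> forest, int k, Example example, String trainMajorityLabel) {
    int positiveVotes = 0;
    for (int t = 0; t < k; t++) {                 // only the first k trees vote
        if (classify(forest.get(t), example).equals("positive")) positiveVotes++;
    }
    if (2 * positiveVotes > k) return "positive"; // strict majority votes positive
    if (2 * positiveVotes < k) return "negative"; // strict majority votes negative
    return trainMajorityLabel;                    // exact tie among the trees
}

Using the first k trees of the same 100-tree forest for each point on the plot below lets you reuse a single training run for the whole curve.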
Create 100 trees on your trainset 1 (do not print them all out!).
Draw a plot of testset error (on your testset 1) as a function of the number of
decision trees combined into a forest. The X-axis of this plot should range from
1 to 100, and for each X value you should only use the first X trees.
On the Y-axis, plot the testset error rate. There is no need to plot all 100 values, especially
if you draw your graphs by hand; it is OK to plot only every 10th value or so
on the X-axis. In addition to plotting the testset error rate, plot the
trainset error rate on the same graph. (If your curve has not flattened
out by the time the forest is of size 100, feel free to consider more trees in
the forest, but doing so is not necessary.)
Using your standard ten train/test folds, do a two-tailed, paired t-test
comparison of random forests to the most accurate approach you've encountered
so far (in HW1 or above). We will not "auto-grade" your decision-forest code,
so you are free to use whatever function names you wish for this part of the homework.
Discuss your results.
(You might also want to compare random forests to using randomly generated
decision trees, but this is not required.)
Part 4: ROC and Recall-Precision Curves
Plot ROC curves - false-positive rate (X-axis) versus true-positive rate (Y-axis) -
for (a) your pruned decision-tree method and (b) random forests; plot BOTH curves
on the same ROC graph for ease of visual comparison. For your pruned decision trees,
use the m-estimated probability (let m=10) that a positive
example reaches a given leaf as the numeric value assigned to each testset example.
That is, for each leaf, if p positive and n negative training
examples reached it, the m-estimated probability of a test example
being positive given that it reaches this leaf is (p + m/2) / (p + n + m)
(if we assume positive and negative examples are a priori equally likely;
it is also fine to use the overall positive-negative skew in the
data set to divvy up the m pseudo-examples).
For your random forests, again use 100 (unpruned) trees and use the count of
trees predicting "true" as the numeric value assigned to each example.
Pool all of your ten testset folds to create your ROC curves.
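A minimal sketch of one way to turn the pooled, scored testset examples into ROC points; ScoredExample and its fields are hypothetical, the examples are assumed to be pre-sorted by score (most confidently positive first), and ties in scores are ignored for brevity:

// Sweep the decision threshold downward through the sorted examples and
// record a (false-positive rate, true-positive rate) point after each one.
// Assumes java.util.List and java.util.ArrayList are imported.
List<double[]> rocPoints(List<ScoredExample> sortedExamples, int totalPos, int totalNeg) {
    List<double[]> points = new ArrayList<double[]>();   // each point is {FP rate, TP rate}
    int tp = 0, fp = 0;
    points.add(new double[] {0.0, 0.0});
    for (ScoredExample e : sortedExamples) {
        if (e.isActuallyPositive) tp++; else fp++;
        points.add(new double[] {(double) fp / totalNeg, (double) tp / totalPos});
    }
    return points;                                       // the last point is (1, 1)
}

For the recall-precision curves asked for below, the same sweep works, recording (recall, precision) = (tp / totalPos, tp / (tp + fp)) after each example instead.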
Repeat the above, but this time draw recall-precision curves.
Discuss the results you plotted in your graphs.
You may use any method you wish to draw the curves,
including "by hand."
Requirements
Hand in a hard copy of all of the (commented) code you write. Neatly plot the
data requested in the experimental sections. Be sure to label all of your
axes. Type your answers to the questions posed. Include your name on all of
the pages submitted and staple everything together. Unless otherwise noted,
include a printout of your code's output
for the experiments you are asked to run; however,
use your judgement - if this output gets too big, feel free to delete
unimportant portions.
Be sure to place a copy of your code in:
~cs760-1/handin/{your login name}/HW2