
CS 760 - Machine Learning

Homework 2
Due Monday, March 15, 2009
125 points

Inducing Decision Trees - ID3

In this homework you will investigate Quinlan's ID3 learning algorithm (described in Mitchell's Chapter 3). Your assignment has four parts. First, you will implement ID3, including a (possibly stochastic) pruning algorithm. Second, you will run some experiments with ID3. Third, you will implement and evaluate an ensemble method called random forests. Fourth, you will plot results in ROC and recall-precision curves.

It is acceptable to look at the Java code available from WEKA. However, it is not acceptable to share code with current or former cs760 students (be sure that you have read the Academic Misconduct policy on the class homepage).

Part 1: Implementing ID3

Implement the algorithm in Mitchell's Table 3.1, augmenting it as explained below. You should also implement the method discussed in Section 3.7.2 for handling continuous features (some additional comments on this); you need not deal with hierarchical features on this homework.
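The Section 3.7.2 scheme can be sketched as follows: sort the examples by the continuous feature's value and place a candidate threshold midway between adjacent values whose class labels differ, then score each candidate Boolean split like any other feature. A Python illustration (the assignment does not mandate a language, and the function name is ours):

```python
def candidate_thresholds(values, labels):
    # Sort (value, label) pairs by the continuous feature's value.
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        # A candidate threshold sits midway between adjacent values
        # whose class labels differ.
        if y1 != y2 and v1 != v2:
            thresholds.append((v1 + v2) / 2.0)
    return thresholds
```

For instance, values [40, 48, 60, 72, 80, 90] with labels No, No, Yes, Yes, Yes, No yield candidate thresholds 54.0 and 85.0; each would then be evaluated by the splitting function like any Boolean feature.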

If your HW0 dataset has features with many possible values (say, more than 10), your code should convert each such N-valued feature into N Boolean-valued features, one per possible value (see the course notes). The datasets we will use for grading will not have more than 10 possible values per feature.
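That conversion can be sketched as follows (a Python illustration; the dictionary-based example representation is an assumption, not a requirement):

```python
def booleanize_example(example, big_features):
    """example: dict mapping feature name -> value.
    big_features: dict mapping each feature that has too many values
    to its list of possible values."""
    out = {}
    for feature, value in example.items():
        if feature in big_features:
            # One Boolean feature per possible value of the original feature.
            for possible in big_features[feature]:
                out["%s=%s" % (feature, possible)] = (value == possible)
        else:
            out[feature] = value
    return out
```

For example, a 3-valued color feature becomes three Boolean features color=red, color=green, and color=blue.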

Part 1a: Handling Additional Splitting Functions

The crucial step in the ID3 algorithm is choosing which feature to use as the next node in the decision tree. Besides Quinlan's info_gain measure (Mitchell's Equation 3.4), implement one or more alternatives for use as experimental controls:

random - (uniformly) randomly choose one of the remaining features

accuracy - choose the feature that produces the fewest errors on the current set of examples, where "impure" leaf nodes use the majority category to "model" their data (e.g., if 40 positives and 10 negatives follow the left branch and 15 positives and 35 negatives follow the right branch, then "guessing the majority" produces 25 total errors - 10 on the left branch and 15 on the right). This splitting function is optional - you might want to experiment with it, but no extra credit will be given for doing so.
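Both scoring functions reduce to arithmetic on per-branch class counts. A minimal Python sketch (the assignment does not mandate a language, and these function names are illustrative):

```python
import math

def entropy(p, n):
    """Entropy of a node with p positive and n negative examples."""
    e = 0.0
    for c in (p, n):
        if c:
            q = c / (p + n)
            e -= q * math.log2(q)
    return e

def info_gain(p, n, branches):
    """branches: per-branch (positive, negative) counts after the split."""
    total = p + n
    remainder = sum((pi + ni) / total * entropy(pi, ni) for pi, ni in branches)
    return entropy(p, n) - remainder

def split_errors(branches):
    """Errors made when each branch predicts its majority category."""
    return sum(min(pi, ni) for pi, ni in branches)
```

Here split_errors([(40, 10), (15, 35)]) returns 25, matching the worked example above.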

You should implement a function called RUN_ID3(String namesFile, String trainsetFile, String splittingFunction), whose arguments are, in order, the names file describing the examples, the trainset file to use, and the splitting function to use (one of: info_gain, random, accuracy, or the pruned variant of Part 1b).

You should also write a function DUMP_TREE() that prints, in some reasonable fashion, the most recently learned decision tree, and a function REPORT_TREE_SIZE() that reports the number of interior and leaf nodes (and their sum) in the most recently learned decision tree. To avoid wasting paper, limit the maximum depth of printed trees to THREE interior nodes (including the root); wherever the tree is deeper than that, simply print something like "there are K interior nodes below this one; P positive and N negative training examples reached this node."
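A depth-limited printer along these lines might look like the following Python sketch (the Node class and output format are illustrative, not prescribed):

```python
class Node:
    """Interior nodes carry a feature and children; leaves carry a label.
    pos/neg are the counts of training examples that reached the node."""
    def __init__(self, feature=None, children=None, label=None, pos=0, neg=0):
        self.feature, self.children = feature, children or {}
        self.label, self.pos, self.neg = label, pos, neg

def count_interior(node):
    if node.feature is None:
        return 0
    return 1 + sum(count_interior(c) for c in node.children.values())

def dump_tree(node, depth=0, max_depth=3):
    indent = "| " * depth
    if node.feature is None:                      # leaf
        print(indent + "predict " + str(node.label))
    elif depth >= max_depth:                      # truncate deep subtrees
        print(indent + "... %d interior node(s) here and below; "
              "%d pos / %d neg training examples reached this node"
              % (count_interior(node), node.pos, node.neg))
    else:
        for value, child in node.children.items():
            print(indent + "%s = %s:" % (node.feature, value))
            dump_tree(child, depth + 1, max_depth)
```

REPORT_TREE_SIZE can reuse count_interior plus an analogous leaf counter.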

Part 1b: Avoiding Overfitting

A decision tree that correctly classifies all the training examples might be too complicated; some subtree might generalize better to future examples. In this homework, you will implement a stochastic search for a simpler tree, according to the following (call this variant pruned):
  1. Given a training set, create a train' set and a tuning (or "pruning") set; place 20% of the examples in the tuning set.
  2. Create a tree that fully fits the train' set; use info_gain as the scoring function for features. Call it Tree. Set scoreOfBestTreeFound to the score of Tree on the tuning set.
  3. Number all the interior nodes from 1 to N (eg, using a "preorder" traversal; however the precise method for numbering nodes doesn't matter, as long as each interior node is counted once and only once).
  4. L times do the following: randomly select K of the N interior nodes and prune them, replacing each with a leaf that predicts the majority category of the training examples that reached it; score the resulting tree on the tuning set, and if that score beats scoreOfBestTreeFound, save this tree and update scoreOfBestTreeFound.
  5. Return the best tree found (on the tune set).
Let L=100 and K=5 (arbitrarily chosen values; if your code runs too slowly, feel free to adjust L but be sure to document this in your HW writeup - also feel free to use higher values for L).
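The pruning loop can be sketched as below (Python; the nested-dict tree representation is illustrative, and details the assignment leaves open - such as whether each trial prunes the full tree or the best tree so far, and how ties are broken - are resolved here by pruning a fresh copy of the full tree and keeping only strict improvements):

```python
import copy, random

def interior_nodes(tree, acc=None):
    """Collect interior nodes of a tree stored as nested dicts:
    interior = {'feature', 'children', 'pos', 'neg'}, leaf = {'label'}."""
    if acc is None:
        acc = []
    if 'feature' in tree:
        acc.append(tree)
        for child in tree['children'].values():
            interior_nodes(child, acc)
    return acc

def prune_to_leaf(node):
    """Replace an interior node, in place, with a majority-category leaf."""
    pos, neg = node['pos'], node['neg']
    node.clear()
    node['label'] = '+' if pos >= neg else '-'

def classify(tree, example):
    while 'feature' in tree:
        tree = tree['children'][example[tree['feature']]]
    return tree['label']

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def stochastic_prune(tree, tune_set, L=100, K=5):
    best, best_score = tree, accuracy(tree, tune_set)
    for _ in range(L):
        cand = copy.deepcopy(tree)            # each trial prunes the full tree
        nodes = interior_nodes(cand)
        for node in random.sample(nodes, min(K, len(nodes))):
            if 'feature' in node:             # skip nodes already pruned away
                prune_to_leaf(node)
        score = accuracy(cand, tune_set)
        if score > best_score:                # ties: keep the earlier tree
            best, best_score = cand, score
    return best
```

The deepcopy keeps each of the L trials independent of the others.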

Option: if you prefer, you may instead implement the greedy algorithm presented in class for finding a good decision tree based on tuning-set accuracy.

Part 1c: Making Decisions

You should also create the function CATEGORIZE(testsetFile). This function takes as input a string indicating which testset to use. For each example in that testset, the code should traverse the most recently learned decision tree, using the feature values of the instance, until a leaf is reached. The decision at the leaf should be compared to the correct answer and errors counted.

CATEGORIZE should print out the names (e.g., TEST_POS1, TEST_NEG1, etc) of the miscategorized examples and report the overall error rate.

During "auto-grading" we will test RUN_ID3, DUMP_TREE, REPORT_TREE_SIZE, and CATEGORIZE. When called via

HW2 task.names

have your main function do the following sequence of training runs:

     RUN_ID3(task.names,, "info_gain");

     RUN_ID3(task.names,, "pruned");

     // If you implement this option, it is OK (but NOT necessary) to include this call.
     RUN_ID3(task.names,, "accuracy");

     RUN_ID3(task.names,, "random");

     RUN_ID3(task.names,, "random");

Part 2: Experimenting with the Different Splitting Functions

For these experiments, use the examples of your personal dataset.

Part 2a: Using a Permutation Test to See if Using Information Gain is Better Than Randomly Choosing Features

Run your ID3 on your first train/test set split of HW1, using the random splitting function. Do this at least 100 times and create a probability distribution of decision-tree sizes (create about 20 equal-width bins of tree sizes, e.g., [1, 10], [11, 20], [21, 30], etc. - the appropriate width will depend on your 'personal' concept). Next, using the same train/test set split, apply the "info_gain" and "pruned" functions you wrote and mark the tree sizes they produce on the same figure as used for random. Be sure to dump the learned trees. Discuss the trees and the relative performance of all of the splitting techniques. Be sure to discuss, for each splitting method, how likely it is that a random tree is no larger than the tree produced by that splitting function.

Part 2b: Measuring Generalization

Investigate how well the ID3 algorithm works on new (i.e., "testset") instances as a function of the splitting technique used. Provide a single figure, with error bars (the 95% confidence intervals), that reports testset error rates as a function of the splitting strategy used. For the random function, average over at least 30 runs. Compute these results using only your first train/test fold (for simplicity). Discuss your results. Be sure to discuss, for each splitting method, how likely it is that a random tree's testset error is no worse than that of the tree the splitting method produces.
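The error bars can come from the usual normal-approximation confidence interval for a proportion (as in Mitchell's Chapter 5 treatment of evaluating hypotheses); a sketch:

```python
import math

def error_ci(errors, n, z=1.96):
    """95% normal-approximation confidence interval for a testset error
    rate measured on n examples (z = 1.96 for 95% confidence)."""
    err = errors / n
    half = z * math.sqrt(err * (1 - err) / n)
    # Clamp to [0, 1] since an error rate cannot leave that range.
    return max(0.0, err - half), min(1.0, err + half)
```

For example, 30 errors on 100 test examples gives roughly the interval [0.21, 0.39].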

Next, run info_gain and pruned on all ten train/test folds of HW1. Use a two-tailed, paired t-test to pairwise compare info_gain, pruned, and the best results you obtained on HW1. Report in the upper triangle of a 3-by-3 table the judgements of whether or not the pairwise differences in error rates are statistically significantly different. (If accuracy performs better than info_gain for you, feel free to use it instead of info_gain in your t-test analysis.)
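The paired t statistic is computed from the per-fold differences in error rates; with ten folds, compare |t| against the two-tailed critical value t(0.025, 9) = 2.262. A sketch (function name is ours):

```python
import math

def paired_t(errors_a, errors_b):
    """Paired t statistic over matched per-fold error rates.
    Compare |t| to the two-tailed critical value with n-1 degrees
    of freedom (2.262 for n = 10 folds at the 95% level)."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

Note the test is undefined when every fold gives an identical difference (zero variance).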

Part 2c: Investigating the Correlation between Tree Size and Tree Accuracy

Draw a single figure whose X-axis is the size of the tree learned on the training data and whose Y-axis is the error rate of the tree on the testing data. Plot all the results for the experiments you ran in Part 2a on your first train/test fold, clearly indicating the splitting function that produced each point (note that for random there will be at least 100 points). Discuss your experimental results. How well do they support the "Occam's Razor" hypothesis?

Part 3: Experimenting with Random Forests

Implement the Random Forest algorithm presented in class. Let i=5 (an arbitrary choice, but for this HW you need not tune this parameter; if you have fewer than 10 features, let i be one half the number of features you have). The trees in the forest should not be pruned, and each gives an unweighted vote when you need to classify a test example.

Since we learn an ensemble of decision trees to produce a random forest, we need to decide how to use this forest to classify a new 'test' example. For simplicity, label a test example 'positive' iff a majority of the trees in the forest do so. (In case of a tie, output the most common category in your training set; if that is also a tie, output 'positive.')
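The voting rule just described can be sketched as (Python; the '+'/'-' labels are illustrative):

```python
def forest_vote(predictions, train_majority='+'):
    """Classify by unweighted majority vote of per-tree predictions.
    A tied vote falls back on the training set's majority category,
    which itself defaults to '+' when the training set is balanced."""
    pos = sum(1 for p in predictions if p == '+')
    neg = len(predictions) - pos
    if pos != neg:
        return '+' if pos > neg else '-'
    return train_majority
```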

Create 100 trees on your trainset 1 (do not print them all out!). Draw a plot of testset error (on your testset 1) as a function of the number of decision trees combined into a forest. The X-axis of this plot should range from 1 to 100, and for each X value you should only use the first X trees. On the Y-axis plot the testset error rate. No need to plot all 100 values, especially if you draw your graphs by hand. It is ok to draw every 10 or so values on the X axis. In addition to plotting testset error rate, plot trainset error rate on the same graph. (If your curve has not flattened out by the time the forest is of size 100, feel free to consider more trees in the forest, but doing so is not necessary.)
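Because each X value reuses the first X trees, you can accumulate votes incrementally rather than re-voting from scratch; a sketch (the prediction-matrix layout is our assumption):

```python
def error_vs_forest_size(tree_preds, labels, train_majority='+'):
    """tree_preds[t][i] = prediction ('+'/'-') of tree t on example i.
    Returns errs, where errs[x-1] is the error rate of a forest built
    from only the first x trees."""
    n = len(labels)
    pos_votes = [0] * n
    errs = []
    for x, preds in enumerate(tree_preds, start=1):
        for i, p in enumerate(preds):          # fold in tree x's votes
            if p == '+':
                pos_votes[i] += 1
        wrong = 0
        for i, y in enumerate(labels):
            pos = pos_votes[i]
            neg = x - pos
            if pos != neg:
                pred = '+' if pos > neg else '-'
            else:
                pred = train_majority          # tie-break as described above
            wrong += (pred != y)
        errs.append(wrong / n)
    return errs
```

Running this once on the testset and once on the trainset gives both curves for the figure.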

Using your standard ten train/test folds, do a two-tailed, paired t-test comparison of random forests to the most accurate approach you've encountered so far (in HW1 or above). We will not 'auto-grade' your decision-forest code, so you are free to use whatever function names you wish for this part of the homework.

Discuss your results.

(You might want also want to compare random forests to using randomly generated decision trees, but this is not required.)

Part 4: ROC and Recall-Precision Curves

Plot ROC curves - false-positive rate (X-axis) versus true-positive rate (Y-axis) - for (a) your pruned decision-tree method and (b) random forests; plot BOTH curves on the same ROC graph for ease of visual comparison. For your pruned decision trees, use the m-estimated probability (let m=10) that an example reaching a given leaf is positive as the numeric value assigned to each testset example. That is, for each leaf, if p positive and n negative training examples reached it, the m-estimated probability that a test example reaching this leaf is positive is (p + m/2) / (p + n + m), assuming positive and negative examples are a priori equally likely (it is also fine to use the overall positive-negative skew in the dataset to divvy up the m pseudo-examples).
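With equal priors the m pseudo-examples split evenly, giving (p + m/2) / (p + n + m); a sketch (the prior_pos parameter covers the alternative of using the dataset's skew):

```python
def m_estimate(p, n, m=10, prior_pos=0.5):
    """m-estimated probability that an example reaching a leaf with
    p positive and n negative training examples is positive."""
    return (p + m * prior_pos) / (p + n + m)
```

For a leaf with 40 positives and 10 negatives and m=10, this yields (40 + 5) / 60 = 0.75; an empty leaf falls back on the prior, 0.5.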

For your random forests, again use 100 (unpruned) trees and use the count of trees predicting "true" as the numeric value assigned to each example.

Pool all of your ten testset folds to create your ROC curves.
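Given the pooled (score, label) pairs, the ROC points come from sweeping a threshold from the highest score downward; a sketch (this simple version steps one example at a time rather than merging tied scores):

```python
def roc_points(scored):
    """scored: list of (score, true_label) pairs with labels '+'/'-'.
    Returns (false-positive-rate, true-positive-rate) points, one per
    threshold, from the most conservative threshold to the loosest."""
    scored = sorted(scored, key=lambda s: -s[0])   # highest score first
    P = sum(1 for _, y in scored if y == '+')
    N = len(scored) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in scored:
        if y == '+':
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points
```

The same sweep yields recall-precision points by plotting (tp / P, tp / (tp + fp)) instead.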

Repeat the above, but this time draw recall-precision curves.

Discuss the results you plotted in your graphs. You may use any method you wish to draw the curves, including "by hand."


Hand in a hard-copy of all of the (commented) code you write. Neatly plot the data requested in the experimental sections. Be sure to label all of your axes. Type your answers to the questions posed. Include your name on all of the pages submitted and staple everything together. Unless otherwise noted, include a printout of your code's output for the experiments you are asked to run; however, use your judgement - if this output gets too big, feel free to delete unimportant portions. Be sure to place a copy of your code in:

~cs760-1/handin/{your login name}/HW2