CS540 HW1: Learning Decision Trees from Training Examples

Assigned: Monday 2/4/13
Due: Monday 2/18/13 at 11:59pm (can be turned in late up until 10:50am 2/25/13)
Value: 150 points

Note: this is for Prof. Shavlik's CS 540 section.

Problem 1 - Pruning Decision Trees

It is fine to write by hand your solution to this problem, but please be sure your scanned PDF file is legible before turning it in.

Often in machine learning, one randomly divides his or her training data into three subsets called the training, tuning, and testing sets. (Sometimes the tuning set is called a validation or, especially in decision-tree induction, a pruning set.) One first uses the training set to initially learn, then the tuning set to address overfitting, and finally the testing set to estimate how accurately the 'tuned' result will work in the future. Usually, this entire process is repeated multiple times in order to get a statistically sound estimate of future accuracy. However, in this problem we'll only go through this process once. (Plus, we're only using an unrealistically small sample in order to keep this simple. So in Problem 1 focus on the algorithm, rather than the intelligence of the results. In Problem 2, we'll be using a "real world" dataset.)
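
To make the three-way division concrete, here is a minimal Java sketch of one way to do it. Everything in it (the SplitSketch class, the split method, the fractions, and the seed) is a hypothetical illustration, not part of the provided course code, and other splitting schemes are equally reasonable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not from the provided code) for a random three-way split. */
final class SplitSketch {

    /**
     * Shuffles a copy of data and returns three lists in the order
     * [train, tune, test]. Whatever is left after the train and tune
     * portions becomes the test set.
     */
    static <T> List<List<T>> split(List<T> data, double trainFrac, double tuneFrac, long seed) {
        if (trainFrac < 0 || tuneFrac < 0 || trainFrac + tuneFrac > 1.0) {
            throw new IllegalArgumentException("fractions must be non-negative and sum to at most 1");
        }
        List<T> shuffled = new ArrayList<>(data);          // leave the caller's list untouched
        Collections.shuffle(shuffled, new Random(seed));   // fixed seed makes the split repeatable

        int n = shuffled.size();
        int trainEnd = (int) Math.round(n * trainFrac);
        int tuneEnd = Math.min(n, trainEnd + (int) Math.round(n * tuneFrac));

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainEnd)));
        parts.add(new ArrayList<>(shuffled.subList(trainEnd, tuneEnd)));
        parts.add(new ArrayList<>(shuffled.subList(tuneEnd, n)));
        return parts;
    }
}
```

With 17 examples and fractions 0.4 and 0.3, this sketch yields sets of sizes 7, 5, and 5. Repeating the whole learn/tune/test process, as mentioned above, would amount to calling it with different seeds.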

Assume you are using the following features to represent examples:

     SHAPE        possible values:     Circle, Ellipse, Square, Triangle
     COLOR        possible values:     Red, Blue
     SIZE         possible values:     Medium, Large, Huge
(Since each feature value starts with a different letter, for shorthand we'll just use that initial letter, eg 'C' for Circle.)

Our task will be binary valued, and we'll use '+' and '-' as our category labels.

Here is our TRAIN set:

     SHAPE = S    COLOR = R   SIZE = L    CATEGORY = +
     SHAPE = C    COLOR = R   SIZE = H    CATEGORY = +
     SHAPE = C    COLOR = B   SIZE = H    CATEGORY = +
     SHAPE = T    COLOR = R   SIZE = L    CATEGORY = +
     SHAPE = S    COLOR = B   SIZE = M    CATEGORY = -
     SHAPE = E    COLOR = B   SIZE = L    CATEGORY = -
     SHAPE = C    COLOR = R   SIZE = M    CATEGORY = -

Our TUNE set:

     SHAPE = C    COLOR = B   SIZE = L    CATEGORY = +
     SHAPE = C    COLOR = B   SIZE = H    CATEGORY = +
     SHAPE = E    COLOR = R   SIZE = L    CATEGORY = -
     SHAPE = S    COLOR = R   SIZE = H    CATEGORY = -
     SHAPE = S    COLOR = R   SIZE = M    CATEGORY = -

And our TEST set:

     SHAPE = C    COLOR = B   SIZE = H     CATEGORY = +
     SHAPE = C    COLOR = R   SIZE = L     CATEGORY = +
     SHAPE = C    COLOR = B   SIZE = L     CATEGORY = +
     SHAPE = E    COLOR = B   SIZE = M     CATEGORY = -
     SHAPE = S    COLOR = R   SIZE = L     CATEGORY = - 

Part 1a - Inducing the initial decision tree

First, apply the decision-tree algorithm in Fig 18.5 of the text to the TRAIN set (we'll call this algorithm ID3 for short from now on; ID3 is a highly influential decision-tree learner created by Ross Quinlan in the early/mid-1980s). Show all your work.

When multiple features tie as being the best one, choose the one whose name appears earliest in alphabetical order (eg, COLOR before SHAPE before SIZE). When there is a tie in computing MajorityValue, choose '-'.
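
The scoring and tie-breaking rules can be sketched in Java as follows. This assumes the feature-importance measure is information gain computed from entropy; the InfoGainSketch class and its method names are hypothetical, and the counts you pass in are up to you (none of the numbers from the TRAIN set are worked out here).

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not from the provided code) for scoring candidate features. */
final class InfoGainSketch {

    /** Entropy, in bits, of a node holding pos '+' examples and neg '-' examples. */
    static double entropy(int pos, int neg) {
        int total = pos + neg;
        if (total == 0 || pos == 0 || neg == 0) {
            return 0.0; // a pure (or empty) node has no uncertainty
        }
        double p = (double) pos / total;
        double q = (double) neg / total;
        return -(p * log2(p) + q * log2(q));
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * Information gain of a split. Each row of branchCounts is {pos, neg}
     * for one value of the candidate feature.
     */
    static double infoGain(int pos, int neg, int[][] branchCounts) {
        int total = pos + neg;
        if (total == 0) {
            return 0.0;
        }
        double remainder = 0.0; // expected entropy after the split
        for (int[] branch : branchCounts) {
            int size = branch[0] + branch[1];
            remainder += ((double) size / total) * entropy(branch[0], branch[1]);
        }
        return entropy(pos, neg) - remainder;
    }

    /** Picks the highest-gain feature; ties go to the alphabetically earliest name. */
    static String bestFeature(Map<String, Double> gainByFeature) {
        String best = null;
        double bestGain = Double.NEGATIVE_INFINITY;
        // TreeMap iterates keys in sorted order, so requiring a strict
        // improvement keeps the earliest name when gains tie.
        for (Map.Entry<String, Double> e : new TreeMap<>(gainByFeature).entrySet()) {
            if (e.getValue() > bestGain + 1e-12) {
                bestGain = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

The small tolerance in bestFeature is a design choice of this sketch: it keeps floating-point round-off from breaking what should be an exact tie.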

Part 1b - Pruning the tree to reduce overfitting

Overfitting occurs when a decision tree conforms too closely to the training data and does not accurately model the underlying concept. One way to address this problem in decision-tree induction is to use a tuning set in conjunction with a pruning algorithm. For this assignment we will use the 'greedy' algorithm sketched below.

  Let bestTree = the tree produced by ID3 on the TRAINING set
  Let bestAccuracy = the accuracy of bestTree on the TUNING set
  Let progressMade = true

  while (progressMade) // Continue as long as improvement on TUNING SET
  {
    Set progressMade = false
    Let currentTree = bestTree

    For each interiorNode N (including the root) in currentTree
    {   // Consider various pruned versions of the current tree
        // and see if any are better than the best tree found so far

        Let prunedTree be a copy of currentTree,
        except replace N by a leaf node
        whose label equals the majority class among TRAINING set
        examples that reached node N (break ties in favor of '-')

        Let newAccuracy = accuracy of prunedTree on the TUNING set

        // Is this pruned tree an improvement, based on the TUNE set?
        // When a tie, go with the smaller tree (Occam's Razor).
        If (newAccuracy >= bestAccuracy)
        {
          bestAccuracy = newAccuracy
          bestTree = prunedTree
          progressMade = true
        }
    }
  }
  return bestTree

Apply the above pruning algorithm to the tree you produced in Part 1a. Show your work (eg, show the intermediate trees considered and their tuning-set accuracies).
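
For those who find code easier to follow than pseudocode, the greedy loop above might look roughly like this in Java. It is only a sketch under assumptions: the PruneSketch, Node, and Example classes are hypothetical, each node is assumed to already store the majority TRAINING label of the examples that reached it, and an example whose feature value has no branch falls back to the current node's majority label.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the greedy pruning loop; all names here are assumptions. */
final class PruneSketch {

    /** A tree node: a leaf when feature is null, otherwise a test on 'feature'. */
    static final class Node {
        final String feature;              // null for a leaf
        final Map<String, Node> children;  // feature value -> subtree; empty for a leaf
        final String majorityLabel;        // majority TRAINING label at this node ('-' on ties)

        Node(String feature, Map<String, Node> children, String majorityLabel) {
            this.feature = feature;
            this.children = children;
            this.majorityLabel = majorityLabel;
        }
        static Node leaf(String label) { return new Node(null, new LinkedHashMap<>(), label); }
        boolean isLeaf() { return feature == null; }
    }

    /** One labeled example: feature name -> value, plus the true category. */
    static final class Example {
        final Map<String, String> values;
        final String label;
        Example(Map<String, String> values, String label) { this.values = values; this.label = label; }
    }

    static String classify(Node node, Example ex) {
        while (!node.isLeaf()) {
            Node child = node.children.get(ex.values.get(node.feature));
            if (child == null) {
                return node.majorityLabel; // assumption: fall back when a value has no branch
            }
            node = child;
        }
        return node.majorityLabel;
    }

    static double accuracy(Node root, List<Example> examples) {
        int correct = 0;
        for (Example ex : examples) {
            if (classify(root, ex).equals(ex.label)) { correct++; }
        }
        return examples.isEmpty() ? 0.0 : (double) correct / examples.size();
    }

    static void collectInterior(Node node, List<Node> out) {
        if (node.isLeaf()) { return; }
        out.add(node);
        for (Node child : node.children.values()) { collectInterior(child, out); }
    }

    /** Deep-copies the tree, turning 'target' into a leaf labeled with its majority label. */
    static Node copyReplacing(Node node, Node target) {
        if (node == target || node.isLeaf()) { return Node.leaf(node.majorityLabel); }
        Map<String, Node> kids = new LinkedHashMap<>();
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            kids.put(e.getKey(), copyReplacing(e.getValue(), target));
        }
        return new Node(node.feature, kids, node.majorityLabel);
    }

    static Node prune(Node id3Tree, List<Example> tuneSet) {
        Node bestTree = id3Tree;
        double bestAccuracy = accuracy(bestTree, tuneSet);
        boolean progressMade = true;
        while (progressMade) {
            progressMade = false;
            Node currentTree = bestTree;
            List<Node> interior = new ArrayList<>();
            collectInterior(currentTree, interior);
            for (Node n : interior) {
                Node prunedTree = copyReplacing(currentTree, n);
                double newAccuracy = accuracy(prunedTree, tuneSet);
                if (newAccuracy >= bestAccuracy) { // ties favor the smaller tree
                    bestAccuracy = newAccuracy;
                    bestTree = prunedTree;
                    progressMade = true;
                }
            }
        }
        return bestTree;
    }
}
```

Because every accepted candidate has strictly fewer interior nodes than the tree it came from, the loop must terminate even though ties are accepted. The input tree is never modified, since each candidate is a fresh copy.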

Part 1c - Estimating future accuracy

Apply the decision tree produced by Part 1b's pruning algorithm to the TESTING examples. Report this tree's accuracy on these examples, as well as the accuracy of the unpruned tree from Part 1a. Briefly discuss your results.

Problem 2 - Building Decision Trees in Java

In this part of the homework you will implement in Java a simplified version of the decision-tree induction algorithm of Fig 18.5 (recall that we call it ID3). We'll assume that all features are binary valued.

We will not be using a tuning set in this part of the homework.

We are providing a 'real world' dataset for your use. It involves the disease hepatitis, and the task is to predict whether or not a given patient survived. We've divided it into a training set and a testing set. Our dataset is derived from this one. We converted all features to be binary-valued and we replaced missing feature values in a simplistic manner.

Use Wikipedia or an on-line search engine to find out the meaning of those features whose names are unfamiliar to you. This page might help. Do image searches at your own risk :-) Do not turn in feature definitions, but knowing a little about them will help you interpret the decision trees learned. (The cs540 exams will not ask any questions about hepatitis.)

A second sample dataset, one used in early machine-learning research, is also available. We've divided it into a training set and a testing set. You should not turn in anything related to this dataset, but you might want to use it for debugging or to get more experience with decision-tree induction. (Ditto for the Titanic dataset provided for HW0.) The voting dataset is based on actual votes in the US House of Representatives in the 1980s. The task is: given a representative's voting record on the 16 chosen bills, was the representative a Democrat or a Republican? More information is available via the UC-Irvine archive of machine-learning datasets. The voting dataset is one of the few that only involves binary-valued features. Also, hopefully everyone has some intuitions about the task domain.

THE FOLLOWING JAVA FILE IS CLOSELY RELATED TO HW0 AND SO WILL NOT BE PROVIDED UNTIL AFTER CLASS FEB 6; IE, SOON AFTER THE LATEST TIME (10:50AM 2/6/13) THAT HW0 CAN BE TURNED IN.

We have provided some code that reads the data files into some Java data structures. See BuildAndTestDecisionTree.java. You're welcome to use any or all of this provided code. The only requirement is that you create a BuildAndTestDecisionTree class, whose calling convention is as follows:


  java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename> 

Note that you can provide the SAME file name for BOTH training and testing to see how well your code 'fits' the training data. Accuracy on the training set is not of much interest, but it can help during debugging.

See BuildAndTestDecisionTree.java for more details. Do be aware that we will test your solutions on additional datasets beyond the two we provided.

Here is what you need to do:

  1. Use the TRAINING SET of examples to build a decision tree.

  2. Print out the induced decision tree using simple, indented ASCII text. (We'll discuss how one might do this in class soon, if we haven't already; you can also see a sample printout here.)

  3. Categorize the TESTING SET using the induced tree, reporting which examples were INCORRECTLY classified, as well as the FRACTION that were incorrectly classified. Just print out the NAMES of the examples incorrectly classified (though during debugging you might wish to print out the full example to see if it was processed correctly by your decision tree).
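
Steps 2 and 3 could be sketched in Java along these lines. Treat this as one possible layout under assumptions, not the required format: the ReportSketch class, its Node and Example types, the feature names, and the output wording are all hypothetical, and the printout is not guaranteed to match the course's sample. It also assumes every feature value seen at test time has a branch in the tree.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of printing a tree and reporting test-set errors. */
final class ReportSketch {

    static final class Node {
        final String feature;  // null for a leaf
        final String label;    // category predicted at a leaf; null for interior nodes
        final Map<String, Node> children = new LinkedHashMap<>(); // value -> subtree
        Node(String feature, String label) { this.feature = feature; this.label = label; }
    }

    static final class Example {
        final String name;
        final Map<String, String> values;
        final String label;
        Example(String name, Map<String, String> values, String label) {
            this.name = name; this.values = values; this.label = label;
        }
    }

    /** Renders the tree as indented ASCII text, one feature test per line. */
    static String render(Node node, String indent) {
        if (node.feature == null) {
            return indent + node.label + "\n"; // the whole tree is a single leaf
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            Node child = e.getValue();
            sb.append(indent).append(node.feature).append(" = ").append(e.getKey());
            if (child.feature == null) {
                sb.append(": ").append(child.label).append('\n');
            } else {
                sb.append('\n').append(render(child, indent + "  "));
            }
        }
        return sb.toString();
    }

    /** Assumes every feature value in ex has a matching branch. */
    static String classify(Node node, Example ex) {
        while (node.feature != null) {
            node = node.children.get(ex.values.get(node.feature));
        }
        return node.label;
    }

    static List<String> misclassifiedNames(Node root, List<Example> testSet) {
        List<String> wrong = new ArrayList<>();
        for (Example ex : testSet) {
            if (!classify(root, ex).equals(ex.label)) { wrong.add(ex.name); }
        }
        return wrong;
    }

    /** Lists the names of misclassified examples, then the fraction incorrect. */
    static String report(Node root, List<Example> testSet) {
        List<String> wrong = misclassifiedNames(root, testSet);
        StringBuilder sb = new StringBuilder();
        for (String name : wrong) {
            sb.append("Misclassified: ").append(name).append('\n');
        }
        sb.append(String.format("Fraction incorrect: %d/%d", wrong.size(), testSet.size()));
        return sb.toString();
    }
}
```

Building the output as a String rather than printing directly is a choice made for this sketch; it keeps the formatting easy to check, and a single System.out.print call at the end produces the actual printout.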

What to Turn In

Submit your HW1 solution via Moodle. You must turn in only these three files: HW1_P1.pdf, HW1_P2.pdf, and BuildAndTestDecisionTree.java.

Place your written answer to Problem 1 in HW1_P1.pdf. Also turn in your commented Java code in BuildAndTestDecisionTree.java and a neatly written lab report (in HW1_P2.pdf) that includes the material requested above - eg, printed decision trees for the hepatitis testbed, as well as the test-set accuracies and names of misclassified test-set examples. Please aim to make it easy for the TA to find the requested information in what your code prints out when run. (You might want to use a "debug" flag in your code that is set to false by default but when manually set to true more information is printed.)

Be sure to briefly discuss in your lab report for Problem 2 the decision tree learned. Does it make sense? I.e., did it seem to learn something general or did it only 'memorize' the training examples? Is it reasonable that it made the test-set errors it did (if any)? Do not discuss EVERY error if there are more than two; in that case, just discuss two test-set errors. (Most likely you will have something like a half dozen test-set errors.)