CS540 HW1: Learning Decision Trees from Training Examples

Assigned: Friday 9/9/11
Due: Monday 9/26/11
Value: 150 points

Note: this is for Prof. Shavlik's CS 540 section.

Problem 1 - Pruning Decision Trees

Often in machine learning, one randomly divides the available data into three subsets called the training, tuning, and testing sets. (Sometimes the tuning set is called a validation or, especially in decision-tree induction, a pruning set.) One first uses the training set to initially learn, then the tuning set to address overfitting, and finally the testing set to estimate how accurately the 'tuned' result will work in the future. Usually, this entire process is repeated multiple times in order to get a statistically sound estimate of future accuracy. However, in this problem we'll only go through this process once. (Plus, we're only using an unrealistically small sample in order to keep this simple. So in Problem 1 focus on the algorithm, rather than the intelligence of the results. In Problem 2, we'll be using two "real world" datasets.)
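
For readers who want to see the random three-way split in code, here is a minimal Java sketch. The class name SplitSketch, the method split, the fixed seed, and the 40%/30% fractions are all illustrative assumptions; the assignment itself hands you the three sets already divided.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SplitSketch {
    /**
     * Shuffles a copy of the examples and cuts it into TRAIN, TUNE, and TEST
     * lists (in that order). The fractions are illustrative assumptions and
     * are expected to satisfy trainFrac + tuneFrac <= 1.
     */
    static <T> List<List<T>> split(List<T> examples, double trainFrac,
                                   double tuneFrac, long seed) {
        List<T> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed gives a repeatable split

        int n = shuffled.size();
        int trainEnd = Math.min(n, (int) Math.round(n * trainFrac));
        int tuneEnd = Math.min(n, trainEnd + (int) Math.round(n * tuneFrac));

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainEnd)));       // TRAIN
        parts.add(new ArrayList<>(shuffled.subList(trainEnd, tuneEnd))); // TUNE
        parts.add(new ArrayList<>(shuffled.subList(tuneEnd, n)));        // TEST
        return parts;
    }

    public static void main(String[] args) {
        // Seventeen hypothetical example ids, standing in for real examples.
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 17; i++) {
            ids.add(i);
        }
        List<List<Integer>> parts = split(ids, 0.4, 0.3, 42L);
        System.out.println("train=" + parts.get(0).size()
                + " tune=" + parts.get(1).size()
                + " test=" + parts.get(2).size());
    }
}
```

Repeating the whole learn/tune/test process, as mentioned above, would amount to calling split with a different seed each time and averaging the test-set accuracies.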

Assume you are using the following features to represent examples:

     SHAPE        possible values:     Circle, Ellipse, Square, Triangle
     COLOR        possible values:     Red, Blue
     SIZE         possible values:     Medium, Large, Huge
(Since each feature value starts with a different letter, for shorthand we'll just use that initial letter, eg 'C' for Circle.)

Our task will be binary valued, and we'll use '+' and '-' as our category labels.

Here is our TRAIN set:

     SHAPE = S    COLOR = R   SIZE = L    CATEGORY = +
     SHAPE = S    COLOR = R   SIZE = M    CATEGORY = -
     SHAPE = C    COLOR = B   SIZE = H    CATEGORY = +
     SHAPE = E    COLOR = R   SIZE = H    CATEGORY = -
     SHAPE = S    COLOR = B   SIZE = M    CATEGORY = -
     SHAPE = E    COLOR = B   SIZE = L    CATEGORY = -
     SHAPE = C    COLOR = R   SIZE = H    CATEGORY = +

Our TUNE set:

     SHAPE = E    COLOR = R   SIZE = M    CATEGORY = -
     SHAPE = S    COLOR = B   SIZE = H    CATEGORY = +
     SHAPE = E    COLOR = R   SIZE = H    CATEGORY = +
     SHAPE = C    COLOR = R   SIZE = M    CATEGORY = -
     SHAPE = S    COLOR = B   SIZE = L    CATEGORY = +

And our TEST set:

     SHAPE = S    COLOR = B   SIZE = H     CATEGORY = +
     SHAPE = C    COLOR = R   SIZE = L     CATEGORY = +
     SHAPE = T    COLOR = R   SIZE = M     CATEGORY = -
     SHAPE = E    COLOR = B   SIZE = M     CATEGORY = -
     SHAPE = E    COLOR = B   SIZE = H     CATEGORY = -

Part 1a - Inducing the initial decision tree

First, apply the decision-tree algorithm in Fig 18.5 of the text (we'll call this algorithm ID3 from now on) to the TRAIN set. Show all your work.

When multiple features tie as being the best one, choose the one whose name appears earliest in alphabetical order (eg, COLOR before SHAPE before SIZE). When there is a tie in computing MajorityValue, choose '-'.
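
The scoring and tie-breaking rules above can be sketched in Java as follows. This sketch assumes the feature-selection measure is information gain, as is usual for ID3; check this against the text's definition. The class InfoGainSketch, its method names, and the small tolerance EPS are hypothetical, and the counts in main are made up rather than taken from the TRAIN set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InfoGainSketch {
    static final double EPS = 1e-9; // tolerance for treating two gains as tied (an assumption)

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy, in bits, of a node holding pos '+' examples and neg '-' examples. */
    static double entropy(int pos, int neg) {
        if (pos == 0 || neg == 0) {
            return 0.0; // a pure (or empty) node carries no uncertainty
        }
        double p = (double) pos / (pos + neg);
        double q = 1.0 - p;
        return -(p * log2(p) + q * log2(q));
    }

    /**
     * Information gain of a split: entropy before the split minus the
     * size-weighted entropy of the branches. Each int[] is {pos, neg}
     * for one value of the feature.
     */
    static double gain(int pos, int neg, List<int[]> branches) {
        int total = pos + neg;
        double remainder = 0.0;
        for (int[] b : branches) {
            remainder += ((double) (b[0] + b[1]) / total) * entropy(b[0], b[1]);
        }
        return entropy(pos, neg) - remainder;
    }

    /** Best-scoring feature; ties go to the alphabetically earliest name. */
    static String bestFeature(Map<String, Double> gainByFeature) {
        String best = null;
        double bestGain = Double.NEGATIVE_INFINITY;
        // A TreeMap visits names in alphabetical order, so requiring a strict
        // improvement keeps the earliest name when gains tie.
        for (Map.Entry<String, Double> e : new TreeMap<>(gainByFeature).entrySet()) {
            if (e.getValue() > bestGain + EPS) {
                best = e.getKey();
                bestGain = e.getValue();
            }
        }
        return best;
    }

    /** MajorityValue with the assignment's tie rule: a tie yields '-'. */
    static char majorityValue(int pos, int neg) {
        return pos > neg ? '+' : '-';
    }

    public static void main(String[] args) {
        // Hypothetical counts, not taken from the assignment's TRAIN set.
        List<int[]> cleanSplit = Arrays.asList(new int[] {4, 0}, new int[] {0, 4});
        List<int[]> uselessSplit = Arrays.asList(new int[] {2, 2}, new int[] {2, 2});
        System.out.println("gain of clean split:   " + gain(4, 4, cleanSplit));
        System.out.println("gain of useless split: " + gain(4, 4, uselessSplit));
    }
}
```

Comparing gains with a small tolerance rather than exact equality is a design choice: two features that tie on paper can differ in the last floating-point digit when computed.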

Part 1b - Pruning the tree to reduce overfitting

Overfitting occurs when a decision tree conforms too closely to the training data and does not accurately model the underlying concept. One way to address this problem in decision-tree induction is to use a tuning set in conjunction with a pruning algorithm. For this assignment we will use the 'greedy' algorithm sketched below.

  Let bestTree = the tree produced by ID3 on the TRAINING set
  Let bestAccuracy = the accuracy of bestTree on the TUNING set
  Let progressMade = true

  while (progressMade) // Continue as long as improvement on TUNING SET
  {
    Set progressMade = false
    Let currentTree = bestTree

    For each interiorNode N (including the root) in currentTree
    {   // Consider various pruned versions of the current tree
        // and see if any are better than the best tree found so far

        Let prunedTree be a copy of currentTree,
        except replace N by a leaf node
        whose label equals the majority class among TRAINING set
        examples that reached node N (break ties in favor of '-')

        Let newAccuracy = accuracy of prunedTree on the TUNING set

        // Is this pruned tree an improvement, based on the TUNE set?
        // When a tie, go with the smaller tree (Occam's Razor).
        If (newAccuracy >= bestAccuracy)
        {
          bestAccuracy = newAccuracy
          bestTree = prunedTree
          progressMade = true
        }
    }
  }
  return bestTree

Apply the above pruning algorithm to the tree you produced in Part 1a. Show your work (eg, show the intermediate trees considered and their tuning-set accuracies).
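
Although Problem 1 is meant to be worked by hand, it may help to see how the greedy pruning pseudocode could look in Java. The sketch below is only one possible reading of it: the Node and Example types, the helper names, and the fallback used for an unseen feature value are all assumptions, and the tiny tree in main is made up rather than derived from the TRAIN set. Accuracy is compared as a count of correct tuning examples, which is equivalent because the tuning set's size never changes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PruneSketch {
    /** Hypothetical node type; a node whose feature is null is a leaf. */
    static class Node {
        final String feature;  // feature tested here, or null for a leaf
        final char majority;   // majority TRAIN label at this node (ties -> '-')
        final Map<String, Node> kids = new HashMap<>(); // feature value -> subtree
        Node(String feature, char majority) {
            this.feature = feature;
            this.majority = majority;
        }
    }

    /** Hypothetical example type: feature values plus the true category. */
    static class Example {
        final Map<String, String> values = new HashMap<>();
        final char category;
        Example(char category, String... featureValuePairs) {
            this.category = category;
            for (int i = 0; i + 1 < featureValuePairs.length; i += 2) {
                values.put(featureValuePairs[i], featureValuePairs[i + 1]);
            }
        }
    }

    static char classify(Node n, Example ex) {
        while (n.feature != null) {
            Node kid = n.kids.get(ex.values.get(n.feature));
            if (kid == null) {
                return n.majority; // no branch for this value: fall back (an assumption)
            }
            n = kid;
        }
        return n.majority; // a leaf predicts its stored label
    }

    static int countCorrect(Node tree, List<Example> set) {
        int correct = 0;
        for (Example ex : set) {
            if (classify(tree, ex) == ex.category) {
                correct++;
            }
        }
        return correct;
    }

    static void collectInterior(Node n, List<Node> out) {
        if (n.feature == null) {
            return;
        }
        out.add(n);
        for (Node kid : n.kids.values()) {
            collectInterior(kid, out);
        }
    }

    /** Copies the tree rooted at n, replacing target by a majority-label leaf. */
    static Node copyWithLeafAt(Node n, Node target) {
        if (n == target) {
            return new Node(null, n.majority);
        }
        Node copy = new Node(n.feature, n.majority);
        for (Map.Entry<String, Node> e : n.kids.entrySet()) {
            copy.kids.put(e.getKey(), copyWithLeafAt(e.getValue(), target));
        }
        return copy;
    }

    static Node prune(Node id3Tree, List<Example> tune) {
        Node bestTree = id3Tree;
        int bestCorrect = countCorrect(bestTree, tune);
        boolean progressMade = true;
        while (progressMade) {
            progressMade = false;
            Node currentTree = bestTree;
            List<Node> interior = new ArrayList<>();
            collectInterior(currentTree, interior);
            for (Node n : interior) {
                Node prunedTree = copyWithLeafAt(currentTree, n);
                int newCorrect = countCorrect(prunedTree, tune);
                if (newCorrect >= bestCorrect) { // ties favor the smaller tree
                    bestCorrect = newCorrect;
                    bestTree = prunedTree;
                    progressMade = true;
                }
            }
        }
        return bestTree;
    }

    public static void main(String[] args) {
        // Hypothetical two-feature tree and tuning set, unrelated to Problem 1's data.
        Node inner = new Node("B", '-');
        inner.kids.put("0", new Node(null, '+'));
        inner.kids.put("1", new Node(null, '-'));
        Node root = new Node("A", '+');
        root.kids.put("0", new Node(null, '+'));
        root.kids.put("1", inner);
        List<Example> tune = new ArrayList<>();
        tune.add(new Example('-', "A", "1", "B", "0"));
        tune.add(new Example('-', "A", "1", "B", "1"));
        tune.add(new Example('+', "A", "0", "B", "0"));
        System.out.println("correct before pruning: " + countCorrect(root, tune));
        System.out.println("correct after pruning:  " + countCorrect(prune(root, tune), tune));
    }
}
```

Storing the training-set majority label in every node when the tree is built means pruning never has to revisit the TRAIN examples.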

Part 1c - Estimating future accuracy

Apply the decision tree produced by Part 1b's pruning algorithm to the TESTING examples. Report this tree's accuracy on these examples, as well as the accuracy of the unpruned tree from Part 1a. Briefly discuss your results.

Problem 2 - Building Decision Trees in Java

In this part of the homework you will implement in Java a simplified version of the decision-tree induction algorithm of Fig 18.5 (recall that we call it ID3). We'll assume that all features are binary valued.

We will not be using a tuning set in this part of the homework.

We are providing two 'real world' datasets for your use. One involves predicting who survived the sinking of the Titanic and only has a few features. We've divided it into a training set and a testing set.

A second sample dataset, one used in early machine-learning research, is also available. We've divided it into a training set and a testing set. This dataset is based on actual votes in the US House of Representatives in the 1980s. The task is: given a representative's voting record on the 16 chosen bills, was the representative a Democrat or a Republican? More information is available via the UC-Irvine archive of machine-learning datasets. The voting dataset is rather old, but it is one of the few that only involves binary-valued features. Also, hopefully everyone has some intuitions about the task domain.

We have provided some code that reads the data files into some Java data structures. See BuildAndTestDecisionTree.java. You're welcome to use any or all of this provided code. The only requirement is that you create a BuildAndTestDecisionTree class, whose calling convention is as follows:


  java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename> 

See BuildAndTestDecisionTree.java for more details. Do be aware that we will test your solutions on additional datasets.
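
As a rough illustration of the calling convention only, here is a sketch of how a main method might check its two arguments. ArgsSketch and parseArgs are hypothetical names; the provided BuildAndTestDecisionTree.java may handle this differently, and your own class must still be named BuildAndTestDecisionTree.

```java
class ArgsSketch {
    /** Returns {trainsetFilename, testsetFilename}, or throws if the count is wrong. */
    static String[] parseArgs(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                "usage: java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename>");
        }
        return new String[] { args[0], args[1] };
    }

    public static void main(String[] args) {
        try {
            String[] files = parseArgs(args);
            System.out.println("train file: " + files[0]);
            System.out.println("test file:  " + files[1]);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
            System.exit(1);
        }
    }
}
```

Taking both filenames from the command line, rather than hard-coding them, is what lets the same program be run on the additional datasets mentioned above.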

Here is what you need to do:

  1. Use the TRAINING SET of examples to build a decision tree.

  2. Print out the induced decision tree (using simple, indented ASCII text; we'll discuss how one might do this in class some day soon).

  3. Categorize the TESTING SET using the induced tree, reporting which examples were INCORRECTLY classified, as well as the FRACTION that were incorrectly classified. Just print out the NAMES of the examples incorrectly classified (though during debugging you might wish to print out the full example to see if it was processed correctly by your decision tree).
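
Steps 2 and 3 can be sketched as below. This is one possible layout, not a required output format: the Node and Example types, the "feature = value : label" line style, and the labels "pos" and "neg" are all assumptions made for illustration, and the sketch assumes every feature value seen at test time has a branch in the tree.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReportSketch {
    /** Hypothetical node type: interior nodes have a feature, leaves have a label. */
    static class Node {
        final String feature; // null for a leaf
        final String label;   // null for an interior node
        final Map<String, Node> kids = new LinkedHashMap<>(); // keeps insertion order
        Node(String feature, String label) {
            this.feature = feature;
            this.label = label;
        }
    }

    /** Hypothetical example type: a name, the true label, and feature values. */
    static class Example {
        final String name;
        final String label;
        final Map<String, String> values = new LinkedHashMap<>();
        Example(String name, String label, String... featureValuePairs) {
            this.name = name;
            this.label = label;
            for (int i = 0; i + 1 < featureValuePairs.length; i += 2) {
                values.put(featureValuePairs[i], featureValuePairs[i + 1]);
            }
        }
    }

    /** Renders the tree as indented ASCII text, two spaces per level. */
    static String render(Node n, String indent) {
        if (n.feature == null) {
            return indent + n.label + "\n"; // the whole tree is a single leaf
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Node> e : n.kids.entrySet()) {
            Node kid = e.getValue();
            sb.append(indent).append(n.feature).append(" = ").append(e.getKey());
            if (kid.feature == null) {
                sb.append(" : ").append(kid.label).append('\n');
            } else {
                sb.append('\n').append(render(kid, indent + "  "));
            }
        }
        return sb.toString();
    }

    /** Assumes the tree has a branch for every value the example can take. */
    static String classify(Node n, Example ex) {
        while (n.feature != null) {
            n = n.kids.get(ex.values.get(n.feature));
        }
        return n.label;
    }

    /** Names of the examples whose predicted label differs from the true one. */
    static List<String> misclassified(Node tree, List<Example> testSet) {
        List<String> names = new ArrayList<>();
        for (Example ex : testSet) {
            if (!classify(tree, ex).equals(ex.label)) {
                names.add(ex.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical tree and test examples, not drawn from the provided datasets.
        Node inner = new Node("B", null);
        inner.kids.put("y", new Node(null, "neg"));
        inner.kids.put("n", new Node(null, "pos"));
        Node root = new Node("A", null);
        root.kids.put("y", new Node(null, "pos"));
        root.kids.put("n", inner);

        List<Example> testSet = new ArrayList<>();
        testSet.add(new Example("ex1", "pos", "A", "y", "B", "y"));
        testSet.add(new Example("ex2", "pos", "A", "n", "B", "y"));
        testSet.add(new Example("ex3", "pos", "A", "n", "B", "n"));

        System.out.print(render(root, ""));
        List<String> wrong = misclassified(root, testSet);
        System.out.println("Misclassified: " + wrong);
        System.out.println("Fraction incorrect: " + wrong.size() + "/" + testSet.size());
    }
}
```

Building the tree text as a String, rather than printing inside the recursion, makes the output easy to check while debugging.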

Giving Us Access to Your Java Code

In order to use the lab-supported handin program, you will need to modify your "unix path." To do this, follow the instructions on the CSL web page. You will be adding the path /s/handin/bin within your .bashrc.local file. (If you have never done this before, see the TA for help soon; do not wait until the last minute.)

Run the following command:

    handin -c cs540-2 -a <assignment_name> -d <directory>

Note: <assignment_name> is the name of each assignment (for this homework it is HW1) and <directory> is the path to the directory where all your files are located.

Put your solution in one of your directories, and run the above command to turn in your program. To make sure you did it successfully, go to "~cs540-2/handin/{yourName}/{assignment}" and check that all your files have been copied there.

Be sure to save elsewhere a copy of the code you place in this directory, in case something gets messed up during grading. Also, once the due date has passed, do not alter your code in your handin directory or you may be charged late points.

Finally, be sure to carefully match the required names (including the use of upper and lowercase). This will greatly simplify the TA's job, so that he can better spend his allotted hours on more important aspects of grading. After HW1, we are likely to subtract 2-3 points for misnamed files and directories.

What to Turn In

Put your solution in the hand-in directory as described above. In addition to your written answer to Problem 1, also turn in a printout of your commented Java code and a neatly written lab report that includes the material requested above - eg, printed decision trees for the two provided testbeds, as well as the test-set accuracies and names of misclassified test-set examples. Include instructions on how to compile and run your program in the lab report. Please aim to make it easy for the TA to find the requested information in what your code prints out when run. (You might want to use a "debug" flag in your code that is set to false by default but when manually set to true more information is printed.)

Be sure to briefly discuss the two decision trees learned. Do they make sense? I.e., did they seem to learn something general or did they only 'memorize' the training examples? Is it reasonable that they made the test-set errors they made (if any)? No need to discuss EVERY error; discuss at most two test-set errors per dataset.