CS540 HW1: Learning Decision Trees from Training Examples

Assigned: Friday 2/1/08
Due: Monday 2/18/08
Value: 100 points

Problem 1 - Pruning Decision Trees

Often in machine learning, one randomly divides the available data into three subsets called the training, tuning, and testing sets. (Sometimes the tuning set is called a validation set or, especially in decision-tree induction, a pruning set.) One first uses the training set to learn an initial model, then the tuning set to address overfitting, and finally the testing set to estimate how accurately the 'tuned' result will work in the future. Usually, this entire process is repeated multiple times in order to get a statistically sound estimate of future accuracy. However, in this problem we'll only go through this process once. (Plus, we're only using an unrealistically small sample in order to keep this simple. So in Problem 1, focus on the algorithm rather than the intelligence of the results. In Problem 2, we'll be using a "real world" dataset.)

Assume you are using the following features to represent examples:

     SHAPE        possible values:     Circle, Ellipse, Square, Triangle
     AGE          possible values:     Young, Old
     WORTH        possible values:     Low, High
(Since each feature value starts with a different letter, for shorthand we'll just use that initial letter, eg 'C' for Circle.)

Our task will be binary valued, and we'll use '+' and '-' as our category labels.

Here is our TRAIN set:

     SHAPE = C    AGE = Y   WORTH = L    CATEGORY = -
     SHAPE = E    AGE = O   WORTH = L    CATEGORY = -
     SHAPE = C    AGE = Y   WORTH = H    CATEGORY = -
     SHAPE = C    AGE = O   WORTH = H    CATEGORY = -
     SHAPE = S    AGE = O   WORTH = H    CATEGORY = +
     SHAPE = E    AGE = Y   WORTH = L    CATEGORY = +
     SHAPE = E    AGE = Y   WORTH = H    CATEGORY = +

Our TUNE set:

     SHAPE = C    AGE = O   WORTH = L    CATEGORY = -
     SHAPE = S    AGE = O   WORTH = H    CATEGORY = +
     SHAPE = E    AGE = O   WORTH = H    CATEGORY = +
     SHAPE = E    AGE = Y   WORTH = H    CATEGORY = -
     SHAPE = C    AGE = Y   WORTH = L    CATEGORY = -

And our TEST set:

     SHAPE = C    AGE = O   WORTH = H     CATEGORY = -
     SHAPE = C    AGE = Y   WORTH = L     CATEGORY = -
     SHAPE = T    AGE = Y   WORTH = H     CATEGORY = +
     SHAPE = E    AGE = Y   WORTH = L     CATEGORY = +
     SHAPE = E    AGE = O   WORTH = H     CATEGORY = +

Part 1a - Inducing the initial decision tree

First, apply the decision-tree algorithm in Fig 18.5 of the text (we'll call this algorithm ID3 from now on) to the TRAIN set. Show all your work.

When multiple features tie as being the best one, choose the one whose name appears earliest in alphabetical order (eg, AGE before SHAPE before WORTH). When there is a tie in computing MajorityValue, choose '-'.
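
The two tie-breaking rules above can be sketched in Java. This is only an illustration under stated assumptions, not the provided course code: the class name SplitChooser, its method names, and the count-array input format (one {positives, negatives} pair per feature value) are all hypothetical. Features are scanned in alphabetical order and replaced only on a strictly larger information gain, so the earliest name wins a tie; majorityValue returns '-' unless '+' is strictly more common.

```java
import java.util.Map;
import java.util.TreeMap;

final class SplitChooser {
    // Entropy, in bits, of a set with pos positive and neg negative examples.
    static double entropy(int pos, int neg) {
        int total = pos + neg;
        if (total == 0 || pos == 0 || neg == 0) return 0.0;
        double p = (double) pos / total;
        double q = (double) neg / total;
        return -(p * log2(p) + q * log2(q));
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // counts[v] = {positives, negatives} among examples with the v-th value
    // of one feature (hypothetical input format).
    static double gain(int[][] counts) {
        int pos = 0, neg = 0;
        for (int[] c : counts) { pos += c[0]; neg += c[1]; }
        int total = pos + neg;
        if (total == 0) return 0.0;
        double remainder = 0.0;
        for (int[] c : counts) {
            remainder += ((double) (c[0] + c[1]) / total) * entropy(c[0], c[1]);
        }
        return entropy(pos, neg) - remainder;
    }

    // A TreeMap iterates its String keys in sorted order, and the strict '>'
    // keeps the alphabetically earliest feature when gains tie.
    static String bestFeature(TreeMap<String, int[][]> countsByFeature) {
        String best = null;
        double bestGain = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, int[][]> e : countsByFeature.entrySet()) {
            double g = gain(e.getValue());
            if (g > bestGain) { bestGain = g; best = e.getKey(); }
        }
        return best;
    }

    // Ties go to '-'.
    static char majorityValue(int pos, int neg) {
        return pos > neg ? '+' : '-';
    }
}
```

For instance, with made-up counts (not taken from the TRAIN set) in which AGE and WORTH both split four examples as {{2,0},{0,2}}, bestFeature returns AGE. Gains that differ only by floating-point rounding are not treated as ties here; a more careful version might compare them with a small tolerance.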

Part 1b - Pruning the tree to reduce overfitting

Overfitting occurs when a decision tree conforms too closely to the training data and does not accurately model the underlying concept. One way to address this problem in decision-tree induction is to use a tuning set in conjunction with a pruning algorithm. For this assignment we will use the 'greedy' algorithm sketched below.

  Let bestTree = the tree produced by ID3 on the TRAINING set
  Let bestAccuracy = the accuracy of bestTree on the TUNING set
  Let progressMade = true

  while (progressMade) // Continue as long as improvement on TUNING SET
  {
    Set progressMade = false
    Let currentTree = bestTree

    For each interiorNode N (including the root) in currentTree
    {   // Consider various pruned versions of the current tree
        // and see if any are better than the best tree found so far

        Let prunedTree be a copy of currentTree,
        except replace N by a leaf node
        whose label equals the majority class among TRAINING set
        examples that reached node N (break ties in favor of '-')

        Let newAccuracy = accuracy of prunedTree on the TUNING set

        // Is this pruned tree an improvement, based on the TUNE set?
        // When a tie, go with the smaller tree (Occam's Razor).
        If (newAccuracy >= bestAccuracy)
        {
          bestAccuracy = newAccuracy
          bestTree = prunedTree
          progressMade = true
        }
    }
  }
  return bestTree

Apply the above pruning algorithm to the tree you produced in Part 1a. Show your work (eg, show the intermediate trees considered and their tuning-set accuracies).
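
Although this problem is to be worked by hand, the pseudocode above translates fairly directly into Java. The following is a minimal sketch under stated assumptions, not a reference solution: GreedyPruner, Node, Example, and every method name are hypothetical, each node is assumed to store the majority TRAINING label of the examples that reached it (ties already broken in favor of '-'), and the fallback for a feature value with no branch is an assumption that the pseudocode does not address.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GreedyPruner {
    static final class Node {
        final String feature;  // feature tested here, or null for a leaf
        final char label;      // leaf: its category; interior: majority TRAINING label
        final Map<String, Node> children = new HashMap<>();
        Node(String feature, char label) { this.feature = feature; this.label = label; }
        boolean isLeaf() { return feature == null; }
    }

    static final class Example {
        final Map<String, String> values;
        final char label;
        Example(Map<String, String> values, char label) { this.values = values; this.label = label; }
    }

    static char classify(Node n, Example e) {
        while (!n.isLeaf()) {
            Node child = n.children.get(e.values.get(n.feature));
            if (child == null) return n.label;  // no branch for this value (assumption)
            n = child;
        }
        return n.label;
    }

    static double accuracy(Node tree, List<Example> set) {
        int correct = 0;
        for (Example e : set) if (classify(tree, e) == e.label) correct++;
        return set.isEmpty() ? 0.0 : (double) correct / set.size();
    }

    static void collectInterior(Node n, List<Node> out) {
        if (n.isLeaf()) return;
        out.add(n);
        for (Node c : n.children.values()) collectInterior(c, out);
    }

    // Copies the tree, turning 'target' (matched by identity) into a leaf
    // labeled with its stored majority TRAINING label.
    static Node copyWithLeafAt(Node n, Node target) {
        if (n == target || n.isLeaf()) return new Node(null, n.label);
        Node copy = new Node(n.feature, n.label);
        for (Map.Entry<String, Node> c : n.children.entrySet()) {
            copy.children.put(c.getKey(), copyWithLeafAt(c.getValue(), target));
        }
        return copy;
    }

    static Node prune(Node id3Tree, List<Example> tuneSet) {
        Node bestTree = id3Tree;
        double bestAccuracy = accuracy(bestTree, tuneSet);
        boolean progressMade = true;
        while (progressMade) {
            progressMade = false;
            Node currentTree = bestTree;
            List<Node> interior = new ArrayList<>();
            collectInterior(currentTree, interior);
            for (Node n : interior) {
                Node prunedTree = copyWithLeafAt(currentTree, n);
                double newAccuracy = accuracy(prunedTree, tuneSet);
                if (newAccuracy >= bestAccuracy) {  // ties favor the pruned tree
                    bestAccuracy = newAccuracy;
                    bestTree = prunedTree;
                    progressMade = true;
                }
            }
        }
        return bestTree;
    }
}
```

The loop terminates because every accepted candidate has strictly fewer interior nodes than currentTree, and a single-leaf tree offers no further candidates.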

Part 1c - Estimating future accuracy

Apply the decision tree produced by Part 1b's pruning algorithm to the TESTING examples. Report this tree's accuracy on these examples, as well as the accuracy of the unpruned tree from Part 1a. Briefly discuss your results.

Problem 2 - Building Decision Trees in Java

In this part of the homework you will implement in Java a simplified version of the decision-tree induction algorithm of Fig 18.5, which we'll call ID3. We'll assume that all features are binary valued.

A sample dataset used in early machine-learning research is available. We've divided it into a training set and a testing set. We will not be using a tuning set in this part of the homework.

This dataset is based on actual votes in the US House of Representatives in the 1980s. The task is: given a representative's voting record on the 16 chosen bills, was the representative a Democrat or a Republican? More information is available via the UC-Irvine archive of machine-learning datasets. The voting dataset is rather old, but it is one of the few that involves only binary-valued features. Also, hopefully everyone has some intuitions about the task domain.

We have provided some code that reads the data files into some Java data structures. See BuildAndTestDecisionTree.java. You're welcome to use any or all of this provided code. The only requirement is that you create a BuildAndTestDecisionTree class, whose calling convention is as follows:


  java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename> 

See BuildAndTestDecisionTree.java for more details. Do be aware that we will test your solutions on additional datasets.
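
As a rough illustration of the calling convention, the sketch below checks for exactly two command-line arguments. It is an assumption-laden example rather than the provided code (which may already handle this): the class name CallingConventionSketch and the helper parseArgs are hypothetical, and your actual entry point must be the BuildAndTestDecisionTree class.

```java
final class CallingConventionSketch {
    // Returns {trainsetFilename, testsetFilename}, or null when the number of
    // arguments does not match the calling convention.
    static String[] parseArgs(String[] args) {
        if (args.length != 2) return null;
        return new String[] { args[0], args[1] };
    }

    public static void main(String[] args) {
        String[] files = parseArgs(args);
        if (files == null) {
            System.err.println(
                "Usage: java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename>");
            System.exit(1);
        }
        System.out.println("Train set file: " + files[0]);
        System.out.println("Test set file:  " + files[1]);
    }
}
```

Since solutions will be run on additional datasets, it is safer to take both filenames from the arguments than to hard-code them.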

Here is what you need to do:

  1. Use the TRAINING SET of examples to build a decision tree.

  2. Print out the induced decision tree (using simple, indented ASCII text; we'll discuss how one might do this in class some day).

  3. Categorize the TESTING SET using the induced tree, reporting which examples were INCORRECTLY classified, as well as the FRACTION that were incorrectly classified. Just print out the NAMES of the examples incorrectly classified (though during debugging you might wish to print out the full example to see if it was processed correctly by your decision tree).
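
Steps 2 and 3 might look roughly like the following sketch. Everything here is an assumption made for illustration: the names TreeReportSketch, Node, Example, render, misclassifiedNames, and fractionWrong are hypothetical, the output layout is just one possible indented format, and the sketch assumes every feature value seen at test time has a branch in the tree (reasonable when all features are binary and both branches are always created).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TreeReportSketch {
    static final class Node {
        final String feature;  // feature tested here, or null for a leaf
        final String label;    // category label, meaningful only at a leaf
        final Map<String, Node> children = new LinkedHashMap<>();
        Node(String feature, String label) { this.feature = feature; this.label = label; }
    }

    static final class Example {
        final String name;
        final Map<String, String> values;
        final String label;
        Example(String name, Map<String, String> values, String label) {
            this.name = name; this.values = values; this.label = label;
        }
    }

    // Appends one line per branch, indenting four more spaces at each level;
    // a leaf is written as "-> label".
    static void render(Node n, String indent, StringBuilder out) {
        if (n.feature == null) {
            out.append(indent).append("-> ").append(n.label).append('\n');
            return;
        }
        for (Map.Entry<String, Node> e : n.children.entrySet()) {
            out.append(indent).append(n.feature).append(" = ").append(e.getKey()).append('\n');
            render(e.getValue(), indent + "    ", out);
        }
    }

    // Assumes a branch exists for every feature value encountered.
    static String classify(Node n, Example e) {
        while (n.feature != null) n = n.children.get(e.values.get(n.feature));
        return n.label;
    }

    static List<String> misclassifiedNames(Node tree, List<Example> testSet) {
        List<String> names = new ArrayList<>();
        for (Example e : testSet) {
            if (!classify(tree, e).equals(e.label)) names.add(e.name);
        }
        return names;
    }

    static double fractionWrong(Node tree, List<Example> testSet) {
        if (testSet.isEmpty()) return 0.0;
        return (double) misclassifiedNames(tree, testSet).size() / testSet.size();
    }
}
```

Calling render(root, "", sb) and printing sb.toString() gives the indented tree, while misclassifiedNames and fractionWrong supply the two items requested in step 3.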

Giving Us Access to Your Java Code

In order to use the lab-supported handin program, you will need to modify your "unix path." To do this, follow the instructions on the CSL web page. You will be adding the path /s/handin/bin within your .cshrc.local file. (If you have never done this before, see the TA for help soon; do not wait until the last minute.)

Run the following command:

    handin -c cs540-1 -a <assignment_name> -d <directory>

Note: <assignment_name> is the name of the assignment and <directory> is the path to the directory where all your files are located.

Put your solution in one of your directories, and run the above command to turn in your program. To make sure you did it successfully, go to "~cs540-1/handin/{yourName}/{assignment}" and check that all the files were copied there.

Be sure to save elsewhere a copy of the code you place in this directory, in case something gets messed up during grading. Also, once the due date has passed, do not alter your code in your handin directory or you may be charged late points.

Finally, be sure to carefully match the required names (including the use of upper- and lowercase). This will greatly simplify the TA's job, so she can better spend her allotted hours on more important aspects of grading. After HW1, we are likely to subtract 2-3 points for misnamed files and directories.

What to Turn In

Put your solution in the hand-in directory as described above. In addition to your written answer to Problem 1, also turn in a printout of your commented Java code and a neatly written lab report that includes the material (eg, printed decision tree) and information requested above.