CS540 HW1: Learning Decision Trees from Training Examples (and a Simple k-NN question)

Assigned: Tuesday 9/13/16
Due: Tuesday 9/27/16 at 11:55pm (can be turned in late up until 11:55pm 10/4/16)
Value: 150 points

Note: this is for Prof. Shavlik's CS 540 section.

Problem 1 - Learning Decision Trees (paper-and-pencil)

It is fine to write by hand your solution to this problem, but please be sure your scanned PDF file is legible before turning it in.

Often in machine learning, one randomly divides the available data into three subsets called the training, tuning, and testing sets. (Sometimes the tuning set is called a validation set or, especially in decision-tree induction, a pruning set.) One first uses the training set to learn an initial model, then the tuning set to address overfitting, and finally the testing set to estimate how accurately the 'tuned' result will work in the future. Usually, this entire process is repeated multiple times in order to get a statistically sound estimate of future accuracy. However, in this problem we will only address training and testing, and we will only go through this process once. (Plus, we're only using an unrealistically small sample in order to keep this simple. So in Problem 1 focus on the algorithm, rather than the intelligence of the results. In Problem 3, we'll be using a "real world" dataset.)
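To make the train/tune/test split concrete, here is a minimal Java sketch of the random three-way partition described above (the 60/20/20 proportions and the class name `SplitData` are illustrative choices, not anything this assignment requires):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SplitData {
    // Shuffle the examples with a fixed seed (for reproducibility), then
    // carve off 60% for TRAIN, 20% for TUNE, and the rest for TEST.
    public static List<List<String>> split(List<String> examples, long seed) {
        List<String> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));
        int trainEnd = (int) (0.6 * shuffled.size());
        int tuneEnd  = (int) (0.8 * shuffled.size());
        List<List<String>> parts = new ArrayList<>();
        parts.add(shuffled.subList(0, trainEnd));               // TRAIN
        parts.add(shuffled.subList(trainEnd, tuneEnd));         // TUNE
        parts.add(shuffled.subList(tuneEnd, shuffled.size()));  // TEST
        return parts;
    }
}
```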

In lecture we will cover using a tuning set to choose a good pruned tree, in order to reduce the odds of overfitting. This is a topic that might be on the midterm or final exam, so we recommend you create a tuning set for this task and decide which pruned tree would be chosen. Do not turn in anything about tree pruning for grading, though.

Assume you are using the following features to represent examples:

     COLOR        possible values:     Red, Green, Blue
     AGE          possible values:     Young, Old
     WEIGHT       possible values:     Light, Medium, Heavy
(Since each feature value starts with a different letter, for shorthand we'll just use that initial letter, eg 'R' for Red.)

Our task will be binary valued, and we'll use '+' and '-' as our category labels.

Here is our TRAIN set:

     COLOR = R    AGE = O   WEIGHT = H    CATEGORY = +
     COLOR = B    AGE = Y   WEIGHT = L    CATEGORY = +
     COLOR = G    AGE = Y   WEIGHT = L    CATEGORY = +
     COLOR = R    AGE = Y   WEIGHT = H    CATEGORY = +
     COLOR = G    AGE = O   WEIGHT = L    CATEGORY = -
     COLOR = G    AGE = Y   WEIGHT = L    CATEGORY = -
     COLOR = B    AGE = O   WEIGHT = H    CATEGORY = -

And our TEST set:

     COLOR = B    AGE = Y   WEIGHT = H    CATEGORY = +
     COLOR = G    AGE = Y   WEIGHT = L    CATEGORY = +
     COLOR = B    AGE = O   WEIGHT = L    CATEGORY = - 
     COLOR = R    AGE = Y   WEIGHT = M    CATEGORY = -
     COLOR = R    AGE = O   WEIGHT = L    CATEGORY = -

Part 1a - Inducing the Initial Decision Tree (30 points)

First, apply the decision-tree algorithm in Fig 18.5 of the text to the TRAIN set (we'll call this algorithm ID3 for short from now on; ID3 is a highly influential decision-tree learner created by Ross Quinlan in the early/mid-1980s). Show all your work; as mentioned above, handwritten solutions are fine, but please submit good-quality scans so they are readable during grading.

When multiple features tie as being the best one, choose the one whose name appears earliest in alphabetical order (eg, AGE before COLOR before WEIGHT). When there is a tie in computing MajorityValue, choose '-'.
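As a check on your hand computations, the quantities ID3 uses to rank features are Shannon entropy and information gain. The sketch below (the class and method names are illustrative, not part of the required deliverable) computes both for a binary-labeled set; for example, the full TRAIN set above has 4 positives and 3 negatives:

```java
public class InfoGain {
    // Entropy of a binary-labeled set with pos positives and neg negatives.
    public static double entropy(int pos, int neg) {
        int total = pos + neg;
        if (total == 0) return 0.0;
        double e = 0.0;
        for (int count : new int[] { pos, neg }) {
            if (count == 0) continue;   // 0 * log(0) is taken to be 0
            double p = (double) count / total;
            e -= p * (Math.log(p) / Math.log(2));
        }
        return e;
    }

    // Information gain of a split into branches, where branch i
    // contains pos[i] positives and neg[i] negatives.
    public static double gain(int[] pos, int[] neg) {
        int totalPos = 0, totalNeg = 0;
        for (int i = 0; i < pos.length; i++) {
            totalPos += pos[i];
            totalNeg += neg[i];
        }
        int total = totalPos + totalNeg;
        double remainder = 0.0;
        for (int i = 0; i < pos.length; i++) {
            int branch = pos[i] + neg[i];
            remainder += ((double) branch / total) * entropy(pos[i], neg[i]);
        }
        return entropy(totalPos, totalNeg) - remainder;
    }
}
```

For instance, splitting the 7 TRAIN examples on AGE gives a Young branch with 3 positives and 1 negative and an Old branch with 1 positive and 2 negatives, so `gain(new int[]{3, 1}, new int[]{1, 2})` scores that feature.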

Part 1b - Estimating future accuracy (10 points)

Apply the decision tree produced by Part 1a to the TESTING examples. Report this tree's accuracy on these examples. Briefly discuss your results.

Problem 2 - k-Nearest Neighbors (10 points)

Apply the 1-nearest-neighbor algorithm, using the TRAIN set of Problem 1 as the stored examples, to predict the output for the third testset example. Show your work. Break any ties in the same manner as used in Problem 1.
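Since all three features here are discrete, a natural distance measure is the Hamming distance (the number of features on which two examples disagree). A minimal sketch, assuming that measure and breaking ties among equally near neighbors by majority vote with '-' winning exact ties, as in Problem 1:

```java
public class OneNN {
    // Hamming distance: number of feature positions where a and b differ.
    public static int hamming(char[] a, char[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) d++;
        }
        return d;
    }

    // Predict the query's label from the closest training example(s);
    // ties among nearest neighbors go to majority vote, with '-' winning
    // an exact tie (the Problem 1 convention).
    public static char predict(char[][] train, char[] labels, char[] query) {
        int best = Integer.MAX_VALUE, pos = 0, neg = 0;
        for (int i = 0; i < train.length; i++) {
            int d = hamming(train[i], query);
            if (d < best) { best = d; pos = 0; neg = 0; }   // new closest distance
            if (d == best) {
                if (labels[i] == '+') pos++; else neg++;
            }
        }
        return pos > neg ? '+' : '-';
    }
}
```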

Problem 3 - Building Decision Trees in Java (100 points)

In this part of the homework you will implement in Java a simplified version of the decision-tree induction algorithm of Fig 18.5 (recall that we call it ID3). We'll assume that all features are binary valued.

We will not be using a tuning set in this part of the homework. In HW2 we will investigate an alternate method for reducing overfitting in decision trees.
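The recursive skeleton of Fig 18.5, specialized to binary features and binary labels, looks roughly like the sketch below. The `Example`/`Node` representation, field names, and use of information gain as the scoring function are illustrative assumptions; your `BuildAndTestDecisionTree` class need not be organized this way.

```java
import java.util.ArrayList;
import java.util.List;

public class ID3Sketch {
    // A node is either a leaf (featureIndex == -1, label set) or an
    // internal test on one binary feature with two children.
    static class Node {
        int featureIndex = -1;
        boolean label;
        Node ifFalse, ifTrue;
    }

    static class Example {
        boolean[] features;
        boolean label;
        Example(boolean[] f, boolean l) { features = f; label = l; }
    }

    static double entropy(List<Example> exs) {
        if (exs.isEmpty()) return 0.0;
        int pos = 0;
        for (Example e : exs) if (e.label) pos++;
        double h = 0.0;
        for (double p : new double[] { (double) pos / exs.size(),
                                       (double) (exs.size() - pos) / exs.size() }) {
            if (p > 0) h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    static Node build(List<Example> exs, List<Integer> remaining, boolean parentMajority) {
        Node node = new Node();
        if (exs.isEmpty()) { node.label = parentMajority; return node; }
        int pos = 0;
        for (Example e : exs) if (e.label) pos++;
        boolean majority = pos * 2 > exs.size();   // exact tie -> false, i.e. '-'
        if (pos == 0 || pos == exs.size() || remaining.isEmpty()) {
            node.label = majority;                 // pure set or no features left
            return node;
        }
        // Pick the remaining feature with the highest information gain.
        int bestF = remaining.get(0);
        double bestGain = -1.0;
        for (int f : remaining) {
            List<Example> lo = new ArrayList<>(), hi = new ArrayList<>();
            for (Example e : exs) (e.features[f] ? hi : lo).add(e);
            double remainder = (lo.size() * entropy(lo) + hi.size() * entropy(hi))
                               / exs.size();
            double gain = entropy(exs) - remainder;
            if (gain > bestGain) { bestGain = gain; bestF = f; }
        }
        node.featureIndex = bestF;
        List<Integer> rest = new ArrayList<>(remaining);
        rest.remove(Integer.valueOf(bestF));
        List<Example> lo = new ArrayList<>(), hi = new ArrayList<>();
        for (Example e : exs) (e.features[bestF] ? hi : lo).add(e);
        node.ifFalse = build(lo, rest, majority);
        node.ifTrue  = build(hi, rest, majority);
        return node;
    }

    static boolean classify(Node n, boolean[] features) {
        while (n.featureIndex != -1) n = features[n.featureIndex] ? n.ifTrue : n.ifFalse;
        return n.label;
    }
}
```

Note how the parent's majority label is passed down so that an empty branch still gets a sensible prediction, matching Fig 18.5's use of the parent's plurality value.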

Recall the 'real world' dataset about wine from HW0. It involves predicting whether or not a given wine is highly rated. We have divided it into a training set and a testing set.

A second sample dataset, one used in early machine-learning research, is also available: the voting dataset. We have divided it into a training set and a testing set. You should not turn in anything related to this dataset, but you might want to use it for debugging or to get more experience with decision-tree induction. (Ditto for the Titanic dataset provided for HW0.) The voting dataset is based on actual votes in the US House of Representatives in the 1980s. The task is: given a representative's voting record on the 16 chosen bills, was the representative a Democrat or a Republican? More information is available via the UC-Irvine archive of machine-learning datasets. The voting dataset is one of the few there that involves only binary-valued features. Also, hopefully everyone has some intuitions about the task domain.

We also recommend you create 2-3 simple and 'meaning-free' datasets for debugging, ones where you can compute by hand the correct answer (you might even want to use your code to check your answer to Problem 1 of this homework!). By 'meaning free,' we suggest you simply call the features F1, F2, etc. Also consider looking at some old CS540 midterm exams for simple datasets.

We have provided some code (will be released 9/15/16) that reads the data files into some Java data structures. PLEASE DO NOT LOOK AT THIS FILE UNTIL YOU HAVE TURNED IN YOUR HW0. See BuildAndTestDecisionTree.java. You're welcome to use any or all of this provided code. The only requirement is that you create a BuildAndTestDecisionTree class, whose calling convention is as follows:

  java BuildAndTestDecisionTree <trainsetFilename> <testsetFilename> 

Note that you can provide the SAME file name for BOTH training and testing to see how well your tree 'fits' the training data (it should get them all correct except for the 'extreme noise' case that was discussed in class). Accuracy on the training set is not of much interest, but it can help during debugging.

See BuildAndTestDecisionTree.java for more details. Do be aware that we will test your solutions on additional datasets beyond those we provided.

Here is what you need to do:

  1. Use the TRAINING SET of examples to build a decision tree.

  2. Print out the induced decision tree (using simple, indented ASCII text; we'll discuss how one might do this in class).

  3. Categorize the TESTING SET using the induced tree, reporting which examples were INCORRECTLY classified, as well as the FRACTION that were incorrectly classified. Just print out the NAMES of the examples incorrectly classified (though during debugging you might wish to print out the full example to see if it was processed correctly by your decision tree).
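Steps 2 and 3 above can be sketched as follows. The `Node` shape, its field names, and the exact output wording here are illustrative assumptions; only the indented-ASCII printout and the misclassification report themselves are required.

```java
import java.util.List;

public class TreePrinter {
    // A node is a leaf (feature == null, label set) or a test on one
    // binary feature with a false-branch and a true-branch.
    static class Node {
        String feature;       // null means this node is a leaf
        String label;         // used only at leaves
        Node ifFalse, ifTrue;
    }

    // Step 2: render the tree as simple indented ASCII text,
    // two spaces of indentation per level of depth.
    static String render(Node n, int depth) {
        StringBuilder sb = new StringBuilder();
        String indent = "  ".repeat(depth);
        if (n.feature == null) {
            sb.append(indent).append("predict ").append(n.label).append('\n');
        } else {
            sb.append(indent).append(n.feature).append(" = F:\n");
            sb.append(render(n.ifFalse, depth + 1));
            sb.append(indent).append(n.feature).append(" = T:\n");
            sb.append(render(n.ifTrue, depth + 1));
        }
        return sb.toString();
    }

    // Step 3: print the NAMES of misclassified test examples and the
    // fraction misclassified; returns the fraction for convenience.
    static double report(List<String> names, boolean[] predicted, boolean[] actual) {
        int wrong = 0;
        for (int i = 0; i < names.size(); i++) {
            if (predicted[i] != actual[i]) {
                wrong++;
                System.out.println("WRONG: " + names.get(i));
            }
        }
        double fraction = (double) wrong / names.size();
        System.out.printf("Fraction misclassified: %d/%d = %.3f%n",
                          wrong, names.size(), fraction);
        return fraction;
    }
}
```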

What to Turn In

Submit your HW1 solution via Moodle. You must turn in only these THREE files: HW1_P1_P2.pdf, HW1_P3.pdf, and BuildAndTestDecisionTree.java.

Place your written answers to Problems 1 and 2 in HW1_P1_P2.pdf. Also turn in your commented Java code in BuildAndTestDecisionTree.java and a neatly written lab report (in HW1_P3.pdf) that includes the material requested above - e.g., the printed decision tree for the wine testbed, as well as the test-set accuracies and names of misclassified test-set examples. Please aim to make it easy for the TA to find the requested information in what your code prints out when run. (You might want to use a "debug" flag in your code that is false by default but, when manually set to true, causes more information to be printed.)

Be sure to briefly discuss in your lab report for Problem 3 the decision tree learned on the wine dataset. Does it make sense? I.e., did it seem to learn something general or did it only 'memorize' the training examples? Is it reasonable that it made the test-set errors it did (if any)? Do not discuss EVERY error if there are more than two; in that case, just discuss two testset errors, one 'false positive' (a negative example called positive by the decision tree) and one 'false negative' (a positive example called negative by the learned decision tree). Most likely you will have more than one testset error of each type. If not, discuss any two errors.