CS 540 - Introduction To Artificial Intelligence, Spring 2000

Homework #1 - Decision Tree Learning


Part I - Information Gain

Consider the criteria for accepting candidates to the PhD program of the mythical University of St. Nordaf. Each candidate is evaluated according to four attributes: the grade point average (GPA), the quality of the undergraduate university attended, the publication record, and the strength of the recommendation letters. To simplify our example, let us discretize and limit the possible values of each attribute: possible GPA scores are 4.0, 3.7, and 3.5; universities are categorized as top_10, top_20, and top_30 (by top_20 we mean places 11-20, and by top_30 we mean places 21-30); the publication record is a binary attribute - either the applicant has published previously or not; and recommendation letters are similarly binary: they are either good or normal. Finally, the candidates are classified into two classes: accepted, or P (for 'positive'), and rejected, or N (for 'negative'). Here is an example of one possible decision tree determining acceptance.

[Figure: one possible decision tree for deciding acceptance to St. Nordaf's PhD program]

Pat, an applicant, doesn't know this decision tree, but does have the data in the following table regarding twelve of last year's applicants:

[Table: attribute values and classifications for twelve of last year's applicants]

  1. Does the tree shown above correctly classify the given examples?
  2. Pat uses the decision tree algorithm shown in class (with the information gain computations for selecting split variables) to induce the decision tree employed by St. Nordaf's officials. What tree will the algorithm come up with? Show all the computations involved and draw the resulting tree. (The entropy and information gain formulas are recalled just after this list for reference.)
  3. Given the following example: {GPA = 4.0; university = top_10; published = yes; recommendation = normal}, would both trees classify the example the same way? Whether your answer is yes or no, explain why.
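
For reference, the standard entropy and information gain definitions (these should match the formulas presented in lecture) are, in LaTeX notation:

  H(S) = - \sum_{i} p_i \log_2 p_i

  Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

where p_i is the fraction of examples in S that belong to class i (here P or N), and S_v is the subset of S for which attribute A takes value v.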


Part II - Implementing A Decision Tree

In this part you will construct a Java implementation of the ID3 algorithm. You will create a tree that is capable of determining which number is represented in an LED display consisting of seven LEDs. An LED (light-emitting diode) is a type of diode that emits light when current passes through it. LEDs have many uses: visible LEDs are used as indicator lights on all sorts of electronic devices and in moving-message panels. More information is available via the UC-Irvine archive of machine learning datasets (you should take a look at this site). The seven LEDs in our panel can be on or off and are arranged to form the numbers from 0 to 9 as shown here:

[Figure: arrangement of the seven LEDs and the digits 0-9 they form]
The number three, for example, would be formed with:

{led_1 = on; led_2 = off; led_3 = on; led_4 = on; led_5 = off; led_6 = on; led_7 = on}

Your attributes will be the seven LEDs, their values will be on and off, and the classifications will be the numbers 0 through 9. Thus, you will have seven binary attributes and ten classes. There are two supplied data files: the led_Attributes data file contains the attribute names, the number of values for each attribute (2, since they are Boolean), and the possible values for each attribute (1 representing on and 0 representing off). The led_Examples data file contains 1500 examples of LED configurations. The format of each example is the values (1 or 0) for led_1 through led_7, separated by commas, with the number that it represents at the end. The above example (the number three) would appear as the following entry in the file: 1,0,1,1,0,1,1,3. The examples file contains some noisy data, so don't expect every example to be faithful to the rule.
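
As an illustration of the file format only (this is not part of the provided code, and the class and field names below are hypothetical), one line of led_Examples could be parsed roughly as follows:

  // Hypothetical parser for one line of led_Examples, e.g. "1,0,1,1,0,1,1,3".
  // The class name and fields are illustrative only; the provided code has its own classes.
  public class LedExample {
      public final int[] values = new int[7]; // led_1 .. led_7 (1 = on, 0 = off)
      public final int label;                 // the digit (0-9) this configuration represents

      public LedExample(String line) {
          String[] fields = line.trim().split(",");
          for (int i = 0; i < 7; i++) {
              values[i] = Integer.parseInt(fields[i].trim());
          }
          label = Integer.parseInt(fields[7].trim());
      }
  }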

Running ID3
Your program will read the above files, separating the examples into testing, training, and tuning sets; it will then carry out a 5-fold cross validation, running the ID3 algorithm and thus inducing a decision tree at each fold. At the end of each fold, the program should report the tree size (number of interior nodes, number of leaf nodes, and total nodes), then run the testing-set examples through the induced tree and report the tree's accuracy on that set of examples. After finishing the 5-fold cross validation, the program should report the average tree size (just the total nodes this time) and the average accuracy over all 5 folds. This first part comprises the non-pruning part of your program.
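
Recall that at each interior node ID3 selects the attribute with the highest information gain, just as in Part I. Below is a minimal, self-contained sketch of that computation. It assumes each example is an int array holding the attribute values followed by the class label; that representation, and the names entropy and informationGain, are assumptions for illustration only and not the representation used by the provided code.

  // Minimal sketch of the entropy / information-gain computation at the heart of ID3.
  // Assumes each example is an int[] of attribute values followed by the class label;
  // this is NOT the representation used by the provided code.
  import java.util.*;

  public class InfoGain {

      // Entropy of a collection of class labels: -sum_i p_i * log2(p_i)
      static double entropy(List<Integer> labels) {
          Map<Integer, Integer> counts = new HashMap<>();
          for (int label : labels) counts.merge(label, 1, Integer::sum);
          double h = 0.0;
          for (int count : counts.values()) {
              double p = (double) count / labels.size();
              h -= p * (Math.log(p) / Math.log(2));
          }
          return h;
      }

      // Information gain of splitting 'examples' on the attribute at index 'attr'.
      // Positions 0..numAttrs-1 hold attribute values; position numAttrs holds the label.
      static double informationGain(List<int[]> examples, int attr, int numAttrs) {
          List<Integer> allLabels = new ArrayList<>();
          Map<Integer, List<Integer>> labelsByValue = new HashMap<>();
          for (int[] ex : examples) {
              allLabels.add(ex[numAttrs]);
              labelsByValue.computeIfAbsent(ex[attr], v -> new ArrayList<>()).add(ex[numAttrs]);
          }
          double remainder = 0.0;
          for (List<Integer> subset : labelsByValue.values()) {
              remainder += ((double) subset.size() / examples.size()) * entropy(subset);
          }
          return entropy(allLabels) - remainder;
      }
  }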

Pruning ID3
When all of the above is done, the program should repeat the 5-fold cross validation, but this time training only on the examples in each fold's training set that were not set aside for pruning (don't train on these tuning examples!). This time, at the end of each fold, the program should prune the induced tree, using the tuning set to carry out the pruning. For this homework, we will set aside 1/3 of each fold's training examples as the fold's tuning set; the remaining 2/3 will be used for training. All you have to do is specify this on the command line, and the provided code will carry out the separation for you (see the Provided Code section). Sample pseudocode for the pruning algorithm is provided below:

  Let bestTree = the tree produced by ID3 on the TRAINING set
  Let bestAccuracy = the accuracy of bestTree on the TUNING set
  Let progressMade = true

  while (progressMade) // Continue as long as improvement on TUNING SET
  {

    Set progressMade = false
    Let currentTree = bestTree
    For each interiorNode N (including the root) in currentTree
    {
        // Consider various pruned versions of the current tree
        // and see if any are better than the best tree found so far
        Let prunedTree be a copy of currentTree,
          except replace N by a leaf node whose label equals the
          majority class among TRAINING set examples that reached
          node N (break ties in favor of '-')
        Let newAccuracy = accuracy of prunedTree on the TUNING set
        // Is this pruned tree an improvement, based on the TUNING set?
        // When there is a tie, go with the smaller tree (Occam's Razor).
        If (newAccuracy >= bestAccuracy)
        {
          bestAccuracy = newAccuracy
          bestTree = prunedTree
          progressMade = true
        }
    }

  }
  return bestTree
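
Rendered in Java, the pruning loop might look like the sketch below. This is purely illustrative: the PrunableTree interface and its methods (interiorNodes, replaceByMajorityLeaf, accuracy) are hypothetical stand-ins for whatever tree class you build, not part of the provided code.

  // Illustrative Java rendering of the pruning pseudocode above.
  // The PrunableTree interface is hypothetical and only indicates the
  // operations your own tree class would need to support.
  import java.util.List;

  interface PrunableTree<E> {
      List<Object> interiorNodes();             // all interior nodes, including the root
      PrunableTree<E> replaceByMajorityLeaf(Object node, List<E> trainSet);
                                                // copy of this tree with 'node' replaced by a leaf
                                                // labeled with the majority TRAINING-set class at 'node'
      double accuracy(List<E> examples);        // fraction of 'examples' classified correctly
  }

  class Pruner {
      static <E> PrunableTree<E> prune(PrunableTree<E> id3Tree, List<E> trainSet, List<E> tuneSet) {
          PrunableTree<E> bestTree = id3Tree;
          double bestAccuracy = bestTree.accuracy(tuneSet);
          boolean progressMade = true;
          while (progressMade) {                // continue as long as the TUNING-set accuracy improves
              progressMade = false;
              PrunableTree<E> currentTree = bestTree;
              for (Object n : currentTree.interiorNodes()) {
                  PrunableTree<E> prunedTree = currentTree.replaceByMajorityLeaf(n, trainSet);
                  double newAccuracy = prunedTree.accuracy(tuneSet);
                  // On a tie, keep the smaller (pruned) tree (Occam's Razor).
                  if (newAccuracy >= bestAccuracy) {
                      bestAccuracy = newAccuracy;
                      bestTree = prunedTree;
                      progressMade = true;
                  }
              }
          }
          return bestTree;
      }
  }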

Finally, report the final pruned tree's size (in the same way as above) and run the fold's testing-set examples through the pruned tree, reporting the tree's accuracy. At the end of this pruning 5-fold cross validation, report the average accuracy and tree size over all five folds.

For each of the above experiments, you should pick one of the folds, say the last one, and print out the corresponding induced tree. We have provided code for you to do this (see the Provided Code section), but you can create your own if you desire. The printed trees usually do not fit very well on an 8 1/2 x 11 sheet of paper, but it helps a lot to print them in landscape mode. The important thing here is to physically see what the tree looks like in addition to its size and accuracy, so we do not expect to see extremely neat trees.

Program Output
Your program's output should be as follows:

ID3:

Fold # 0:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Fold # 1:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Fold # 2:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Fold # 3:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Fold # 4:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Average Tree Size:     nnn
Average Accuracy:     nnn

Selected Tree:

        (Print out your selected tree)
 

PRUNING ID3:

Fold # 0:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Fold # 1:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Fold # 2:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Fold # 3:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Fold # 4:

        Tree Size:    Interior Nodes - nnn     Leaf Nodes - nnn     Total Nodes - nnn
        Tree Accuracy:   nnn

Average Tree Size:     nnn
Average Accuracy:     nnn

Selected Tree:

        (Print out your tree)

Write a lab report comparing the two experiments above. Did you find any significant differences between them? If so, which trees are generally better, and why?

Provided Code
For this assignment you are provided with some Java code that will read the data files from disk, separate the testing, training, and tuning sets, print the trees, and generate the necessary objects to handle the data. You should download these files to your personal directory and familiarize yourself with them. You are free to modify the code to your liking, or even to create your own routines instead of using these. We have also included in the provided code a skeleton program file, runID3.java, from which you will create your own program. You should not change any of the code that is already in this file, but you can (and will certainly find it necessary to) add other methods and variables to it. When your program is complete, you should be able to run it with the following command:
 

java   runID3   led_Attributes   led_Examples   5  33.33

where runID3 is the name of your program; led_Attributes is the name of the attribute specification file; led_Examples is the name of the examples file; 5 is the number of folds in your cross validation experiments; and 33.33 is the percentage (i.e., 1/3) of each training set that will be set aside as that fold's tuning set.
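
As a rough sketch of how those four arguments might be read inside main (the actual skeleton in runID3.java may handle this differently, so treat the class and variable names here as assumptions):

  // Hypothetical sketch of reading runID3's command-line arguments.
  // The real runID3.java skeleton may already do this differently.
  public class ArgsSketch {
      public static void main(String[] args) {
          String attributesFile = args[0];                      // e.g. led_Attributes
          String examplesFile   = args[1];                      // e.g. led_Examples
          int numFolds          = Integer.parseInt(args[2]);    // e.g. 5
          double tunePercent    = Double.parseDouble(args[3]);  // e.g. 33.33 (% of each training
                                                                // set set aside for tuning)
          System.out.println("folds = " + numFolds + ", tuning % = " + tunePercent);
      }
  }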
 
 

Giving Us Access to Your Java Code
The course staff (and nobody else but you) will need to have access to the code you write. The following shows how to
set this up. PLEASE NOTE THAT EVEN IF YOU DO NOT USE THE COURSE MACHINES TO DEVELOP AND
TEST YOUR HW SOLUTIONS, YOU MUST PLACE YOUR *.java AND *.class FILES IN THIS DIRECTORY,
SINCE OUR TAs WILL RUN YOUR CODE DURING GRADING.

In order to avoid conflicts when the TAs are grading one HW and you're working on the next, each homework will
have a separate directory named HWn, where n is the homework number. In addition, because we often need to make
minor changes to your code to get it to run, you will need to give us write permissions in this directory. The following
steps can be used to do this:

   mkdir ~/CS540-handin/
   cd ~/CS540-handin/
   /usr/afsws/bin/fs setacl . system:anyuser none
   /usr/afsws/bin/fs setacl . gerson write
   /usr/afsws/bin/fs setacl . bo write
   /usr/afsws/bin/fs setacl . mariopi write
   /usr/afsws/bin/fs listacl
   mkdir HW1
   [remember, a new directory must be created for each homework]

Be sure to save elsewhere a copy of the code you place in this directory, in case something gets messed up during
grading. Also, once the due date has passed, do not alter your code in your handin directory or you may be charged
late points.

Please note that there is a space on either side of the dot (.) in the above commands. Also, be sure to carefully match the required names (including the use of upper and lower case). This will greatly simplify the TAs' job so they can better spend their allotted hours on the more important aspects of grading. After HW1, we are likely to subtract 2-3 points for misnamed files and directories.
 

What To Turn In
Besides the answers to Part I of this homework, you should hand in a hardcopy listing of your code and the program's required output (the bold underlined parts of the instructions) as specified in the Program Output section. You must also include a neatly written lab report with the required information (also bold underlined). Points will be taken off if we do not have a hardcopy of your program, even if you placed the corresponding files in your handin directory. It is extremely important to strive for clarity and neatness in this and all future homeworks, since it will definitely benefit you throughout the course.