Part I - Information Gain
Consider the criteria for accepting candidates to the PhD program of the
mythical University of St. Nordaf. Each candidate is evaluated according to four
attributes: the grade point average (GPA), the quality of the undergraduate
university attended, the publication record, and the strength of the
recommendation letters. To simplify our example, let us discretize and limit the
possible values of each attribute: Possible GPA scores are 4.0, 3.7, and 3.5;
universities are categorized as top_10, top_20, and top_30 (by top_20 we mean
places 11-20 and by top_30 we mean places 21-30.); publication record is a
binary attribute - either the applicant has published previously or not; and
recommendation letters are similarly binary, they are either good or normal.
Finally, the candidates are classified into two classes: accepted, or P (for
'positive'), and rejected, or N (for 'negative'). Here is an example of one
possible decision tree determining acceptance.
Pat applicant doesn't now this decision tree, but does have the data in this table regarding
twelve of last year's applicants:
Part II - Implementing A Decision Tree
In this part you will be constructing a java implementation of the ID3 algorithm. Here, you will be creating a tree that is capable of determining what number is represented in a LED display consisting of seven LEDs. A LED (light emitting diode) is a type of diode that emits light when current passes through it. LEDs have many uses, visible LEDs are used as indicator lights on all sorts of electronic devices and in moving-message panels. More information is available via the UC-Irvine archive of machine learning datasets (you should take a look at this site). The seven LEDs in our panel could be on or off and are arranged to form the numbers from 0 to 9 as shown here:
The number three, for example, would be formed with:
{led_1 = on; led_2 = off; led_3 = on; led_4 = on; led_5 = off; led_6 = on; led_7 = on}
Your attributes will be the seven LED's, their values will be on and off, and the classifications will be the numbers 0 through 9. Thus, you will have seven binary attributes and ten classes. There are two supplied data files: the led_Attributes data file contains the attribute names, the number of values for each attribute ( 2 since they are boolean ) and the possible values for each attribute ( 1 representing on and 0 representing off ). The led_Examples data file contains 1500 examples of LED configurations. The format for each example is the values (1 or 0) for led_1 through led_7 separated by commas, and the number that it represents at the end. The above example (the number three) would appear as the following entry in the file: 1,0,1,1,0,1,1,3. The examples file contains some noisy data, so don't expect every example to be faithful to the rule.
Running ID3
Your program will read the above files separating the
examples into testing, training, and tuning sets; then carry out a 5-fold cross
validation running the ID3 algorithm and thus inducing a decision tree at each
fold. At the end of each fold, the program should report the tree
size ( number of interior nodes, number of leaf nodes, and total nodes
), then run the testing set examples through the induced tree and report
the tree's accuracy on that set of examples. After finishing the 5-fold
cross validation, the program should report the average tree size (just
the total nodes this time) and the average accuracy over all 5 folds.
This first part comprises the non-pruning part of your program.
Pruning ID3
When all the above is done, the program should repeat the
5-fold cross validation, but this time training only on the examples in each
fold's training set that were not set aside for pruning(don't train on these
tuning examples!). This time, at the end of each fold, the program should prune
the induced tree using the tuning set to carry out the pruning. For this
homework, we will separate 1/3 of each fold's training examples as the fold's
tuning set and the remaining 2/3 will be used for training. All you have to do
is specify this in the command line and the provided code will carry out this
separation for you (see the Provided Code Section). A sample pseudocode for the
pruning algorithm is provided below:
Let bestTree = the tree produced by ID3 on the TRAINING set
while (progressMade) // Continue as long as improvement on TUNING
SET
{
Set progressMade = false
For each interiorNode N (including the root) in currentTree
// Consider various pruned versions of the current tree
// and see if any are better than the best tree found so far
Let prunedTree be a copy of currentTree,
Let newAccuracy = accuracy of prunedTree on the TUNING set
// Is this pruned tree an improvement, based on the TUNE set?
bestAccuracy = newAccuracy
}
}
}
return bestTree
Finally, report the final pruned tree's size (same way as above) and run the fold's testing set examples through the pruned tree, reporting the tree's accuracy. At the end of this pruning 5-fold cross validation, report the average accuracy and tree size over all five folds.
On each of the above experiments, you should pick one of the folds, say the last one, and print out the corresponding induced tree. We have provided code for you to do this (see Provided Code section), but you can create your own if you desire. The printed trees usually do not fit very well in an 8 1/2 x 11 piece of paper, but it helps a lot to print them in landscape mode. The important thing here is to physically see what the tree looks like in addition its size and accuracy, so we do not expect to see extremely neat trees.
Program Output
Your program's output should be as follows:
ID3:
Fold # 0:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Fold # 1:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Fold # 2:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Fold # 3:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Fold # 4:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Average Tree Size: nnn
Average Accuracy: nnn
Selected Tree:
(Print out your selected
tree)
PRUNING ID3:
Fold # 0:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Fold # 1:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Fold # 2:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Fold # 3:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Fold # 4:
Tree Size:
Interior Nodes - nnn Leaf Nodes: -
nnn Total Nodes - nnn
Tree Accuracy:
nnn
Average Tree Size: nnn
Average Accuracy: nnn
Selected Tree:
(Print out your tree)
Write a lab report comparing both of the above experiments. Did you find any significant differences between both experiments? If so, which trees are generally better, and why?
Provided Code
For this assignment you are provided with some java code that
will read the data files from disk, separate the testing, training and tuning
sets, printing the trees, and generate the necessary objects to handle the data.
You should download these files to your personal directory and familiarize
yourself with them. You are free to modify the code to your own liking or even
create your own routines instead of using these. We have also included in the
provided code a skeleton program file: runID3.java, from which you will create
your own program. You should not change any of the code that is already
in this file, but you can, and will of course find it necessary
to, add other methods and variables to it. When your program is complete,
you should be able to run it with the following command:
java runID3 led_Attributes led_Examples 5 33.33
where runID3 is the name of your program; led_Attributes is the name of the
attribute specification file; led_Examples is the name of the examples file; 5
is the number of folds in your cross validation experiments, and 33.33 is
the fraction (1/3) of the training sets that will be separated as the tuning
sets.
Giving Us Access to Your Java Code
The course staff (and nobody else
but you) will need to have access to the code you write. The following shows how
to
set this up. PLEASE NOTE THAT EVEN IF YOU DO NOT USE
THE COURSE MACHINES TO DEVELOP AND
TEST YOUR HW
SOLUTIONS, YOU MUST PLACE YOUR *.java AND *.class FILES IN THIS DIRECTORY,
SINCE OUR TAs WILL RUN YOUR CODE DURING GRADING.
In order to avoid conflicts when the TAs are grading one HW and you're
working on the next, each homework will
have a separate
directory named HWn, where n is the homework number. In addition, because we
often need to make
minor changes to your code to get it
to run, you will need to give us write permissions in this directory. The
following
steps can be used to do this:
mkdir ~/CS540-handin/
cd
~/CS540-handin/
/usr/afsws/bin/fs setacl .
system:anyuser none
/usr/afsws/bin/fs
setacl . gerson write
/usr/afsws/bin/fs
setacl . bo write
/usr/afsws/bin/fs setacl
. mariopi write
/usr/afsws/bin/fs
listacl
mkdir HW1
[remember, a new directory must be created for each
homework]
Be sure to save elsewhere a copy of the code you place in this directory, in
case something gets messed up during
grading. Also, once
the due date has passed, do not alter your code in your handin directory or you
may be charged
late points.
Please note that there is a space on either side of the dot (.) in the above
commands. Also, be sure to carefully match the required names (include the use
of upper and lowercase). This will greatly simplify the TAs job so they can
better spend their allotted hours on the more important aspects of grading.
After HW1, we are likely to subtract 2-3 points for misnamed files and
directories.
What To Turn In
Besides the answer to Part I of this homework, you
should hand in a hardcopy listing of your code and the program's required output
(the bold underlined parts of the instructions) as specified in
the Program Output section. You must also include a neatly written lab report
with the required information(also bold underlined). Points will
taken off if we do not have a hardcopy of your program, even if you placed
the corresponding files in your handin directory. It is extremely
important to strive for clarity and neatness in this and all future homeworks,
since it will definitely benefit you throughout the course.