Your program should read files that are in the ARFF format. In this format, each instance is described on a single line. The feature values are separated by commas, and the last value on each line is the class label of the instance. Each ARFF file starts with a header section describing the features and the class labels. Lines starting with '%' are comments. See the link above for a brief, but more detailed description of the ARFF format. Your program should handle numeric and nominal attributes, and simple ARFF files (i.e. don't worry about sparse ARFF files and instance weights). Example ARFF files are provided below.
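As a rough illustration of the format described above, here is a minimal Python sketch of a reader for simple (dense) ARFF files. The function name `parse_arff` and the returned data layout are hypothetical choices, not part of the assignment. The sketch assumes there are no missing values, that nominal values contain no commas, and that any attribute not declared with a `{...}` value list is numeric.

```python
import re

def parse_arff(lines):
    """Parse a simple (dense) ARFF file given as an iterable of text lines.

    Returns (attributes, instances). Each attribute is a (name, kind) pair,
    where kind is the string 'numeric' or a list of nominal values.
    """
    attributes, instances, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('%'):      # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            # One instance per line, values separated by commas.
            values = [v.strip().strip("'\"") for v in line.split(',')]
            row = [float(v) if kind == 'numeric' else v
                   for v, (_, kind) in zip(values, attributes)]
            instances.append(row)
        elif lower.startswith('@attribute'):
            m = re.match(r"@attribute\s+('[^']*'|\"[^\"]*\"|\S+)\s+(.*)",
                         line, re.I)
            name, spec = m.group(1).strip("'\""), m.group(2).strip()
            if spec.startswith('{'):
                # Nominal attribute: keep the declared value order.
                kind = [v.strip().strip("'\"")
                        for v in spec.strip('{}').split(',')]
            else:
                kind = 'numeric'                  # assumption: real/integer/numeric
            attributes.append((name, kind))
        elif lower.startswith('@data'):
            in_data = True                        # header ends, instances follow
    return attributes, instances
```

With a file on disk, this could be called as `parse_arff(open(path))`; the last attribute returned would then be the class.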
Your program can assume that (i) the class attribute is binary, (ii) it is named 'class', and (iii) it is the last attribute listed in the header section.
Your program should implement a decision-tree learner according to the following guidelines:
m training instances reaching the node, where m is provided as input to the program, or (iii) no feature has positive information gain, or (iv) there are no more remaining candidate splits at the node.
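To make criteria (iii) and (iv) concrete, here is a small Python sketch of entropy and information gain for a candidate split. The helper names (`entropy`, `info_gain`, `no_useful_split`) are hypothetical, and the sketch assumes a split is represented simply as a list of label lists, one per branch.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels; 0 for an empty list."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, partitions):
    """Information gain of splitting `labels` into `partitions`.

    `partitions` is a list of label lists, one per branch of the split.
    """
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions if p)
    return entropy(labels) - remainder

def no_useful_split(gains):
    """True if there are no candidate splits (iv) or none has positive gain (iii)."""
    return not any(g > 0 for g in gains)
```

For example, splitting `['a', 'a', 'b', 'b']` into `[['a', 'a'], ['b', 'b']]` separates the classes perfectly, while splitting it into `[['a', 'b'], ['a', 'b']]` leaves each branch as mixed as the parent and so yields no gain.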
Your program should be called dt-learn and should accept three command-line arguments as follows:
dt-learn <train-set-file> <test-set-file> m
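If you are writing in Python, one possible way to read the three arguments is with the standard `argparse` module. This is only a sketch; the function name `parse_args` and the attribute names `train_set_file` and `test_set_file` are hypothetical.

```python
import argparse

def parse_args(argv=None):
    """Parse: dt-learn <train-set-file> <test-set-file> m"""
    parser = argparse.ArgumentParser(prog='dt-learn')
    parser.add_argument('train_set_file', help='ARFF file with training instances')
    parser.add_argument('test_set_file', help='ARFF file with test instances')
    parser.add_argument('m', type=int, help='value of m used by the stopping criteria')
    return parser.parse_args(argv)   # argv=None means "use sys.argv[1:]"
```

Calling `parse_args(['heart_train.arff', 'heart_test.arff', '4'])` would yield an object whose `m` field is the integer 4.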
dt-learn that accepts the command-line arguments and invokes the appropriate source-code program and interpreter.
Here are examples of such scripts.
As output, your program should print the tree learned from the training set and its predictions for the test-set instances. For each instance in the test set, your program should print one line of output with spaces separating the fields. Each output line should list the predicted class label, and actual class label. This should be followed by a line listing the number of correctly classified test instances, and the total number of instances in the test set.
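The prediction lines described above could be produced along the following lines. This is a sketch only: `format_predictions` is a hypothetical helper, and the exact layout of the final summary line is an assumption, so match it against the reference outputs provided with the assignment.

```python
def format_predictions(predicted, actual):
    """Return one 'predicted actual' line per test instance, plus a summary line."""
    lines = [f"{p} {a}" for p, a in zip(predicted, actual)]
    correct = sum(p == a for p, a in zip(predicted, actual))
    # Assumed summary layout: number correct, then total number of test instances.
    lines.append(f"{correct} {len(actual)}")
    return lines
```

Printing the result is then just `print("\n".join(format_predictions(predicted, actual)))`.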
Here are the trees and test-set classifications that your code should produce when given heart_train.arff as the training set and heart_test.arff as the test set.
Here are the trees and test-set classifications that your code should produce when given diabetes_train.arff as the training set and diabetes_test.arff as the test set. It is optional to print the number of training instances of each class after each node.
You should plot points for training set sizes that represent 5%, 10%, 20%, 50% and 100% of the instances in each given training file. For each training-set size (except the largest one), randomly draw 10 different training sets and evaluate each resulting decision tree model on the test set. For each training set size, plot the average test-set accuracy and the minimum and maximum test-set accuracy. Be sure to label the axes of your plots. Set the stopping criterion m=4 for these experiments.
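One way to organize this experiment is sketched below in Python. The names here are hypothetical: `learning_curve_points` is an illustrative helper, and `evaluate` stands for a function you would write that trains a tree (with m=4) on the given subset and returns its test-set accuracy. The sketch assumes the training sets are drawn uniformly at random without replacement, which the assignment does not spell out.

```python
import random

def learning_curve_points(train, evaluate,
                          fractions=(0.05, 0.10, 0.20, 0.50, 1.0),
                          repeats=10, seed=0):
    """Return (size, mean_acc, min_acc, max_acc) for each training-set size.

    `evaluate(subset)` is assumed to train on `subset` and return test accuracy.
    """
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    points = []
    for frac in fractions:
        size = max(1, round(frac * len(train)))
        # The full training set can only be drawn one way, so use it once.
        n_draws = 1 if size == len(train) else repeats
        accs = [evaluate(rng.sample(train, size)) for _ in range(n_draws)]
        points.append((size, sum(accs) / len(accs), min(accs), max(accs)))
    return points
```

The resulting tuples can be passed directly to whatever plotting tool you use, with training-set size on one axis and accuracy on the other.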
m used in the stopping criteria. Show points for m = 2, 5, 10 and 20. Be sure to label the axes of your plots.
<Wisc username>_hw2.zip. Upload this zip file as Homework #2 at the course Canvas site.