University of Wisconsin - Madison | CS 540 Lecture Notes | C. R. Dyer |

- "Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time." --Herbert Simon
- "Learning is constructing or modifying representations of what is being experienced." --Ryszard Michalski
- "Learning is making useful changes in our minds." --Marvin Minsky

- Understand and improve efficiency of human learning

For example, use to improve methods for teaching and tutoring people, as done in CAI (Computer-Aided Instruction).

- Discover new things or structure that is unknown to humans

Example: Data mining

- Fill in skeletal or incomplete specifications about a domain

Large, complex AI systems cannot be completely derived by hand and require dynamic updating to incorporate new information. Learning new characteristics expands the domain of expertise and lessens the "brittleness" of the system.

```
          Critic  <----------------  Sensors
            |                           |
            v                           v
    Learning Element <-----> Performance Element -----> Effectors
            |                           ^
            v                           |
    Problem Generator ------------------/
```

- Learning Element makes changes to the system based on how it's doing
- Performance Element is the agent itself that acts in the world
- Critic tells the Learning Element how it is doing (e.g., success or failure) by comparing with a fixed standard of performance
- Problem Generator suggests "problems" or actions that will generate new examples or experiences that will aid in training the system further

We will concentrate on the Learning Element

- Predictive accuracy of classifier
- Speed of learner
- Speed of classifier
- Space requirements

Most common criterion is **predictive accuracy**

**Rote Learning**

One-to-one mapping from inputs to stored representation. "Learning by memorization." Association-based storage and retrieval.

**Induction**

Use specific examples to reach general conclusions

**Clustering**

**Analogy**

Determine correspondence between two different representations

**Discovery**

Unsupervised, specific goal not given

**Genetic Algorithms**

**Reinforcement**

Only feedback (positive or negative reward) is given at the end of a sequence of steps. Requires assigning reward to steps by solving the credit assignment problem: which steps should receive credit or blame for a final result?

- Extrapolate from a given set of examples so that we can make accurate predictions about future examples.
**Supervised versus Unsupervised learning**

Want to learn an unknown function *f*(**x**) = **y**, where **x** is an input example and **y** is the desired output. Supervised learning implies we are given a set of (**x**, **y**) pairs by a "teacher." Unsupervised learning means we are only given the **x**s. In either case, the goal is to estimate *f*.

**Concept learning**

Given a set of examples of some concept/class/category, determine if a given example is an instance of the concept or not. If it is an instance, we call it a **positive example**. If it is not, it is called a **negative example**.

**Problem: Supervised Concept Learning by Induction**

Given a **training set** of positive and negative examples of a concept, construct a description that will accurately classify whether future examples are positive or negative. That is, learn some good estimate of the function *f* given a training set {(**x1**, **y1**), (**x2**, **y2**), ..., (**xn**, **yn**)} where each **yi** is either + (positive) or - (negative).

- Inductive learning is an inherently conjectural process because any knowledge created by generalization from specific facts cannot be proven true; it can only be proven false. Hence, inductive inference is **falsity preserving**, not truth preserving.
- To generalize beyond the specific training examples, we need constraints or **biases** on what *f* is best. That is, learning can be viewed as searching the **Hypothesis Space** H of possible *f* functions.
- A bias allows us to choose one *f* over another.
- A completely unbiased inductive algorithm could only memorize the training examples and could not say anything more about other unseen examples.
- Two types of biases are commonly used in machine learning:

**Restricted Hypothesis Space Bias**

Allow only certain types of *f* functions, not arbitrary ones

**Preference Bias**

Define a metric for comparing *f*s so as to determine whether one is better than another

- Raw input data from sensors are preprocessed to obtain a **feature vector**, **x**, that adequately describes all of the relevant features for classifying examples.
- Each **x** is a list of (attribute, value) pairs. For example, **x** = `(Person = Sue, Eye-Color = Brown, Age = Young, Sex = Female)`. The number of attributes (also called features) is fixed (positive and finite), and each attribute has a fixed, finite number of possible values.
- Each example can be interpreted as a *point* in an *n*-dimensional **feature space**, where *n* is the number of attributes.

The problem with this approach is that it doesn't necessarily generalize well if the examples are not "clustered."

- Goal: Build a decision tree for classifying examples as positive or negative instances of a concept
- Supervised learning, batch processing of training examples, using a preference bias
- A **decision tree** is a tree in which each non-leaf node is associated with an attribute (feature), each leaf node is associated with a classification (+ or -), and each arc is associated with one of the possible values of the attribute at the node from which the arc is directed. For example:

```
                 Color
               /   |   \
        green /   red   \ blue
             /     |     \
           Size    +     Shape
          /   \           /   \
      big/     \small round    \square
        /       \       /       \
       -         +    Size       -
                      /   \
                  big/     \small
                    /       \
                   -         +
```

- Preference Bias: **Ockham's Razor**: The simplest explanation that is consistent with all observations is the best. Here, that means the smallest decision tree that correctly classifies all of the training examples is best.
- Finding the provably smallest decision tree is an NP-Hard problem, so instead of constructing the absolute smallest tree that is consistent with all of the training examples, construct one that is pretty small.
- Decision Tree Construction using a Greedy Algorithm
- Algorithm called ID3 (with successors C4.5 and C5.0), originally developed by Quinlan (1986)
- Top-down construction of the decision tree by recursively selecting the "best attribute" to use at the current node in the tree. Once the attribute is selected for the current node, generate children nodes, one for each possible value of the selected attribute. Partition the examples using the possible values of this attribute, and assign these subsets of the examples to the appropriate child node. Repeat for each child node until all examples associated with a node are either all positive or all negative.

```
function decision-tree-learning(examples, attributes, default)
  ;; examples   is a list of training examples
  ;; attributes is a list of candidate attributes for the current node
  ;; default    is the default value for a leaf node if there are no examples left
  if empty(examples) then return(default)
  if same-classification(examples) then return(class(examples))
  if empty(attributes) then return(majority-classification(examples))
  best = choose-attribute(attributes, examples)
  tree = new node with attribute best
  foreach value v of attribute best do
    v-examples = subset of examples with attribute best = v
    subtree = decision-tree-learning(v-examples, attributes - best,
                                     majority-classification(examples))
    add a branch from tree to subtree with arc labeled v
  return(tree)
```
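As a concrete illustration, the pseudocode above can be sketched in Python. This is a minimal sketch, not the course's reference implementation: examples are represented as dicts with a `"class"` key (an assumption of this sketch), and `choose-attribute` is passed in as a parameter so that any of the selection heuristics discussed next can be plugged in.

```python
from collections import Counter

def majority_class(examples):
    """Most common class label among examples (majority-classification)."""
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, default, choose_attribute):
    """Recursive top-down decision-tree induction, mirroring the pseudocode.

    examples   -- list of dicts mapping attribute name -> value, plus a "class" key
    attributes -- list of candidate attribute names for the current node
    default    -- class label to return if no examples remain
    choose_attribute -- function(attributes, examples) -> best attribute name
    """
    if not examples:
        return default
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                  # all examples share one classification
        return classes.pop()
    if not attributes:
        return majority_class(examples)
    best = choose_attribute(attributes, examples)
    tree = {"attribute": best, "branches": {}}
    for v in {e[best] for e in examples}:  # one child per observed value of best
        v_examples = [e for e in examples if e[best] == v]
        subtree = decision_tree_learning(
            v_examples,
            [a for a in attributes if a != best],
            majority_class(examples),
            choose_attribute)
        tree["branches"][v] = subtree
    return tree
```

A leaf is represented here as a bare class label and an internal node as a dict; passing `lambda attrs, ex: attrs[0]` as `choose_attribute` gives a trivial "first attribute" heuristic for experimentation.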

Some possibilities:

- Random: Select any attribute at random
- Least-Values: Choose the attribute with the smallest number of possible values
- Most-Values: Choose the attribute with the largest number of possible values
- Max-Gain: Choose the attribute that has the largest expected information gain. In other words, try to select the attribute that will result in the smallest expected size of the subtrees rooted at its children.

The C5.0 algorithm uses the Max-Gain method of selecting the best attribute.

- How much (expected) work is required to guess which element I am thinking of in a set S of size |S|? `log2 |S|`. That is, at each step we can ask a yes/no question that eliminates at most 1/2 of the elements remaining. Call this value the *information value* of being told which element it is without having to guess it.
- Given S = P union N, where P and N are two disjoint sets, how hard is it to guess which element I am thinking of in S?
  - if x in P, then `log2 |P| = log2 p` questions are needed, where p = |P|
  - if x in N, then `log2 |N| = log2 n` questions are needed, where n = |N|

  So, the expected number of questions that have to be asked is:

  `(Pr(x in P) * log2 p) + (Pr(x in N) * log2 n)`

  or, equivalently, `(p/(p+n)) log2 p + (n/(p+n)) log2 n`
- So, how much expected work is required to guess which element I am thinking of in a set S after I am told whether the element is in P or N?

  `H(P,N) = log2 |S| - (|P|/|S|) log2 |P| - (|N|/|S|) log2 |N|`

  or, equivalently,

  `H(%P, %N) = -(%P log2 %P) - (%N log2 %N)`

  where %P = |P|/|S| = p/(p+n) (the fraction of examples in S that are positive) and %N = |N|/|S| = n/(p+n) (the fraction of examples in S that are negative).

`H` measures the **information content** or **entropy**, in bits (i.e., the number of yes/no questions that must be asked), associated with a set `S` of examples, which consists of the subset `P` of positive examples and the subset `N` of negative examples. Note: 0 <= H(P,N) <= 1, where 0 => no information and 1 => maximum information.
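In Python, H(%P, %N) might be computed as follows. This is a small sketch (the function name is mine); the convention 0 * log2(0) = 0 handles the homogeneous case:

```python
import math

def H(p_frac, n_frac):
    """Entropy, in bits, of a set whose fraction of positive examples is
    p_frac and whose fraction of negative examples is n_frac.
    Uses the convention 0 * log2(0) = 0 (the `if f > 0` filter below)."""
    return -sum(f * math.log2(f) for f in (p_frac, n_frac) if f > 0)
```

For example, `H(0.5, 0.5)` returns 1.0 (maximum disorder) and `H(1, 0)` returns 0.0 (perfect homogeneity), matching the two worked examples that follow.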

- Example: Perfect Balance (Maximum Disorder) in S: Half the examples in S are positive and half are negative. Hence, %P = %N = 1/2. So,

  ```
  H(1/2, 1/2) = -1/2 log2 1/2 - 1/2 log2 1/2
              = -1/2 (log2 1 - log2 2) - 1/2 (log2 1 - log2 2)
              = -1/2 (0 - 1) - 1/2 (0 - 1)
              = 1/2 + 1/2 = 1
  ```

  => information content is large

- Example: Perfect Homogeneity in S: Say all of the examples in S are positive and none are negative. Then %P = 1 and %N = 0. So,

  ```
  H(1, 0) = -1 log2 1 - 0 log2 0 = -0 - 0 = 0
  ```

  => information content is low

  Low information content is desirable in order to make the smallest tree, because low information content means that most of the examples are classified the SAME, and therefore we would expect the rest of the tree rooted at this node to be quite small in order to differentiate between the two classifications.

More generally, if there are K possible classes, y1, y2, ..., yK, of class variable Y, and at the current node there are N = n1 + n2 + ... + nK examples such that n1 examples are in class y1, ..., and nK examples are in class yK, then the entropy of the set of examples at the current node is

```
H(Y) = -Pr(Y=y1) log2 Pr(Y=y1) - ... - Pr(Y=yK) log2 Pr(Y=yK)
     = -p1 log2 p1 - ... - pK log2 pK
```

- Conditional Entropy

  **Conditional entropy** measures the entropy of the class variable, Y, given a value, v, for an attribute (i.e., question), X. It is defined as

  ```
  H(Y | X=v) = -Pr(Y=y1 | X=v) log2 Pr(Y=y1 | X=v) - ... - Pr(Y=yK | X=v) log2 Pr(Y=yK | X=v)
  ```

  Averaging over all possible values v1, ..., vm of X gives

  ```
  H(Y | X) = Pr(X=v1) H(Y | X=v1) + ... + Pr(X=vm) H(Y | X=vm)
  ```

  Note that this is called the Remainder value in the textbook.

- Information Gain

  Now, measure the **information gain**, or **mutual information**, of using a given attribute, X, with a class variable, Y. This measures the difference between the entropy at a node and the expected entropy after the node splits the examples based on a selected attribute's possible values. To do this, we need a measure of the information content after "splitting" a node's examples into its children based on a hypothesized attribute, and this is just what conditional entropy measures. So, information gain is defined as:

  ```
  I(Y; X) = H(Y) - H(Y | X)
  ```

  Note that information gain (aka mutual information) is symmetric since

  ```
  I(Y; X) = H(Y) - H(Y | X) = H(X) - H(X | Y) = I(X; Y)
  ```
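The entropy and information-gain formulas translate directly into Python. In this sketch (the function names are mine, not from the notes), an attribute is given as a list of values parallel to the list of class labels:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H(Y), in bits, for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """I(Y; X) = H(Y) - H(Y | X), where values[i] is attribute X's value
    on example i and labels[i] is that example's class."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    # H(Y | X) = sum over values v of Pr(X=v) * H(Y | X=v)  (the "Remainder")
    h_given_x = sum((len(ys) / n) * entropy(ys) for ys in groups.values())
    return entropy(labels) - h_given_x
```

Applied to the Color column of the training table that follows, `information_gain` returns about 0.54, agreeing with the worked computation up to rounding of the logarithms.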

| Example | Color | Shape | Size | Class |
|---|---|---|---|---|
| 1 | red | square | big | + |
| 2 | blue | square | big | + |
| 3 | red | round | small | - |
| 4 | green | square | small | - |
| 5 | red | round | big | + |
| 6 | green | square | big | - |

**Which is the best attribute for the root node of the decision tree?**

```
H(class) = H(3/6, 3/6) = 1

H(class | color) = 3/6 * H(2/3, 1/3) + 1/6 * H(1/1, 0/1) + 2/6 * H(0/2, 2/2)
```

(3 of the 6 examples are red, of which 2 are positive and 1 is negative; 1 of 6 is blue; 2 of 6 are green)

```
H(class | color) = 1/2 * (-2/3 log2 2/3 - 1/3 log2 1/3)
                   + 1/6 * (-1 log2 1 - 0 log2 0)
                   + 2/6 * (-0 log2 0 - 1 log2 1)
                 = 1/2 * (-2/3 (log2 2 - log2 3) - 1/3 (log2 1 - log2 3)) + 1/6 * 0 + 2/6 * 0
                 = 1/2 * (-2/3 (1 - 1.58) - 1/3 (0 - 1.58))
                 = 1/2 * 0.914
                 = 0.457

I(class; color) = H(class) - H(class | color) = 1.0 - 0.457 = 0.543

H(class | shape) = 4/6 * H(2/4, 2/4) + 2/6 * H(1/2, 1/2) = 4/6 * 1.0 + 2/6 * 1.0 = 1.0
I(class; shape) = H(class) - H(class | shape) = 1.0 - 1.0 = 0.0

H(class | size) = 4/6 * H(3/4, 1/4) + 2/6 * H(0/2, 2/2) = 0.541
I(class; size) = H(class) - H(class | size) = 1.0 - 0.541 = 0.459
```

Max(0.543, 0.0, 0.459) = 0.543, so color is the best attribute. Make the root node's attribute color and partition the examples for the resulting children nodes as shown:

```
         color
        /  |  \
      R/  G|   \B
      /    |    \
  [1,3,5] [4,6] [2]
   +,-,+   -,-   +
```

The children associated with values green and blue are uniform, containing only - and + examples, respectively, so make these children leaves with classifications - and +, respectively.

**What is the best attribute for the red child node?**

Now recurse on the red child node, which contains three examples, [1,3,5], and two remaining attributes, [shape, size].

```
H(class | shape) = 1/3 * H(1/1, 0/1) + 2/3 * H(1/2, 1/2) = 1/3 * 0 + 2/3 * 1 = 0.667
I(class; shape) = H(2/3, 1/3) - 0.667 = 0.914 - 0.667 = 0.247

H(class | size) = 2/3 * H(2/2, 0/2) + 1/3 * H(0/1, 1/1) = 2/3 * 0 + 1/3 * 0 = 0
I(class; size) = H(2/3, 1/3) - 0 = 0.914
```

Max(0.247, 0.914) = 0.914, so make size the attribute at this node. Its children are uniform in their classifications, so the final decision tree is:

```
         color
        /  |  \
      R/  G|   \B
      /    |    \
    size   -    +
    /  \
big/    \small
  /      \
 +        -
```
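As a sanity check, the whole worked example can be reproduced numerically. The script below uses exact logarithms, so values differ in the third decimal place from the hand computation's rounded log2 3 ≈ 1.58:

```python
import math
from collections import Counter, defaultdict

# The six training examples from the table: (color, shape, size, class)
EXAMPLES = [
    ("red",   "square", "big",   "+"),
    ("blue",  "square", "big",   "+"),
    ("red",   "round",  "small", "-"),
    ("green", "square", "small", "-"),
    ("red",   "round",  "big",   "+"),
    ("green", "square", "big",   "-"),
]
COLUMNS = {"color": 0, "shape": 1, "size": 2}

def entropy(labels):
    """H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(attr, examples):
    """I(class; attr) = H(class) - H(class | attr)."""
    labels = [e[3] for e in examples]
    groups = defaultdict(list)
    for e in examples:
        groups[e[COLUMNS[attr]]].append(e[3])
    remainder = sum((len(g) / len(examples)) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gains = {a: gain(a, EXAMPLES) for a in COLUMNS}
best = max(gains, key=gains.get)   # color wins, as in the hand computation
```

The exact gains come out to roughly 0.541 for color, 0.0 for shape, and 0.459 for size, so color is still the Max-Gain choice at the root.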

British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms. Replaced a rule-based expert system.

Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example.

**K Class Problems, where K > 2**

Let K be the number of classes and C1, ..., CK the names of the classes. Let Nm be the number of training examples at node m, and Nm,c the number of training examples at node m that are in class c. Define Pm,c = Nm,c / Nm, which is the fraction of examples at node m that are in class c. Then, the information content (aka entropy), I, at node m is computed by

```
I(Pm,C1, ..., Pm,CK) = - Σ_{c = C1, ..., CK}  Pm,c log2 Pm,c
```

Note: When K = 2, 0.0 <= I <= 1.0 but, in general, 0.0 <= I <= log2 K. The logarithm is always base 2 because entropy is a measure of the expected encoding length in bits.

Given an attribute A, let d be the number of possible values of A, with possible values v1, ..., vd. Let Nm,v be the number of examples at node m with attribute A's value = v, and Nm,v,c the number of examples at node m with attribute A's value = v and class = c. Define Pm,v = Nm,v / Nm and Pm,v,c = Nm,v,c / Nm,v. Then, we can finally define

```
Remainder(A) = - Σ_{v = v1, ..., vd}  Pm,v  Σ_{c = C1, ..., CK}  Pm,v,c log2 Pm,v,c

Gain(A) = I(Pm,C1, ..., Pm,CK) - Remainder(A)
```
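The K-class Remainder and Gain can be sketched in Python under the same representation assumption as before, an attribute as a list of values parallel to the class labels (function names are mine):

```python
import math
from collections import Counter, defaultdict

def entropy_k(labels):
    """I(Pm,C1, ..., Pm,CK): entropy in bits for any number of classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def remainder(values, labels):
    """Remainder(A): expected entropy after splitting on attribute A,
    where values[i] is A's value on example i."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    return sum((len(g) / n) * entropy_k(g) for g in groups.values())

def gain_k(values, labels):
    """Gain(A) = I(Pm,C1, ..., Pm,CK) - Remainder(A)."""
    return entropy_k(labels) - remainder(values, labels)
```

With four equally frequent classes, `entropy_k` returns log2 4 = 2 bits, illustrating the 0.0 <= I <= log2 K bound.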

**Real-valued data**

Select a set of thresholds defining intervals; each interval becomes a discrete value of the attribute.

**Noisy data and Overfitting**

There are many kinds of "noise" that could occur in the examples:

- Two examples have the same (attribute, value) pairs, but different classifications
- Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase
- The classification is wrong (e.g., + instead of -) because of some error
- Some attributes are irrelevant to the decision-making process. For example, the color of a die is irrelevant to its outcome.

The last problem, irrelevant attributes, can result in **overfitting** the training example data. For example, if the hypothesis space has many dimensions because there are a large number of attributes, then we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features. One fix is to prune lower nodes in the decision tree: for example, if the Gain of the best attribute at a node is below a threshold, stop and make this node a leaf rather than generating children nodes.

One way to address the overfitting problem in decision-tree induction is to use a tuning set in conjunction with a **pruning algorithm**. The following is a greedy algorithm for doing this:

```
bestTree = the tree produced by C5.0 on the TRAINING set
bestAccuracy = the accuracy of bestTree on the TUNING set
progressMade = true
while (progressMade)  // continue as long as there is improvement on the TUNING set
{
    progressMade = false
    currentTree = bestTree
    for each interior node N (including the root) in currentTree
    {
        // Consider various pruned versions of the current tree
        // and see if any are better than the best tree found so far
        prunedTree = a copy of currentTree, except replace N by a leaf node
                     whose label equals the majority class among the TRAINING
                     set examples that reached node N (break ties in favor of '-')
        newAccuracy = accuracy of prunedTree on the TUNING set
        // Is this pruned tree an improvement, based on the TUNING set?
        // When there is a tie, go with the smaller tree (Occam's Razor).
        if (newAccuracy >= bestAccuracy)
        {
            bestAccuracy = newAccuracy
            bestTree = prunedTree
            progressMade = true
        }
    }
}
return bestTree
```

**Generation of rules**

Each path from the root to a leaf corresponds to a rule in which the decisions leading to the leaf form the antecedent, and the classification at the leaf node is the consequent. For example, from the tree above we could generate the rule: `if color = red and size = big then +`. Constructing a rule for each path to a leaf yields an interpretation of what the tree means.
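With a tree represented as nested dicts (an assumption of this sketch, not the notes' notation), rule generation is a walk over root-to-leaf paths:

```python
def tree_to_rules(tree, conditions=()):
    """Yield (antecedent, consequent) pairs, one per root-to-leaf path.
    An internal node is {"attribute": name, "branches": {value: subtree}};
    a leaf is a bare class label such as "+" or "-"."""
    if not isinstance(tree, dict):  # leaf: emit the accumulated rule
        yield conditions, tree
        return
    for value, subtree in tree["branches"].items():
        yield from tree_to_rules(subtree,
                                 conditions + ((tree["attribute"], value),))

# The final tree from the worked example above:
tree = {"attribute": "color",
        "branches": {"green": "-",
                     "blue": "+",
                     "red": {"attribute": "size",
                             "branches": {"big": "+", "small": "-"}}}}
for antecedent, consequent in tree_to_rules(tree):
    print("if", " and ".join(f"{a} = {v}" for a, v in antecedent),
          "then", consequent)
```

This prints one rule per leaf, including `if color = red and size = big then +` from the text.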

**Setting Parameters**

Some learning algorithms require setting learning parameters. Parameters must be set without looking at the test data! One method: Tuning Sets.

**Using Tuning Sets for Parameter Setting**

- Partition the data into a Training set and a Test set. Then partition the Training set into a Train set and a Tune set.
- For each candidate parameter value, generate decision tree using the Train set
- Use Tune set to evaluate error rates and determine which parameter value is best
- Compute new decision tree using selected parameter values and entire Training set
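The four steps above can be sketched generically in Python. This is a sketch under stated assumptions: `learn` and `accuracy` are caller-supplied stand-ins for "generate decision tree" and "evaluate error rate", and the tune fraction is an arbitrary choice:

```python
import random

def select_parameter(candidates, training_data, learn, accuracy,
                     tune_fraction=0.25):
    """Pick the parameter value whose learned model scores best on a tuning
    set, then retrain on the entire Training set with that value.

    candidates    -- parameter values to try
    training_data -- the full Training set (the Test set is never touched here)
    learn(data, param)    -> model
    accuracy(model, data) -> fraction of data classified correctly
    """
    data = list(training_data)
    random.shuffle(data)
    split = int(len(data) * tune_fraction)
    tune, train = data[:split], data[split:]   # Training set = Train + Tune
    best = max(candidates, key=lambda p: accuracy(learn(train, p), tune))
    return learn(data, best), best             # retrain on the whole Training set
```

Note that the Test set never appears in this function; it is held out for the final performance estimate only.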

**Cross-Validation for Experimental Validation of Performance**

- Divide all examples into N disjoint subsets, E = E1, E2, ..., EN
- For each i = 1, ..., N do
  - Test set = Ei
  - Training set = E - Ei
  - Compute decision tree using the Training set
  - Determine performance accuracy Pi using the Test set
- Compute the N-fold cross-validation estimate of performance = (P1 + P2 + ... + PN) / N

One special case of interest, called **Leave-1-Out**, is N-fold cross-validation with N = the number of examples. It is good when the number of examples available is small (less than about 100).
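The cross-validation procedure can be sketched as follows (a minimal sketch; `learn` and `accuracy` again stand in for tree induction and evaluation, and the striped split is one simple way to form disjoint subsets):

```python
def cross_validation(examples, n_folds, learn, accuracy):
    """N-fold cross-validation estimate of performance.

    Splits examples into n_folds disjoint subsets; each subset serves once
    as the Test set while the remaining subsets form the Training set.
    With n_folds == len(examples) this is Leave-1-Out.
    learn(training_set) -> model;  accuracy(model, test_set) -> fraction correct
    """
    folds = [examples[i::n_folds] for i in range(n_folds)]  # disjoint subsets
    scores = []
    for i, test_set in enumerate(folds):
        training_set = [e for j, fold in enumerate(folds) if j != i
                          for e in fold]
        scores.append(accuracy(learn(training_set), test_set))
    return sum(scores) / n_folds                            # (P1 + ... + PN)/N
```

Each example lands in exactly one Test set over the N iterations, so every example is used for both training and testing, just never in the same iteration.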

- One of the most widely used learning methods in practice
- Can out-perform human experts in many problems
- Strengths: Fast; simple to implement; can convert result to a set of easily interpretable rules; empirically valid in many commercial products; handles noisy data
- Weaknesses: "Univariate" splits/partitioning using only one attribute at a time so limits types of possible trees; large decision trees may be hard to understand; requires fixed-length feature vectors; non-incremental (i.e., batch method).

Copyright © 2001-2003 by Charles R. Dyer. All rights reserved.