CS 540 | Lecture Notes | Fall 1996
        Critic <---------------- Sensors
          |                          |
          |                          |
          v                          v
  Learning Element <-----> Performance Element -----> Effectors
          |                      ^
          |                      |
          v                      |
   Problem Generator ------------/
We will concentrate on the Learning Element
Most common criterion is predictive accuracy
The number of attributes (also called features) is fixed (positive, finite). Each attribute has a fixed, finite number of possible values.
The problem with this approach is that it doesn't necessarily generalize well if the examples are not "clustered."
                      Color
                    /   |   \
                   /    |    \
             green/    red    \blue
                 /      |      \
              Size      +      Shape
              /  \             /    \
             /    \           /      \
         big/      \small round      \square
           /        \       /         \
          -          +    Size          -
                           /  \
                          /    \
                      big/      \small
                        /        \
                       -          +
function decision-tree-learning(examples, attributes, default)
  ;; examples   is a list of training examples
  ;; attributes is a list of candidate attributes for the current node
  ;; default    is the default value for a leaf node if there are no
  ;;            examples left
  if empty(examples) then return(default)
  if same-classification(examples) then return(class(examples))
  if empty(attributes) then return(majority-classification(examples))
  best = choose-attribute(attributes, examples)
  tree = new node with attribute best
  foreach value v of attribute best do
    v-examples = subset of examples with attribute best = v
    subtree = decision-tree-learning(v-examples,
                                     attributes - best,
                                     majority-classification(examples))
    add a branch from tree to subtree with arc labeled v
  return(tree)
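Below is a minimal Python sketch of this pseudocode (illustrative, not code from the course). Each example is assumed to be a dict mapping attribute names to values plus a "class" key, and choose_attribute is passed in as a function (a Max-Gain version is sketched later with the Gain definition); for simplicity it branches only on attribute values that actually occur in the given examples.

from collections import Counter

def majority_class(examples):
    """Most common class label among the examples."""
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, default, choose_attribute):
    if not examples:
        return default
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                      # all examples have the same class
        return classes.pop()
    if not attributes:                         # no attributes left to test
        return majority_class(examples)
    best = choose_attribute(attributes, examples)
    tree = {"attribute": best, "branches": {}}
    for v in {e[best] for e in examples}:      # branch on each observed value
        v_examples = [e for e in examples if e[best] == v]
        subtree = decision_tree_learning(
            v_examples,
            [a for a in attributes if a != best],
            majority_class(examples),
            choose_attribute)
        tree["branches"][v] = subtree
    return tree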
Some possibilities:
The ID3 algorithm uses the Max-Gain method of selecting the best attribute.
if x in P, then -log %P = log((p+n)/p) yes/no questions are needed to establish this, where p = |P|

if x in N, then -log %N = log((p+n)/n) yes/no questions are needed, where n = |N|

So, the expected number of questions that have to be asked to determine the class of an example in S is:

    %P (-log %P) + %N (-log %N)

or, equivalently,

    I(%P, %N) = - %P log %P - %N log %N

or, equivalently,

    I(P, N) = - (p/(p+n)) log (p/(p+n)) - (n/(p+n)) log (n/(p+n))
where %P = |P|/|S| = p/(p+n) (% positive examples in S), and %N = |N|/|S| = n/(p+n) (% negative examples in S).
I measures the information content in bits (i.e., number of yes/no questions that must be asked) associated with a set S of examples, which consists of the subset P of positive examples and subset N of negative examples.
Note: 0 <= I(P,N) <= 1, where 0 => no information, and 1 => maximum information.
Say half the examples in S are positive and half are negative. Hence, %P = %N = 1/2. So,
I(1/2, 1/2) = -1/2 log 1/2 - 1/2 log 1/2
            = -1/2 (log 1 - log 2) - 1/2 (log 1 - log 2)
            = -1/2 (0 - 1) - 1/2 (0 - 1)
            = 1/2 + 1/2
            = 1
            => information content is large
Say all of the examples in S are positive and none are negative. Then, %P = 1, and %N = 0. So,
I(1, 0) = -1 log 1 - 0 log 0
        = -0 - 0
        = 0
        => information content is low

Low information content is desirable in order to make the smallest tree, because low information content means that most of the examples are classified the SAME, and therefore we would expect that the rest of the tree rooted at this node will be quite small in order to differentiate between the two classifications.
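As a quick check of these two boundary cases, here is a small Python sketch (illustrative, not from the notes) of I(%P, %N) using base-2 logarithms and the convention that 0 log 0 = 0:

from math import log2

def information(p_frac, n_frac):
    """I(%P, %N) = -%P log2 %P - %N log2 %N, measured in bits."""
    return sum(-f * log2(f) for f in (p_frac, n_frac) if f > 0)   # 0 log 0 -> 0

print(information(0.5, 0.5))   # 1.0: half +, half -  => maximum information
print(information(1.0, 0.0))   # 0.0: all one class   => no information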
Now, measure the information gained by using a given attribute. That is, measure the difference in the information content of a node and the information content after a node splits up the examples based on a selected attribute's possible values. To do this, we need a measure of the information content after "splitting" a node's examples into its children based on a hypothesized attribute.
Given a node with a set of examples S = P union N, and a hypothesized attribute A that has m possible values, define Remainder(A) as the weighted sum of the information content of each subset of the examples associated with each child node, as determined by the possible values of the attribute. More specifically, let
Si  = subset of S with value i,  i = 1, ..., m
Pi  = subset of Si that are +
Ni  = subset of Si that are -
qi  = |Si|/|S|  = % of examples on branch i
%Pi = |Pi|/|Si| = % of + examples on branch i
%Ni = |Ni|/|Si| = % of - examples on branch i

                  m
                -----
                \
Remainder(A) =   >    qi I(%Pi, %Ni)
                /
                -----
                 i=1
So, Remainder(A) is a weighted sum of the information content at
each child node generated by that attribute. It measures the total
"disorder" or "inhomogeneity" of the children nodes.
0 <= Remainder(A) <= 1.
Now, measure the gain from using the attribute test at the current node, defined by:

    Gain(A) = I(%P, %N) - Remainder(A)
The best attribute at a node is now defined as the attribute A with maximum Gain(A) of all the possible attributes that can be used at the node. Since at a given node I(%P, %N) is constant, this is equivalent to selecting the attribute A with minimum Remainder(A).
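Here is a small Python sketch of Remainder(A), Gain(A), and Max-Gain attribute selection (illustrative names and data representation, not from the notes): each example is assumed to be a dict mapping attribute names to values plus a "class" key whose value is "+" or "-".

from math import log2

def information(p_frac, n_frac):
    """I(%P, %N) in bits, with 0 log 0 treated as 0."""
    return sum(-f * log2(f) for f in (p_frac, n_frac) if f > 0)

def class_fractions(examples):
    """Return (%P, %N) for a non-empty list of examples."""
    p = sum(1 for e in examples if e["class"] == "+")
    return p / len(examples), (len(examples) - p) / len(examples)

def remainder(attribute, examples):
    """Weighted sum of the information content of each child's examples."""
    total = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == v]
        q = len(subset) / len(examples)            # fraction of examples on branch v
        total += q * information(*class_fractions(subset))
    return total

def gain(attribute, examples):
    """Gain(A) = I(%P, %N) - Remainder(A)."""
    return information(*class_fractions(examples)) - remainder(attribute, examples)

def choose_attribute(attributes, examples):
    """Max-Gain selection, as in ID3."""
    return max(attributes, key=lambda a: gain(a, examples))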
Example | Color | Shape | Size | Class |
---|---|---|---|---|
1 | red | square | big | + |
2 | blue | square | big | + |
3 | red | round | small | - |
4 | green | square | small | - |
5 | red | round | big | + |
6 | green | square | big | - |
Remainder(color) = 3/6 I(2/3, 1/3) + 1/6 I(1/1, 0/1) + 2/6 I(0/2, 2/2)
                   (3 of 6 are red, of which 2 are + and 1 is -;
                    1 of 6 is blue; 2 of 6 are green)
                 = 1/2 * (-2/3 log 2/3 - 1/3 log 1/3)
                   + 1/6 * (-1 log 1 - 0 log 0) + 2/6 * (-0 log 0 - 1 log 1)
                 = 1/2 * (-2/3(log 2 - log 3) - 1/3(log 1 - log 3)) + 1/6 * 0 + 2/6 * 0
                 = 1/2 * (-2/3(1 - 1.58) - 1/3(0 - 1.58))
                 = 1/2 * 0.914
                 = 0.457

Gain(color) = I(3/6, 3/6) - Remainder(color) = 1.0 - 0.457 = 0.543

Remainder(shape) = 4/6 I(2/4, 2/4) + 2/6 I(1/2, 1/2)
                   (4 of 6 are square, 2 + and 2 -; 2 of 6 are round, 1 + and 1 -)
                 = 4/6 * 1.0 + 2/6 * 1.0
                 = 1.0

Gain(shape) = I(3/6, 3/6) - Remainder(shape) = 1.0 - 1.0 = 0.0

Remainder(size) = 4/6 I(3/4, 1/4) + 2/6 I(0/2, 2/2) = 0.541

Gain(size) = I(3/6, 3/6) - Remainder(size) = 1.0 - 0.541 = 0.459
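For concreteness, the same calculation can be run through the gain() and choose_attribute() sketches above (assumed to be in scope; this block is illustrative, not part of the original notes). Exact base-2 logs give 0.541 for color rather than 0.543, since the hand calculation rounds log 3 to 1.58.

# The six training examples from the table above.
examples = [
    {"Color": "red",   "Shape": "square", "Size": "big",   "class": "+"},
    {"Color": "blue",  "Shape": "square", "Size": "big",   "class": "+"},
    {"Color": "red",   "Shape": "round",  "Size": "small", "class": "-"},
    {"Color": "green", "Shape": "square", "Size": "small", "class": "-"},
    {"Color": "red",   "Shape": "round",  "Size": "big",   "class": "+"},
    {"Color": "green", "Shape": "square", "Size": "big",   "class": "-"},
]

for a in ["Color", "Shape", "Size"]:
    print(a, round(gain(a, examples), 3))
# Color 0.541
# Shape 0.0
# Size  0.459

print(choose_attribute(["Color", "Shape", "Size"], examples))   # Color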
Max(.543, 0.0, .459) = .543, so color is best. Make the root node's attribute color and partition the examples for the resulting child nodes as shown:
             color
            /  |  \
           /   |   \
         R/   G|    \B
         /     |     \
    [1,3,5]  [4,6]   [2]
     +,-,+    -,-     +

The children associated with values green and blue are uniform, containing only - and + examples, respectively. So make these children leaves with classifications - and +, respectively.
Now recurse on red child node, containing three examples, [1,3,5], and two remaining attributes, [shape, size].
Remainder(shape) = 1/3 I(1/1, 0/1) + 2/3 I(1/2, 1/2)
                 = 1/3 * 0 + 2/3 * 1
                 = 0.667

Gain(shape) = I(2/3, 1/3) - .667 = .914 - .667 = 0.247

Remainder(size) = 2/3 I(2/2, 0/2) + 1/3 I(0/1, 1/1)
                = 2/3 * 0 + 1/3 * 0
                = 0

Gain(size) = I(2/3, 1/3) - 0 = 0.914
Max(.247, .914) = .914, so make size the attribute at this node. Its children are uniform in their classifications, so the final decision tree is:
              color
             /  |  \
            /   |   \
          R/   G|    \B
          /     |     \
       size     -      +
       /  \
      /    \
  big/      \small
    /        \
   +          -
British Petroleum designed a decision tree for gas-oil separation on offshore oil platforms; it replaced a rule-based expert system.
A flight controller for flying a Cessna airplane was learned using 90,000 examples with 20 attributes per example.
The last problem, irrelevant attributes, can result in overfitting the training example data. For example, if the hypothesis space has many dimensions because there are a large number of attributes, then we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features. Fix this by pruning lower nodes in the decision tree: for example, if the Gain of the best attribute at a node is below a threshold, stop and make this node a leaf rather than generating child nodes.
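A minimal sketch of that stopping rule, assuming the gain() and majority_class() helpers from the earlier sketches; the min_gain threshold value here is purely illustrative:

def choose_attribute_pruned(attributes, examples, min_gain=0.1):
    """Max-Gain selection that refuses to split when the best gain is below
    the threshold; returning None signals 'make this node a leaf'."""
    best = max(attributes, key=lambda a: gain(a, examples))
    if gain(best, examples) < min_gain:
        return None
    return best

# In decision_tree_learning, a None result from the chooser would then be
# handled by returning majority_class(examples) instead of splitting.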
Constructing a rule for each path from the root to a leaf yields an interpretation of what the tree means.
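For example, here is a short sketch that reads one rule off each root-to-leaf path of a tree in the nested-dict representation used in the earlier sketch (leaves are plain class labels); the representation and names are illustrative:

def tree_to_rules(tree, conditions=()):
    """Return one 'if ... then class = ...' string per root-to-leaf path."""
    if not isinstance(tree, dict):                       # leaf node
        lhs = " and ".join(f"{a} = {v}" for a, v in conditions) or "true"
        return [f"if {lhs} then class = {tree}"]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attribute"], value),))
    return rules

# For the final tree above, this would yield rules such as:
#   if color = red and size = big then class = +
#   if color = green then class = -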
Using Tuning Sets for Parameter Setting
One special case of interest, called Leave-1-Out, is N-fold cross-validation with N = the number of examples. It is good when the number of examples available is small (less than about 100).
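A sketch of the Leave-1-Out loop, with train() and classify() as assumed stand-ins for the learner (for instance, the decision-tree sketch above) and its prediction routine, and examples represented as dicts with a "class" key as before:

def leave_one_out_accuracy(examples, train, classify):
    """Train on all examples but one, test on the held-out example, and
    average the results over every choice of held-out example."""
    correct = 0
    for i, held_out in enumerate(examples):
        training = examples[:i] + examples[i + 1:]     # all but example i
        model = train(training)
        if classify(model, held_out) == held_out["class"]:
            correct += 1
    return correct / len(examples)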
Last modified November 8, 1996
Copyright © 1996 by Charles R. Dyer. All rights reserved.