# Decision Tree

📗 Another simple supervised learning algorithm is decision trees.
➩ Find the feature that is the most informative.
➩ Split the training set into subsets based on this feature.
➩ Repeat on each of the subsets recursively until all features or labels in the subset are the same (see the sketch below).
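📗 A minimal sketch of this recursion in Python (the names build_tree, majority_label, and the score parameter are illustrative, not from the notes; score stands in for any measure of how informative a feature is, such as the information gain defined later):
```python
from collections import Counter

def majority_label(labels):
    # Most common label in a subset; used to label the leaves.
    return Counter(labels).most_common(1)[0][0]

def build_tree(items, features, score):
    """items: list of (feature_dict, label) pairs; features: list of feature names;
    score(items, feature): how informative a feature is (e.g. information gain)."""
    labels = [y for _, y in items]
    # Stop when every label in the subset is the same or no features are left to split on.
    if len(set(labels)) == 1 or not features:
        return {"leaf": majority_label(labels)}
    # Split on the most informative feature.
    best = max(features, key=lambda f: score(items, f))
    remaining = [f for f in features if f != best]
    children = {}
    for value in set(x[best] for x, _ in items):
        subset = [(x, y) for x, y in items if x[best] == value]
        children[value] = build_tree(subset, remaining, score)
    return {"split_on": best, "majority": majority_label(labels), "children": children}
```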
TopHat Discussion ID:
📗 [1 point] Find the thresholds for the decision tree that iteratively optimize (i) the total number of mistakes and (ii) the information gain. Move the black circle to the threshold for the first feature, and move the two black squares to the thresholds for the second feature.

(Interactive demo: the total number of mistakes and the information gain for each feature are displayed below the diagram.)
TopHat Discussion ID:
📗 [1 point] Find the threshold for the decision stump that maximizes the information gain. Move the black circle to the threshold.

(Interactive demo: the information gain is displayed below the diagram.)



# Measure of Uncertainty

📗 Let \(p_{0}\) be the fraction of items in a training set with label \(0\) and \(p_{1}\) be the fraction of items with label \(1\).
➩ If \(p_{0} = 0, p_{1} = 1\), the outcome is certain, so there is no uncertainty, the measure of uncertainty should be 0.
➩ If \(p_{0} = 1, p_{1} = 0\), the outcome is certain, so there is no uncertainty, the measure of uncertainty should be 0.
➩ If \(p_{0} = \dfrac{1}{2} , p_{1} = \dfrac{1}{2}\), the outcome is the most uncertain, so the measure of uncertainty should be at its maximum value, for example \(1\).
📗 One measure of uncertainty that satisfies the above conditions is entropy: \(H = p_{0} \log_{2} \left(\dfrac{1}{p_{0}}\right) + p_{1} \log_{2} \left(\dfrac{1}{p_{1}}\right)\), or equivalently \(H = -p_{0} \log_{2}\left(p_{0}\right) - p_{1} \log_{2}\left(p_{1}\right)\), with the convention that \(0 \log_{2}\left(0\right) = 0\).
📗 The realized value of something uncertain is more informative than the value of something certain.
Example
📗 Plot the entropy \(H = -p \log_{2}\left(p\right) - \left(1 - p\right) \log_{2}\left(1 - p\right)\) as a function of \(p = p_{1}\) from \(0\) to \(1\).
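📗 A short sketch of this plot in Python (assuming numpy and matplotlib are available); the curve should be \(0\) at \(p = 0\) and \(p = 1\) and reach its maximum of \(1\) at \(p = \dfrac{1}{2}\):
```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)              # avoid log(0) at the endpoints
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)  # binary entropy in bits

plt.plot(p, H)
plt.xlabel("p (fraction of items with label 1)")
plt.ylabel("entropy H(p)")
plt.title("Binary entropy: 0 when certain, 1 at p = 1/2")
plt.show()
```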




# Entropy

📗 In general, if there are K classes (\(y \in \left\{1, 2, ..., K\right\}\)), and \(p_{y}\) is the fraction of the training set with label \(y\), then the entropy of \(y\) is \(H\left(y\right) = -p_{1} \log_{2}\left(p_{1}\right) -p_{2} \log_{2}\left(p_{2}\right) - ... -p_{K} \log_{2}\left(p_{K}\right)\).
📗 Conditional entropy is the expected entropy of the conditional distribution: \(H\left(y|x\right) = q_{1} H\left(y|x=1\right) + q_{2} H\left(y|x=2\right) + ... + q_{K_{x}} H\left(y|x=K_{x}\right)\), where \(K_{x}\) is the number of possible values of the feature \(x\) and \(q_{k}\) is the fraction of training data with feature value \(x = k\). Here \(H\left(y|x=k\right) = -p_{y=1|x=k} \log_{2}\left(p_{y=1|x=k}\right) -p_{y=2|x=k} \log_{2}\left(p_{y=2|x=k}\right) - ... -p_{y=K|x=k} \log_{2}\left(p_{y=K|x=k}\right)\), where \(p_{y=c|x=k}\) is the fraction of training data with label \(c\) among the items with feature value \(x = k\).
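📗 A minimal sketch of these two formulas in Python, on plain lists of labels and feature values (the function names and the small example data are illustrative):
```python
import math
from collections import Counter

def entropy(labels):
    # H(y) = -sum over classes of p_y * log2(p_y), with 0 log2 0 treated as 0.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, feature):
    # H(y|x) = sum over values k of q_k * H(y | x = k), where q_k is the fraction with x = k.
    n = len(labels)
    by_value = {}
    for x, y in zip(feature, labels):
        by_value.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in by_value.values())

# Tiny example: 6 items, binary labels, one binary feature.
y = [0, 0, 1, 1, 1, 1]
x = [0, 0, 0, 1, 1, 1]
print(entropy(y))                 # about 0.918
print(conditional_entropy(y, x))  # about 0.459
```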



# Information Gain

📗 The information gain from a feature \(x\) is defined as the difference between the entropy of the label and the conditional entropy of the label given that feature: \(I\left(y|x\right) = H\left(y\right) - H\left(y|x\right)\).
📗 The larger the information gain, the larger the reduction in uncertainty, and the better predictor the feature is.
📗 A decision tree iteratively splits the training set based on the feature with the largest information gain. This algorithm is called ID3 (Iterative Dichotomiser 3): Wikipedia.
➩ Find the feature \(j\) so that \(I\left(y | x_{j}\right)\) is the largest.
➩ Split the training set into the subsets with \(x_{ij} = 1\), \(x_{ij} = 2\), ..., \(x_{ij} = K_{j}\).
➩ Repeat the process on each of the subsets to create a tree.
📗 For continuous features, construct all possible splits and find the one that yields the largest information gain: this is the same as creating new variables \(z_{ij} = 1\) if \(x_{ij} \leq t_{j}\) and \(z_{ij} = 0\) if \(x_{ij} > t_{j}\).
➩ In practice, the efficient way to create the binary splits is to use only the midpoints between consecutive items (sorted by the feature) that have different labels, as in the sketch below.
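📗 A sketch of this threshold search for a single continuous feature, reusing the entropy and conditional_entropy helpers from the earlier sketch (the function names and example data are illustrative):
```python
def information_gain(labels, feature):
    # I(y|x) = H(y) - H(y|x) for a discrete (here, already binarized) feature.
    return entropy(labels) - conditional_entropy(labels, feature)

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing the information gain for one continuous feature."""
    pairs = sorted(zip(values, labels))
    # Candidate thresholds: midpoints between consecutive sorted values with different labels.
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1)
                  if pairs[i][1] != pairs[i + 1][1]]
    best_t, best_gain = None, -1.0
    for t in candidates:
        z = [1 if v <= t else 0 for v in values]   # binarized feature z = 1 if x <= t else 0
        gain = information_gain(labels, z)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Example: a perfectly separable feature gives gain equal to H(y) = 1 bit.
print(best_threshold([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))  # (2.5, 1.0)
```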
TopHat Quiz (Past Exam Question) ID:
📗 [4 points] "It" has a house with many doors. A random door is about to be opened with equal probability. Doors to have monsters that eat people. Doors to are safe. With sufficient bribe, Pennywise will answer your question "Will door 1 be opened?" What's the information gain (also called mutual information) between Pennywise's answer and your encounter with a monster?
📗 Answer: .




# Pruning

📗 Decision trees can be pruned by replacing a subtree with a leaf when the accuracy on a validation set with the leaf is equal to or higher than the accuracy with the subtree. This method is called Reduced Error Pruning: Wikipedia.
➩ A validation set is a subset of the training set that is set aside when training the decision tree and only used for pruning the tree.
➩ The items used to train the decision tree cannot be used to prune the tree.
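📗 A rough sketch of reduced error pruning on the dictionary-shaped trees from the build_tree sketch earlier, which stores the majority training label of each internal node under the key "majority" (all names are illustrative; the demo below may traverse the nodes in a different order):
```python
def predict(tree, x):
    # Walk down the tree until a leaf is reached; unseen feature values fall back to the majority label.
    while "leaf" not in tree:
        tree = tree["children"].get(x[tree["split_on"]], {"leaf": tree["majority"]})
    return tree["leaf"]

def accuracy(tree, validation):
    return sum(predict(tree, x) == y for x, y in validation) / len(validation)

def prune(tree, validation):
    """validation: the list of (feature_dict, label) validation items that reach this node."""
    if "leaf" in tree or not validation:
        return tree
    # Route the validation items to the children and prune the subtrees bottom-up first.
    for value, child in tree["children"].items():
        subset = [(x, y) for x, y in validation if x[tree["split_on"]] == value]
        tree["children"][value] = prune(child, subset)
    # Replace this subtree with a leaf if the leaf does at least as well on the validation items.
    leaf = {"leaf": tree["majority"]}
    if accuracy(leaf, validation) >= accuracy(tree, validation):
        return leaf
    return tree
```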
TopHat Quiz ID:
📗 [1 point] Given the following decision tree, click on each node to check the validation set accuracy when it is pruned (the node is replaced by a leaf with the majority label of the training subset for that node), and find the decision tree after pruning is done. (Interactive demo: the validation set accuracy is displayed below the diagram.)




# Random Forest

📗 Smaller training sets can be created by sampling from the complete training set, and different decision trees can be trained on these smaller training sets (and only using a subset of the features). This is called bagging (or Bootstrap AGGregatING): Link, Wikipedia.
➩ Training items are sampled with replacement.
➩ Features are sampled without replacement.
📗 The label of a new item can be predicted based on the majority vote from the decision trees trained on these smaller training sets. These trees form a random forest: Wikipedia.
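📗 A minimal sketch of bagging, reusing the build_tree and predict sketches from earlier (the number of trees, the feature-subset size, and the score argument are illustrative choices, not prescribed by the notes):
```python
import random
from collections import Counter

def bagging(items, features, score, n_trees=10, n_features=2):
    forest = []
    for _ in range(n_trees):
        # Sample training items with replacement (a bootstrap sample of the same size).
        sample = [random.choice(items) for _ in range(len(items))]
        # Sample a subset of the features without replacement.
        feats = random.sample(features, min(n_features, len(features)))
        forest.append(build_tree(sample, feats, score))
    return forest

def forest_predict(forest, x):
    # Predict by majority vote over the trees in the random forest.
    votes = [predict(tree, x) for tree in forest]
    return Counter(votes).most_common(1)[0][0]
```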



# Adaptive Boosting

📗 Decision trees can also be trained sequentially. The items that are classified incorrectly by the previous trees are made more important when training the next decision tree.
📗 Each training item has a weight representing how important it is when training each decision tree, and the weights can be updated based on the errors made by the previous decision trees. This is called AdaBoost (ADAptive BOOSTing): Wikipedia.
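📗 A minimal sketch of one round of the AdaBoost weight update (assuming labels and predictions are coded as \(-1\) and \(+1\); the exact convention used in lecture may differ, and the names here are illustrative):
```python
import math

def adaboost_update(weights, labels, predictions):
    """One AdaBoost round: given the current item weights, the true labels, and one
    tree's predictions (labels and predictions in {-1, +1}), return the tree's vote
    weight alpha and the updated, re-normalized item weights."""
    # Weighted error of this tree.
    err = sum(w for w, y, p in zip(weights, labels, predictions) if p != y)
    alpha = 0.5 * math.log((1 - err) / err)   # how much this tree's vote counts
    # Increase the weights of misclassified items, decrease the others, then normalize.
    new_w = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, labels, predictions)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

# Example: 4 items with equal weights; the tree misclassifies only the last item.
alpha, w = adaboost_update([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
print(alpha, w)   # the misclassified item's weight grows to 0.5
```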



📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.






Last Updated: November 21, 2024 at 3:16 AM