📗 Another simple supervised learning algorithm is the decision tree.
➩ Find the feature that is the most informative.
➩ Split the training set into subsets based on this feature.
➩ Repeat on each of the subsets recursively until all features or all labels in the subset are the same.
In-class Discussion
📗 5 kids are wearing either green or red hats at a party: they can see every other kid's hat but not their own.
➩ Dad said to everyone: at least one of you is wearing a green hat.
➩ Dad asked everyone: do you know the color of your hat?
➩ Everyone said no.
➩ Dad asked everyone: do you know the color of your hat?
➩ Everyone said no.
➩ Dad asked everyone: do you know the color of your hat?
➩ n kids said yes (no one lied).
📗 How many kids are wearing green hats?
A. 1
B. 2
C. 3
D. 4
E. 5
[Note] Why?
In-class Discussion
📗 [1 point] Find the thresholds for the decision tree that iteratively optimize (i) the total number of mistakes and (ii) the information gain. Move the black circle to the threshold for the first feature, and move the two black squares to the thresholds for the second feature.
Total number of mistakes: ???
Information gain for the first feature (???): ???
Information gain for the second feature (???): ???
Information gain for the second feature (???): ???
[Note] How to find the best splits?
In-class Discussion
📗 [1 point] Find the threshold for the decision stump that maximizes the information gain. Move the black circle to the threshold.
📗 Let \(p_{0}\) be the fraction of items in a training set with label \(0\) and \(p_{1}\) be the fraction of items with label \(1\).
➩ If \(p_{0} = 0, p_{1} = 1\), the outcome is certain, so the measure of uncertainty should be 0.
➩ If \(p_{0} = 1, p_{1} = 0\), the outcome is certain, so the measure of uncertainty should be 0.
➩ If \(p_{0} = \dfrac{1}{2} , p_{1} = \dfrac{1}{2}\), the outcome is the most uncertain, so the measure of uncertainty should be at its maximum value, for example \(1\).
📗 One measure of uncertainty that satisfies the above conditions is entropy: \(H = p_{0} \log_{2} \left(\dfrac{1}{p_{0}}\right) + p_{1} \log_{2} \left(\dfrac{1}{p_{1}}\right)\) or \(H = -p_{0} \log_{2}\left(p_{0}\right) - p_{1} \log_{2}\left(p_{1}\right)\), with the convention \(0 \log_{2}\left(0\right) = 0\).
📗 The realized value of something uncertain is more informative than the value of something certain.
📗 In general, if there are \(K\) classes (\(y \in \left\{1, 2, ..., K\right\}\)), and \(p_{y}\) is the fraction of the training set with label \(y\), then the entropy of \(y\) is \(H\left(y\right) = -p_{1} \log_{2}\left(p_{1}\right) -p_{2} \log_{2}\left(p_{2}\right) - ... -p_{K} \log_{2}\left(p_{K}\right)\).
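📗 The entropy formula can be checked numerically. The following is a minimal sketch, assuming the labels are given as a Python list; the function name `entropy` is only illustrative and not part of the course code.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the empirical distribution of `labels`."""
    n = len(labels)
    counts = Counter(labels)
    # Labels that never occur are absent from `counts`, which matches the
    # convention 0 * log2(0) = 0.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(entropy([1, 1, 1, 1]))  # 0.0 (certain outcome)
print(entropy([0, 1, 0, 1]))  # 1.0 (two equally likely labels)
print(entropy([1, 2, 3, 4]))  # 2.0 (four equally likely labels)
```

With \(K\) equally likely classes the entropy is \(\log_{2}\left(K\right)\), so the maximum is 1 only in the binary case.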
📗 Conditional entropy is the weighted average of the entropies of the conditional distributions: \(H\left(y|x\right) = q_{1} H\left(y|x=1\right) + q_{2} H\left(y|x=2\right) + ... + q_{K_{x}} H\left(y|x=K_{x}\right)\), where \(K_{x}\) is the number of possible values of the feature \(x\) and \(q_{k}\) is the fraction of training data with feature value \(x = k\). Here \(H\left(y|x=k\right) = -p_{y=1|x=k} \log_{2}\left(p_{y=1|x=k}\right) -p_{y=2|x=k} \log_{2}\left(p_{y=2|x=k}\right) - ... -p_{y=K|x=k} \log_{2}\left(p_{y=K|x=k}\right)\), where \(p_{y|x=k}\) is the fraction of training data with label \(y\) among the items with feature value \(x = k\).
📗 The information gain from a feature \(x\) is defined as the difference between the entropy of the label and the conditional entropy of the label given that feature: \(I\left(y|x\right) = H\left(y\right) - H\left(y|x\right)\).
📗 The larger the information gain, the larger the reduction in uncertainty, and the better predictor the feature is.
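📗 The definitions of conditional entropy and information gain can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the feature is assumed to be categorical and given as a list aligned with the labels.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(y|x): entropy of the labels within each feature value, weighted by q_k."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(ys)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    """I(y|x) = H(y) - H(y|x)."""
    return entropy(ys) - conditional_entropy(xs, ys)

ys = [0, 0, 1, 1]
print(information_gain([1, 1, 2, 2], ys))  # 1.0: the feature determines the label
print(information_gain([1, 2, 1, 2], ys))  # 0.0: the feature says nothing about the label
```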
📗 A decision tree iteratively splits the training set based on the feature with the largest information gain. This algorithm is called ID3 (Iterative Dichotomiser 3): Wikipedia.
➩ Find feature \(j\) so that \(I\left(y | x_{ij}\right)\) is the largest.
➩ Split the training set into the set with \(x_{ij} = 1\), \(x_{ij} = 2\), ..., \(x_{ij} = K_{j}\).
➩ Repeat the process on each of the subsets to create a tree.
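📗 The three steps above can be sketched as a short recursive function. This is a simplified illustration for categorical features, not the course's reference implementation: the nested-dictionary tree format, the `default` majority label used for unseen feature values, and the function names are all assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def info_gain(rows, labels, j):
    """Information gain of feature j: H(y) - H(y | x_j)."""
    n = len(labels)
    cond = 0.0
    for v in {r[j] for r in rows}:
        subset = [y for r, y in zip(rows, labels) if r[j] == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def id3(rows, labels):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when all labels or all feature vectors in the subset are the same.
    if len(set(labels)) == 1 or all(r == rows[0] for r in rows):
        return majority
    # Only features that still vary in this subset can split it further.
    candidates = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]
    best = max(candidates, key=lambda j: info_gain(rows, labels, j))
    children = {}
    for v in {r[best] for r in rows}:
        keep = [i for i, r in enumerate(rows) if r[best] == v]
        children[v] = id3([rows[i] for i in keep], [labels[i] for i in keep])
    return {"feature": best, "children": children, "default": majority}

def predict(tree, row):
    # Walk down until a leaf (a plain label) is reached.
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree

rows = [(1, 1), (1, 2), (2, 1), (2, 2)]
tree = id3(rows, [0, 0, 1, 1])
print(tree)                   # splits on feature 0 only
print(predict(tree, (2, 1)))  # 1
```

Restricting the candidates to features that still vary guarantees that every split produces strictly smaller subsets, so the recursion terminates.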
📗 For continuous features, construct all possible splits and find the one that yields the largest information gain: this is the same as creating new variables \(z_{ij} = 1\) if \(x_{ij} \leq t_{j}\) and \(z_{ij} = 0\) if \(x_{ij} > t_{j}\).
➩ In practice, it is enough to consider the midpoints between consecutive items (sorted by the feature value) that have different labels.
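📗 The search for the best threshold of one continuous feature can be sketched as follows. This is a minimal illustration with hypothetical function names; it only tries midpoints between consecutive distinct feature values, and skips a midpoint when both neighboring values carry the same single label.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (t, gain) for the split x <= t with the largest information gain."""
    labels_at = defaultdict(set)
    for x, y in zip(xs, ys):
        labels_at[x].add(y)
    values = sorted(labels_at)
    n, base = len(ys), entropy(ys)
    best_t, best_gain = None, 0.0
    for lo, hi in zip(values, values[1:]):
        # A midpoint between two values with the same single label is skipped.
        if len(labels_at[lo]) == 1 and labels_at[lo] == labels_at[hi]:
            continue
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

print(best_threshold([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))  # (2.5, 1.0)
```

If no split improves on the entropy of the labels, the sketch returns `(None, 0.0)`.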
In-class Quiz
📗 [4 points] "It" has a house with many doors. A random door is about to be opened with equal probability. Doors to have monsters that eat people. Doors to are safe. With sufficient bribe, Pennywise will answer your question "Will door 1 be opened?" What's the information gain (also called mutual information) between Pennywise's answer and your encounter with a monster?
📗 Answer: .
[Note] Use the space to explain the steps or just take notes:
📗 Decision trees can be pruned by replacing a subtree by a leaf when the accuracy on a validation set with the leaf is equal to or higher than the accuracy with the subtree. This method is called Reduced Error Pruning: Wikipedia.
➩ A validation set is a subset of the training set that is set aside when training the decision tree and only used for pruning the tree.
➩ The items used to train the decision tree cannot be used to prune the tree.
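📗 Reduced Error Pruning can be sketched on a small hand-written tree. The nested-dictionary tree format below (a leaf is a plain label, `default` stores the majority training label at a node) and the bottom-up order are assumptions made for this illustration. Replacing a node only changes the predictions for validation items that reach it, so comparing the number of correct predictions on those items is equivalent to comparing the validation accuracy of the whole tree.

```python
def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree

def num_correct(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels))

def prune(tree, val_rows, val_labels):
    """Prune bottom-up using only the validation items that reach each node."""
    if not isinstance(tree, dict):
        return tree
    j = tree["feature"]
    children = {}
    for v, child in tree["children"].items():
        keep = [i for i, r in enumerate(val_rows) if r[j] == v]
        children[v] = prune(child, [val_rows[i] for i in keep],
                            [val_labels[i] for i in keep])
    subtree = {"feature": j, "children": children, "default": tree["default"]}
    leaf = tree["default"]  # majority label of the training subset at this node
    # Replace the subtree when the leaf is at least as accurate.
    if num_correct(leaf, val_rows, val_labels) >= num_correct(subtree, val_rows, val_labels):
        return leaf
    return subtree

# Hypothetical tree and validation set.
tree = {"feature": 0, "default": 0, "children": {
    1: 0,
    2: {"feature": 1, "default": 1, "children": {1: 1, 2: 0}},
}}
val_rows = [(1, 1), (2, 1), (2, 2), (2, 2)]
val_labels = [0, 1, 1, 1]
print(prune(tree, val_rows, val_labels))
# {'feature': 0, 'children': {1: 0, 2: 1}, 'default': 0}
```

In this example the inner subtree gets 1 of its 3 validation items right while the leaf gets all 3, so it is pruned; the root is kept because a single leaf would only get 1 of 4 right.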
In-class Quiz
📗 [1 point] Given the following decision tree, click on each node to check the validation set accuracy when it is pruned (the node is replaced by a leaf with the majority label of the training subset for that node), and find the decision tree after pruning is done.
📗 Smaller training sets can be created by sampling from the complete training set, and different decision trees can be trained on these smaller training sets (and only using a subset of the features). This is called bagging (or Bootstrap AGGregatING): Link, Wikipedia.
➩ Training items are sampled with replacement.
➩ Features are sampled without replacement.
📗 The label of a new item can be predicted based on the majority vote from the decision trees trained on these smaller training sets. These trees form a random forest: Wikipedia.
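📗 Bagging with feature sampling and majority voting can be sketched as follows. To keep the example short, the base learner is a hypothetical one-level tree (`train_stump`) on categorical features rather than a full decision tree; the function names, the `n_samples` and `n_features` parameters, and the toy data are all assumptions.

```python
import random
from collections import Counter

def train_stump(rows, labels):
    """Hypothetical base learner: pick the feature with the fewest training mistakes."""
    default = Counter(labels).most_common(1)[0][0]
    best = None
    for j in range(len(rows[0])):
        votes = {}
        for r, y in zip(rows, labels):
            votes.setdefault(r[j], Counter())[y] += 1
        table = {v: c.most_common(1)[0][0] for v, c in votes.items()}
        correct = sum(table[r[j]] == y for r, y in zip(rows, labels))
        if best is None or correct > best[0]:
            best = (correct, j, table)
    _, j, table = best
    return lambda row: table.get(row[j], default)

def train_forest(rows, labels, n_trees, n_samples, n_features, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Training items are sampled with replacement.
        idx = [rng.randrange(len(rows)) for _ in range(n_samples)]
        # Features are sampled without replacement.
        feats = rng.sample(range(len(rows[0])), n_features)
        sub_rows = [tuple(rows[i][j] for j in feats) for i in idx]
        model = train_stump(sub_rows, [labels[i] for i in idx])
        forest.append((feats, model))
    return forest

def predict_forest(forest, row):
    # Each tree only sees the features it was trained on; the majority label wins.
    votes = Counter(model(tuple(row[j] for j in feats)) for feats, model in forest)
    return votes.most_common(1)[0][0]

rows = [(0, 0), (1, 1)] * 5
labels = [0, 1] * 5
forest = train_forest(rows, labels, n_trees=25, n_samples=10, n_features=1)
print(predict_forest(forest, (1, 1)))
```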
📗 Decision trees can also be trained sequentially. The items that are classified incorrectly by the previous trees are made more important when training the next decision tree.
📗 Each training item has a weight representing how important it is when training each decision tree, and the weights can be updated based on the errors made by the previous decision trees. This is called AdaBoost (ADAptive BOOSTing): Wikipedia.
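📗 The notes do not state the weight update, so the sketch below assumes the standard binary AdaBoost rule: with weighted error \(\varepsilon\), the tree gets the vote \(\alpha = \dfrac{1}{2} \ln\left(\dfrac{1 - \varepsilon}{\varepsilon}\right)\), the weights of misclassified items are multiplied by \(e^{\alpha}\), the weights of correctly classified items by \(e^{-\alpha}\), and the weights are then normalized. The function name and the example data are hypothetical.

```python
import math

def adaboost_update(weights, y_true, y_pred):
    """One round of the (assumed) standard AdaBoost weight update."""
    total = sum(weights)
    eps = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p) / total
    # Keep eps away from 0 and 1 so the logarithm stays finite.
    eps = min(max(eps, 1e-12), 1 - 1e-12)
    alpha = 0.5 * math.log((1 - eps) / eps)
    new = [w * math.exp(alpha if y != p else -alpha)
           for w, y, p in zip(weights, y_true, y_pred)]
    norm = sum(new)
    return [w / norm for w in new], alpha

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # the previous tree misclassifies the last item
new_weights, alpha = adaboost_update(weights, y_true, y_pred)
print(new_weights, alpha)
```

After this update the misclassified item carries half of the total weight, so the next tree is pushed to classify it correctly.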
📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail and use email only to the course instructor (not TAs) for grading issues.
Additional In-class Discussion
📗 Sometimes a question that is not in the notes will be asked during the lecture. You can submit your answer here:
Additional In-class Quiz
📗 Sometimes a question that is not in the notes will be asked during the lecture. You can submit your answer here:
📗 To get full points on the in-class quizzes for a lecture:
➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.
➩ Some questions require [notes] to earn the point.
➩ Some questions require special ID (given during the lecture) to earn the point.
➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.
➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.
➩ The grade on Canvas Assignment Q6 is computed as number of points divided by the number of questions asked (out of 1) and updated on Canvas every weekend.
📗 If there are any issues with submission on the website, please use this Google form: Link.
📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).
📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content is generated using Copilot.