


# K Nearest Neighbor

📗 K Nearest Neighbor (not related to K Means) is a simple supervised learning algorithm that predicts the label of a new item using the \(K\) items from the training set that are closest to it: Link, Wikipedia.
➩ 1 nearest neighbor copies the label of the closest item.
➩ 3 nearest neighbor finds the majority label of the three closest items.
➩ N nearest neighbor uses the majority label of the training set (of size N) to predict the label of every new item.
📗 The distance measure used in K nearest neighbor can be any of the \(L_{p}\) distances.
➩ \(L_{1}\) Manhattan distance.
➩ \(L_{2}\) Euclidean distance.
➩ \(L_{\infty}\) Chebyshev distance: the maximum absolute difference across all features.
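📗 A minimal Python sketch (not part of the course materials) of K nearest neighbor prediction with a configurable \(L_{p}\) distance; the function names and the toy data below are made up for illustration.
```python
import numpy as np

def lp_distance(a, b, p):
    """L_p distance between two feature vectors; p = np.inf gives the maximum absolute difference."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return diff.max() if np.isinf(p) else (diff ** p).sum() ** (1.0 / p)

def knn_predict(x_train, y_train, x_new, k=3, p=2):
    """Predict the label of x_new by majority vote among its k nearest training items."""
    distances = [lp_distance(x, x_new, p) for x in x_train]
    nearest = np.argsort(distances)[:k]                      # indices of the k closest items
    labels, counts = np.unique(np.asarray(y_train)[nearest], return_counts=True)
    return labels[np.argmax(counts)]                         # majority label among the neighbors

# toy training set: two clusters with labels 0 and 1
x_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]]
y_train = [0, 0, 0, 1, 1]
print(knn_predict(x_train, y_train, [4, 4], k=3, p=2))       # -> 1 (3NN, Euclidean)
print(knn_predict(x_train, y_train, [1, 1], k=1, p=1))       # -> 0 (1NN, Manhattan)
```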
TopHat Discussion ID:
📗 [1 points] Find the value of K for K nearest neighbor that is the most appropriate for the dataset. Click on an existing point to perform leave-one-out cross validation, and click on a new point to find the nearest neighbor.




# Training Set Accuracy

📗 For 1NN, the accuracy of the prediction on the training set is always 100 percent, since each training item is its own nearest neighbor (at distance \(0\)).
📗 When comparing the accuracy of KNN for different values of K (called hyperparameter tuning), training set accuracy is not a good measure, since it is always maximized at \(K = 1\).
📗 K fold cross validation is often used instead to measure the performance of a supervised learning algorithm on the training set.
➩ The training set is divided into K groups (K can be different from the K in KNN).
➩ Train the model on K - 1 groups and compute the accuracy on the remaining group.
➩ Repeat this process K times, holding out each group once, and average the K accuracies.
📗 K fold cross validation with \(K = n\), where \(n\) is the number of training items, is called Leave One Out Cross Validation (LOOCV).
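📗 A minimal sketch of K fold cross validation for a 1NN classifier with Manhattan distance, using consecutive folds as in the quiz below; the helper names and the toy data are made up.
```python
import numpy as np

def one_nn_predict(x_train, y_train, x_new):
    """1NN with Manhattan (L1) distance; ties are broken by the smaller index (argmin returns the first minimum)."""
    d = np.abs(np.asarray(x_train, dtype=float) - np.asarray(x_new, dtype=float)).sum(axis=1)
    return np.asarray(y_train)[np.argmin(d)]

def k_fold_accuracy(x, y, k_folds):
    """Split the data into k consecutive folds, train on k - 1 folds, test on the held-out fold,
    and average the k accuracies. Setting k_folds = len(x) gives leave-one-out cross validation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    folds = np.array_split(np.arange(len(x)), k_folds)
    accuracies = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x)), test_idx)
        correct = [one_nn_predict(x[train_idx], y[train_idx], x[i]) == y[i] for i in test_idx]
        accuracies.append(np.mean(correct))
    return float(np.mean(accuracies))

# made-up one-dimensional training data
x = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]
print(k_fold_accuracy(x, y, k_folds=len(x)))   # LOOCV accuracy, 1.0 on this toy data
```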
TopHat Quiz ID:
📗 [4 points] Given the following training data, what is the fold cross validation accuracy if NN (Nearest Neighbor) classifier with Manhattan distance is used. The first fold is the first instances, the second fold is the next instances, etc. Break the tie (in distance) by using the instance with the smaller index. Enter a number between 0 and 1.
\(x_{i}\)
\(y_{i}\)

📗 Answer: .




# Decision Tree

📗 Another simple supervised learning algorithm is the decision tree.
➩ Find the feature that is the most informative.
➩ Split the training set into subsets based on this feature.
➩ Repeat on each of the subsets recursively until all labels in a subset are the same (or no features are left to split on).
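📗 A minimal sketch of a single split (a decision stump) on one continuous feature, scored by the total number of training mistakes as in the first discussion below; the information gain criterion is defined in the later sections. The data and function names are made up.
```python
import numpy as np

def best_stump_by_mistakes(x, y):
    """Try a threshold at every midpoint between consecutive sorted values of one continuous
    feature and return the split with the fewest training mistakes."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_mistakes = None, len(y) + 1
    for t in (x[:-1] + x[1:]) / 2:                        # candidate thresholds
        left, right = y[x <= t], y[x > t]
        # each side predicts its majority label, so its mistakes are the minority count
        mistakes = min((left == 0).sum(), (left == 1).sum()) \
                 + min((right == 0).sum(), (right == 1).sum())
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t, int(best_mistakes)

x = [1, 2, 3, 4, 5, 6]                 # made-up values of one continuous feature
y = [0, 0, 0, 1, 0, 1]                 # made-up binary labels
print(best_stump_by_mistakes(x, y))    # -> (3.5, 1)
```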
TopHat Discussion ID:
📗 [1 points] Find the thresholds for the decision tree that iteratively optimize (i) the total number of mistakes and (ii) the information gain. Move the black circle to the threshold for the first feature, and move the two black squares to the thresholds for the second feature.

TopHat Discussion ID:
📗 [1 points] Find the threshold for the decision stump that maximizes the information gain. Move the black circle to the threshold.




# Measure of Uncertainty

📗 Let \(p_{0}\) be the fraction of items in a training set with label \(0\) and \(p_{1}\) be the fraction of items with label \(1\).
➩ If \(p_{0} = 0, p_{1} = 1\), the outcome is certain, so there is no uncertainty, the measure of uncertainty should be 0.
➩ If \(p_{0} = 1, p_{1} = 0\), the outcome is certain, so there is no uncertainty, the measure of uncertainty should be 0.
➩ If \(p_{0} = \dfrac{1}{2} , p_{1} = \dfrac{1}{2}\), the outcome is the most uncertain, so the measure of uncertainty should be at its maximum value, which is \(1\) for the entropy defined below.
📗 One measure of uncertainty that satisfies the above conditions is entropy: \(H = p_{0} \log_{2} \left(\dfrac{1}{p_{0}}\right) + p_{1} \log_{2} \left(\dfrac{1}{p_{1}}\right)\) or equivalently \(H = -p_{0} \log_{2}\left(p_{0}\right) - p_{1} \log_{2}\left(p_{1}\right)\), with the convention \(0 \log_{2}\left(0\right) = 0\).
📗 The realized value of something uncertain is more informative than the value of something certain.
Example
📗 Plot the entropy \(H = -p_{0} \log_{2}\left(p_{0}\right) - \left(1 - p_{0}\right) \log_{2}\left(1 - p_{0}\right)\) as a function of \(p_{0}\) from \(0\) to \(1\).
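📗 A small sketch that evaluates this entropy at a few values of \(p_{0}\), using the convention \(0 \log_{2} 0 = 0\); the values are only for illustration.
```python
import numpy as np

def binary_entropy(p0):
    """H = -p0 * log2(p0) - p1 * log2(p1) with p1 = 1 - p0, and 0 * log2(0) taken to be 0."""
    p = np.array([p0, 1.0 - p0])
    p = p[p > 0]                               # drop zero probabilities: they contribute 0
    return float(-(p * np.log2(p)).sum())

for p0 in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print(p0, round(binary_entropy(p0), 4))
# H is 0 when the outcome is certain (p0 = 0 or 1) and reaches its maximum of 1 at p0 = 0.5
```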




# Entropy

📗 In general, if there are K classes (\(y = \left\{1, 2, ..., K\right\}\)), and \(p_{y}\) is the fraction of the training set with label \(y\), then the entropy of \(y\) is \(H\left(y\right) = -p_{1} \log_{2}\left(p_{1}\right) -p_{2} \log_{2}\left(p_{2}\right) - ... -p_{K} \log_{2}\left(p_{K}\right)\).
📗 Conditional entropy is the expected entropy of the conditional distribution: \(H\left(y|x\right) = q_{1} H\left(y|x=1\right) + q_{2} H\left(y|x=2\right) + ... + q_{K_{x}} H\left(y|x=K_{x}\right)\), where \(K_{x}\) is the number of possible values of the feature \(x\) and \(q_{k}\) is the fraction of training items with \(x = k\). Here \(H\left(y|x=k\right) = -p_{y=1|x=k} \log_{2}\left(p_{y=1|x=k}\right) - p_{y=2|x=k} \log_{2}\left(p_{y=2|x=k}\right) - ... - p_{y=K|x=k} \log_{2}\left(p_{y=K|x=k}\right)\), where \(p_{y=c|x=k}\) is the fraction of training items with label \(c\) among the items with \(x = k\).
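📗 A minimal sketch that computes \(H\left(y\right)\) and \(H\left(y|x\right)\) for a categorical feature from label counts; the toy data is made up.
```python
import numpy as np

def entropy(labels):
    """H(y) = -sum over classes c of p_c log2(p_c), where p_c is the fraction with label c."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(labels, feature):
    """H(y|x) = sum over feature values k of q_k H(y | x = k), where q_k is the fraction with x = k."""
    labels, feature = np.asarray(labels), np.asarray(feature)
    values, counts = np.unique(feature, return_counts=True)
    q = counts / counts.sum()
    return float(sum(qk * entropy(labels[feature == k]) for qk, k in zip(q, values)))

# made-up data: x is a categorical feature, y the binary label
x = [1, 1, 1, 2, 2, 2, 2, 2]
y = [0, 0, 1, 1, 1, 1, 1, 0]
print(entropy(y))                       # H(y)   ~ 0.9544
print(conditional_entropy(y, x))        # H(y|x) ~ 0.7956
```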



# Information Gain

📗 The information gain from a feature \(x\) is defined as the difference between the entropy of the label and the conditional entropy of the label given that feature: \(I\left(y|x\right) = H\left(y\right) - H\left(y|x\right)\).
📗 The larger the information gain, the larger the reduction in uncertainty, and the better predictor the feature is.
📗 A decision tree iteratively splits the training set based on the feature with the largest information gain. This algorithm is called ID3 (Iterative Dichotomiser 3): Wikipedia.
➩ Find feature \(j\) so that \(I\left(y | x_{ij}\right)\) is the largest.
➩ Split the training set into the set with \(x_{ij} = 1\), \(x_{ij} = 2\), ..., \(x_{ij} = K_{j}\).
➩ Repeat the process on each of the subsets to create a tree.
📗 For continuous features, construct all possible splits and find the one that yields the largest information gain: this is the same as creating new variables \(z_{ij} = 1\) if \(x_{ij} \leq t_{j}\) and \(z_{ij} = 0\) if \(x_{ij} > t_{j}\).
➩ In practice, an efficient way to create the candidate binary splits is to use the midpoints between consecutive items (sorted by the feature value) with different labels.
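📗 A minimal sketch of the threshold search for one continuous feature; for simplicity it tries every midpoint between consecutive sorted values rather than only those between items with different labels. Names and data are made up.
```python
import numpy as np

def entropy(labels):
    """H = -sum_c p_c log2(p_c), with the convention 0 * log2(0) = 0."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_threshold(x, y):
    """Return the threshold t that maximizes the information gain of the split z = 1{x <= t}."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    order = np.argsort(x)
    x, y = x[order], y[order]
    base = entropy(y)                                        # H(y) before the split
    best_t, best_gain = None, -1.0
    for t in (x[:-1] + x[1:]) / 2:                           # candidate thresholds
        left, right = y[x <= t], y[x > t]
        q_left = len(left) / len(y)
        cond = q_left * entropy(left) + (1 - q_left) * entropy(right)   # H(y|z)
        if base - cond > best_gain:
            best_t, best_gain = t, base - cond
    return best_t, best_gain

x = [1, 2, 3, 4, 5, 6]                  # made-up continuous feature
y = [0, 0, 0, 1, 1, 1]
print(best_threshold(x, y))             # -> (3.5, 1.0): a perfect split
```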
TopHat Quiz (Past Exam Question) ID:
📗 [4 points] "It" has a house with many doors. A random door is about to be opened with equal probability. Doors to have monsters that eat people. Doors to are safe. With sufficient bribe, Pennywise will answer your question "Will door 1 be opened?" What's the information gain (also called mutual information) between Pennywise's answer and your encounter with a monster?
📗 Answer: .




# Pruning

📗 Decision trees can be pruned by replacing a subtree with a leaf when the accuracy on a validation set with the leaf is equal to or higher than the accuracy with the subtree. This method is called Reduced Error Pruning: Wikipedia.
➩ A validation set is a subset of the training set that is set aside when training the decision tree and only used for pruning the tree.
➩ The items used to train the decision tree cannot be used to prune the tree.
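📗 A minimal sketch of reduced error pruning on a hand-built tree; the dictionary representation of a node (feature index, threshold, two subtrees, and the majority training label) is an assumption made for this example.
```python
# A node is either a leaf (a plain label) or a dict describing a test on one feature.
def predict(node, x):
    """Follow the tests from the root down to a leaf and return its label."""
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node

def prune(node, val_x, val_y):
    """Reduced error pruning, bottom-up: replace a subtree by a leaf with the node's majority
    training label whenever the leaf is at least as accurate on the validation items that reach
    this node (validation items routed elsewhere are unaffected by the replacement)."""
    if not isinstance(node, dict):
        return node
    goes_left = [x[node["feature"]] <= node["threshold"] for x in val_x]
    node["left"] = prune(node["left"],
                         [x for x, g in zip(val_x, goes_left) if g],
                         [y for y, g in zip(val_y, goes_left) if g])
    node["right"] = prune(node["right"],
                          [x for x, g in zip(val_x, goes_left) if not g],
                          [y for y, g in zip(val_y, goes_left) if not g])
    subtree_correct = sum(predict(node, x) == y for x, y in zip(val_x, val_y))
    leaf_correct = sum(node["majority"] == y for y in val_y)
    return node["majority"] if leaf_correct >= subtree_correct else node

# made-up tree and validation set
tree = {"feature": 0, "threshold": 5, "majority": 0,
        "left": {"feature": 1, "threshold": 2, "majority": 0, "left": 0, "right": 1},
        "right": 1}
val_x, val_y = [[1, 1], [2, 3], [7, 0], [8, 4]], [0, 0, 1, 1]
print(prune(tree, val_x, val_y))        # the left subtree is replaced by the leaf 0
```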
TopHat Quiz ID:
📗 [1 points] Given the following decision tree, click on each node to check the validation set accuracy when it is pruned (the node is replaced by a leaf with the majority label of the training subset for that node), and find the decision tree after pruning is done. Validation set accuracy: .




# Random Forest

📗 Smaller training sets can be created by sampling from the complete training set, and different decision trees can be trained on these smaller training sets (and only using a subset of the features). This is called bagging (or Bootstrap AGGregatING): Link, Wikipedia.
➩ Training items are sampled with replacement.
➩ Features are sampled without replacement.
📗 The label of a new item can be predicted by the majority vote of the decision trees trained on these smaller training sets. These trees form a random forest: Wikipedia.
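📗 A minimal sketch of the bagging step (items sampled with replacement, features sampled without replacement) and of the majority vote; training the individual trees is omitted here and could reuse the threshold search sketched earlier. The data is made up.
```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(x, y, n_features):
    """Create the training data for one tree: rows with replacement, a feature subset without replacement."""
    x, y = np.asarray(x), np.asarray(y)
    rows = rng.integers(0, len(x), size=len(x))                      # items: with replacement
    cols = rng.choice(x.shape[1], size=n_features, replace=False)    # features: without replacement
    return x[np.ix_(rows, cols)], y[rows], cols

def majority_vote(predictions):
    """Random forest prediction: the most common label among the trees' predictions."""
    labels, counts = np.unique(predictions, return_counts=True)
    return labels[np.argmax(counts)]

# made-up training set with 4 features
x = np.array([[0, 1, 2, 3], [1, 1, 0, 2], [2, 0, 1, 1], [3, 2, 2, 0]])
y = np.array([0, 0, 1, 1])
xb, yb, cols = bootstrap_sample(x, y, n_features=2)   # data for one tree in the forest
print(cols, xb, yb, sep="\n")
print(majority_vote([1, 0, 1]))                        # -> 1
```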



# Adaptive Boosting

📗 Decision trees can also be trained sequentially. The items that are classified incorrectly by the previous trees are made more important when training the next decision tree.
📗 Each training item has a weight representing how important it is when training each decision tree, and the weights are updated based on the errors made by the previous decision trees. This is called AdaBoost (ADAptive BOOSTing): Wikipedia.
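📗 A minimal sketch of one round of the standard AdaBoost weight update with labels in \(\left\{-1, +1\right\}\); the weak classifier's predictions below are made up for illustration.
```python
import numpy as np

def adaboost_weight_update(weights, y_true, y_pred):
    """One AdaBoost round: compute the weighted error of the current weak classifier, its vote
    weight alpha, and the new item weights that make misclassified items more important."""
    weights, y_true, y_pred = map(np.asarray, (weights, y_true, y_pred))
    eps = weights[y_true != y_pred].sum() / weights.sum()    # weighted training error
    alpha = 0.5 * np.log((1 - eps) / eps)                    # vote weight of this classifier
    new_w = weights * np.exp(-alpha * y_true * y_pred)       # up-weight mistakes, down-weight correct items
    return alpha, new_w / new_w.sum()                        # renormalize so the weights sum to 1

w = np.full(5, 0.2)                        # start with equal weights on 5 items
y_true = np.array([+1, +1, -1, -1, +1])
y_pred = np.array([+1, -1, -1, -1, +1])    # the weak classifier misclassifies the second item
alpha, w = adaboost_weight_update(w, y_true, y_pred)
print(round(alpha, 3), np.round(w, 3))     # the misclassified item now carries half of the total weight
```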



📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.
📗 Please use Ctrl+F5 or Shift+F5 or Shift+Command+R or Incognito mode or Private Browsing to refresh the cached JavaScript.
📗 If you missed the TopHat quiz questions, please submit the form: Form.
📗 Anonymous feedback can be submitted to: Form.






Last Updated: November 18, 2024 at 11:43 PM