



# K Means Clustering

📗 K-means clustering (2-means, 3-means, ...) iteratively updates a fixed number of cluster centers: Link, Wikipedia.
➩ Start with K random cluster centers.
➩ Assign each item to its closest center.
➩ Update each cluster center to the mean of the items currently assigned to it (a minimal sketch of the loop follows this list).
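📗 A minimal NumPy sketch of the loop above (the function name, the convergence test, and initializing from K distinct items are my own choices, not fixed by the notes; it also assumes no cluster ever becomes empty):

```python
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    """Cluster the rows of X into K groups (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Start with K random cluster centers (here: K distinct training items).
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assign each item to its closest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update each cluster center to the mean of the items assigned to it.
        centers_new = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(centers_new, centers):
            break  # centers stopped moving, so the assignments are stable
        centers = centers_new
    return centers, labels
```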
TopHat Discussion
📗 [1 point] Given the following dataset, use k-means clustering to divide the points into groups. Drag the centers, then click on a center to move it to the mean of the points closest to it.



# Total Distortion

📗 K-means clustering tries to minimize the total squared distance from each item to its cluster center. This total is called the total distortion or inertia.
📗 Suppose the cluster centers are \(c_{1}, c_{2}, ..., c_{K}\), and the cluster center for an item \(x_{i}\) is \(c\left(x_{i}\right)\) (one of \(c_{1}, c_{2}, ..., c_{K}\)), then the total distortion is \(\left\|x_{1} - c\left(x_{1}\right)\right\|_{2}^{2} + \left\|x_{2} - c\left(x_{2}\right)\right\|_{2}^{2} + ... + \left\|x_{n} - c\left(x_{n}\right)\right\|_{2}^{2}\).
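📗 As a sanity check on the formula, the total distortion takes only a couple of lines of NumPy (a sketch; `centers` and `labels` are as returned by the `k_means` sketch above):

```python
import numpy as np

def total_distortion(X, centers, labels):
    """Sum of squared L2 distances from each item to its cluster center."""
    diffs = X - centers[labels]  # x_i - c(x_i) for every item, via fancy indexing
    return float((diffs ** 2).sum())
```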
Math Note
📗 The K-means procedure is similar to using the gradient descent method to minimize the total distortion: Wikipedia.
➩ The gradient of the total distortion with respect to the cluster center \(c_{k}\) is \(-2 \displaystyle\sum_{x : c\left(x\right) = c_{k}} \left(x - c_{k}\right)\); setting it to \(0\) gives the update step formula \(c_{k} = \dfrac{1}{n_{k}} \displaystyle\sum_{x: c\left(x\right) = c_{k}} x\), where \(n_{k}\) is the number of items that belong to cluster \(k\) and the sum is over all items in cluster \(k\).
➩ One issue with optimization algorithms like gradient descent is that they can converge to a local minimum that is not the global minimum. The same is true for K-means clustering: Wikipedia.
📗 [1 point] Move the point to see the derivative (the slope of the tangent line) of the function \(x^{2}\) and the gradient descent step taken with learning rate 0.5.
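📗 The demo can be reproduced in a few lines. A minimal sketch of gradient descent on \(f\left(x\right) = x^{2}\), whose derivative is \(f'\left(x\right) = 2 x\) (the starting point and step count below are arbitrary choices):

```python
def gradient_descent(x0, lr=0.1, n_steps=20):
    """Minimize f(x) = x^2 by stepping against the derivative f'(x) = 2x."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * (2 * x)  # x <- x - lr * f'(x)
    return x

print(gradient_descent(3.0))       # close to 0, the global minimum
print(gradient_descent(3.0, 0.5))  # exactly 0: with lr = 0.5, x - 0.5 * 2x = 0 in one step
```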

TopHat Quiz (Past Exam Question)
📗 [3 points] Perform k-means clustering on six points: \(x_{1}\) = , \(x_{2}\) = , \(x_{3}\) = , \(x_{4}\) = , \(x_{5}\) = , \(x_{6}\) = . Initially the cluster centers are at \(c_{1}\) = , \(c_{2}\) = . Run k-means for one iteration (assign the points, update the centers once, and reassign the points once). Break ties in distance by putting the point in the cluster with the smaller index (i.e. favor cluster 1). What is the reduction in total distortion? Use Euclidean distance and compute the total distortion by summing the squares of the individual distances to the centers.

📗 Note: the red points are the cluster centers and the other points are the training items.
📗 Answer: .




# Number of Clusters

📗 There are a few ways to choose the number of clusters K.
➩ K can be chosen based on prior knowledge about the items.
➩ K should not be chosen by minimizing the total distortion alone, since the total distortion can always be driven to \(0\) by setting \(K = n\) (every training item becomes its own cluster center).
➩ K can be chosen by minimizing the total distortion plus a regularizer, for example \(c \cdot m K \log\left(n\right)\), where \(c\) is a fixed constant and \(m\) is the number of features for each item (a sketch of this rule follows this list).
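📗 A sketch of the regularized rule, reusing the `k_means` and `total_distortion` sketches above (the value of the constant \(c\) is a tuning choice, not something the notes fix):

```python
import numpy as np

def choose_K(X, K_max, c=1.0):
    """Pick the K in 1..K_max minimizing distortion + c * m * K * log(n)."""
    n, m = X.shape
    scores = {}
    for K in range(1, K_max + 1):
        centers, labels = k_means(X, K)
        scores[K] = total_distortion(X, centers, labels) + c * m * K * np.log(n)
    return min(scores, key=scores.get)  # K with the smallest penalized distortion
```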
TopHat Quiz
📗 [1 point] Upload an image and use K-means clustering to group the pixels into \(K\) clusters; find an appropriate value of \(K\). Click on the image to run the clustering for a number of iterations.




# Initial Clusters

📗 There are a few ways to initialize the clusters: Link.
➩ The initial cluster centers can be randomly chosen in the domain.
➩ The initial cluster centers can be randomly chosen as \(K\) distinct items.
➩ The first cluster center can be a random item, the second can be the item farthest from the first center, the third can be the item farthest from the first two centers, and so on (a sketch of this rule follows).
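📗 A sketch of the farthest-first rule in the last bullet, where an item's distance to a set of chosen centers is taken to be the distance to its nearest center (a standard reading of "farthest from the first two"; the function name is my own):

```python
import numpy as np

def farthest_first_init(X, K, seed=0):
    """Pick K initial centers: a random item, then repeatedly the item
    farthest (in squared Euclidean distance) from all centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Each item's distance to its nearest already-chosen center.
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[dists.argmax()])  # take the item farthest from the set
    return np.array(centers)
```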



# K Nearest Neighbor

📗 The K Nearest Neighbor algorithm (not related to K-means) is a simple supervised learning algorithm: to predict the label of a new item, it uses the \(K\) items from the training set that are closest to the new item: Link, Wikipedia.
➩ 1 nearest neighbor copies the label of the closest item.
➩ 3 nearest neighbor finds the majority label of the three closest items.
➩ N nearest neighbor uses the majority label of the training set (of size N) to predict the label of every new item.
📗 The distance measure used in K nearest neighbor can be any of the \(L_{p}\) distances.
➩ \(L_{1}\) Manhattan distance.
➩ \(L_{2}\) Euclidean distance.
➩ \(L_{\infty}\) Chebyshev distance: the maximum absolute difference over all features (the sketch below computes all three).
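📗 The three distances in code (a small sketch; the two vectors are made-up examples):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

l1 = np.abs(a - b).sum()            # Manhattan: |1-4| + |2-0| + |3-3| = 5
l2 = np.sqrt(((a - b) ** 2).sum())  # Euclidean: sqrt(9 + 4 + 0) ~ 3.61
linf = np.abs(a - b).max()          # L_infinity: max(3, 2, 0) = 3
```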
TopHat Discussion
📗 [1 point] Find the value of K for K nearest neighbor that is most appropriate for the dataset. Click on an existing point to perform leave-one-out cross validation, and click on a new point to find its nearest neighbors.



# Training Set Accuracy

📗 For 1NN, the accuracy of the prediction on the training set is always 100 percent, since every training item is its own nearest neighbor.
📗 When comparing the accuracy of KNN for different values of K (called hyperparameter tuning), training set accuracy is therefore not a useful measure.
📗 K fold cross validation is often used instead to measure the performance of a supervised learning algorithm on the training set.
➩ The training set is divided into K groups (K can be different from the K in KNN).
➩ Train the model on K - 1 of the groups and compute the accuracy on the remaining group.
➩ Repeat this process K times, holding out a different group each time, and report the average of the K accuracies.
📗 K fold cross validation with \(K = n\) is called Leave One Out Cross Validation (LOOCV). A sketch of both follows.
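📗 A minimal sketch of K fold cross validation for a 1NN classifier with Manhattan distance (contiguous folds, as in the quiz below; `argmin` already breaks distance ties toward the smaller index; `X` and `y` are assumed to be NumPy arrays):

```python
import numpy as np

def one_nn_predict(X_train, y_train, x):
    """Label of the training item closest to x in Manhattan (L1) distance."""
    dists = np.abs(X_train - x).sum(axis=1)
    return y_train[dists.argmin()]  # argmin favors the smaller index on ties

def k_fold_accuracy(X, y, n_folds):
    """Average 1NN accuracy over n_folds contiguous folds.
    n_folds = len(X) gives leave-one-out cross validation (LOOCV)."""
    accs = []
    for fold in np.array_split(np.arange(len(X)), n_folds):
        mask = np.ones(len(X), dtype=bool)
        mask[fold] = False  # hold out this fold, train on the rest
        preds = [one_nn_predict(X[mask], y[mask], X[i]) for i in fold]
        accs.append(np.mean(np.array(preds) == y[fold]))
    return float(np.mean(accs))
```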
TopHat Quiz
📗 [4 points] Given the following training data, what is the fold cross validation accuracy if the NN (Nearest Neighbor) classifier with Manhattan distance is used? The first fold is the first instances, the second fold is the next instances, etc. Break ties (in distance) by using the instance with the smaller index. Enter a number between 0 and 1.
(table of training features \(x_{i}\) and labels \(y_{i}\) omitted)

📗 Answer: .




📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.






Last Updated: November 21, 2024 at 3:16 AM