Young Wu's Homepage

# Unsupervised Learning

📗 Supervised learning: \(\left(x_{1}, y_{1}\right), \left(x_{2}, y_{2}\right), ..., \left(x_{n}, y_{n}\right)\).

📗 Unsupervised learning: \(\left(x_{1}\right), \left(x_{2}\right), ..., \left(x_{n}\right)\).

➩ Clustering: separates items into groups.

➩ Novelty (outlier) detection: finds items that are different (two groups).

➩ Dimensionality reduction: represents each item by a lower dimensional feature vector while maintaining key characteristics.

📗 Unsupervised learning applications:

➩ Google news.

➩ Google photo.

➩ Image segmentation.

➩ Text processing.

➩ Data visualization.

➩ Efficient storage.

➩ Noise removal.

# Hierarchical Clustering

📗 Hierarchical clustering iteratively merges groups: Link, Wikipedia.

➩ Start with each item as its own cluster.

➩ Repeatedly merge the two clusters that are closest to each other.

➩ The result is a binary tree with close clusters as children of the same node.

In-class Discussion

📗 [1 points] Given the following dataset, use hierarchical clustering to divide the points into groups. Drag one point to another point to merge them into one cluster. Click on a point to move it out of the cluster.

# Distance between Points

📗 Distance between points in \(m\) dimensional space is usually measured by Euclidean distance (also called \(L_{2}\) distance).

➩ Euclidean distance (\(L_{2}\)): \(\left\|x_{i} - x_{j}\right\|_{2} = \sqrt{\left(x_{i 1} - x_{j 1}\right)^{2} + \left(x_{i 2} - x_{j 2}\right)^{2} + ... + \left(x_{i m} - x_{j m}\right)^{2}}\): Wikipedia.

📗 Distances can also be measured by \(L_{1}\) or \(L_{\infty}\) distances.

➩ Manhattan distance (\(L_{1}\)): \(\left\|x_{i} - x_{j}\right\|_{1} = \left| x_{i 1} - x_{j 1} \right| + \left| x_{i 2} - x_{j 2} \right| + ... + \left| x_{i m} - x_{j m} \right|\): Wikipedia.

➩ Chebyshev distance (\(L_{\infty}\)): \(\left\|x_{i} - x_{j}\right\|_{\infty} = \displaystyle\max\left\{\left| x_{i 1} - x_{j 1} \right|, \left| x_{i 2} - x_{j 2} \right|, ..., \left| x_{i m} - x_{j m} \right|\right\}\): Wikipedia
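📗 The three distances above can be sketched in a few lines of Python. The function names and the example points are illustrative choices, not part of the course code.

```python
import math

def euclidean(x_i, x_j):
    """L2 distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def manhattan(x_i, x_j):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x_i, x_j))

def chebyshev(x_i, x_j):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x_i, x_j))

# Example points chosen for illustration (m = 2 features).
x_i, x_j = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x_i, x_j))  # 5.0
print(manhattan(x_i, x_j))  # 7.0
print(chebyshev(x_i, x_j))  # 4.0
```

For the same pair of points, the three values always satisfy \(L_{\infty} \leq L_{2} \leq L_{1}\).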

In-class Discussion

📗 [1 points] Move the green point so that it is within 100 pixels of the red point measured by the distance. Highlight the region containing all points within 100 pixels of the red point.

# Distance between Clusters

📗 Distance between clusters (groups of points) can be measured by single linkage distance, complete linkage distance, or average linkage distance.

➩ Single linkage distance: the shortest distance from any item in one cluster to any item in the other cluster: Wikipedia.

➩ Complete linkage distance: the longest distance from any item in one cluster to any item in the other cluster: Wikipedia.

➩ Average linkage distance: the average of the distances between every item in one cluster and every item in the other cluster (average of distances, not distance between averages): Wikipedia.
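📗 A minimal sketch of the three linkage distances, and of the merge loop of hierarchical clustering built on them, assuming Euclidean distance between points. The function names, the `method` argument and the example data are hypothetical, and ties between equally close pairs are broken by taking the first pair found.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters (lists of points) under a linkage rule."""
    dists = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(dists)               # shortest pairwise distance
    if method == "complete":
        return max(dists)               # longest pairwise distance
    if method == "average":
        return sum(dists) / len(dists)  # average of all pairwise distances
    raise ValueError(f"unknown linkage method: {method}")

def agglomerative(points, num_clusters, method="single"):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]    # start with each item as its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merged cluster keeps the smaller index
        del clusters[j]
    return clusters

red, blue = [(0, 0), (0, 1)], [(3, 0), (4, 1)]
print(linkage_distance(red, blue, "single"))    # 3.0
print(linkage_distance(red, blue, "complete"))  # sqrt(17), about 4.123

points = [(0,), (1,), (5,), (6,), (20,)]
print(agglomerative(points, 2))  # [[(0,), (1,), (5,), (6,)], [(20,)]]
```

This version recomputes every between-cluster distance in each round, which is simple but slow. Updating a distance table after each merge gives the same result with less work.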

In-class Discussion

📗 [1 points] Highlight the Euclidean distance between the two clusters (red and blue) measured by the linkage distance.

In-class Quiz

📗 [4 points] You are given the distance table. Consider the next iteration of hierarchical agglomerative clustering (another name for the hierarchical clustering method we covered in the lectures) using linkage. What will the new values be in the resulting distance table corresponding to the new clusters? If you merge two columns (rows), put the new distances in the column (row) with the smaller index. For example, if you merge columns 2 and 4, the new column 2 should contain the new distances and column 4 should be removed, i.e. the columns and rows should be in the order (1), (2 and 4), (3), (5).

\(d\) =

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .

# Number of Clusters

📗 The number of clusters should be chosen based on prior knowledge about the dataset.

📗 The algorithm can also stop merging as soon as all the between-cluster distances are larger than some fixed threshold.

📗 The binary tree generated by hierarchical clustering is often called a dendrogram: Wikipedia.

# K Means Clustering

📗 K-means clustering (2-means, 3-means, ...) iteratively updates a fixed number of cluster centers: Link, Wikipedia.

➩ Start with K random cluster centers.

➩ Assign each item to its closest center.

➩ Update each cluster center to be the center (mean) of the items assigned to it.
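📗 The assign and update steps above can be sketched as follows. This is a minimal sketch under some assumptions: the initial centers are passed in rather than drawn at random, ties go to the center with the smaller index, and a center with no assigned items is left where it is. All names are illustrative.

```python
import math

def assign(points, centers):
    """Index of the closest center for each point (ties go to the smaller index)."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(old)  # assumption: an empty cluster keeps its center
            continue
        mean = tuple(sum(coord) / len(members) for coord in zip(*members))
        new_centers.append(mean)
    return new_centers

def k_means(points, initial_centers, max_iter=100):
    """Alternate the assign and update steps until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        new_centers = update(points, assign(points, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, labels = k_means(points, [(0, 0), (1, 0)])
print(centers)  # [(0.0, 1.0), (10.0, 1.0)]
print(labels)   # [0, 0, 1, 1]
```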

In-class Discussion

📗 [1 points] Given the following dataset, use k-means clustering to divide the points into groups. Move the centers and click on the center to move it to the center of the points closest to the center.

# Total Distortion

📗 K-means clustering tries to minimize the total squared distance from the items to their cluster centers. This total is called the total distortion or inertia.

📗 Suppose the cluster centers are \(c_{1}, c_{2}, ..., c_{K}\), and the cluster center for an item \(x_{i}\) is \(c\left(x_{i}\right)\) (one of \(c_{1}, c_{2}, ..., c_{K}\)), then the total distortion is \(\left\|x_{1} - c\left(x_{1}\right)\right\|_{2}^{2} + \left\|x_{2} - c\left(x_{2}\right)\right\|_{2}^{2} + ... + \left\|x_{n} - c\left(x_{n}\right)\right\|_{2}^{2}\).
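📗 A small sketch of the total distortion formula, assuming \(c\left(x_{i}\right)\) is the center closest to \(x_{i}\). The function name and the example points are illustrative.

```python
def total_distortion(points, centers):
    """Sum over all points of the squared Euclidean distance to the closest center."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
before = total_distortion(points, [(0, 0), (1, 0)])   # centers before an update
after = total_distortion(points, [(0, 1), (10, 1)])   # centers after moving to the cluster means
print(before, after, before - after)  # 170 4 166
```

The last number is the reduction in total distortion from one update of the centers.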

Math Note

📗 The K means procedure is similar to the gradient descent method to minimize the total distortion: Wikipedia.

➩ The gradient of the total distortion with respect to a cluster center \(c_{k}\) is \(-2 \displaystyle\sum_{x : c\left(x\right) = c_{k}} \left(x - c_{k}\right)\). Setting this to \(0\) gives the update step formula \(c_{k} = \dfrac{1}{n_{k}} \displaystyle\sum_{x: c\left(x\right) = c_{k}} x\), where \(n_{k}\) is the number of items that belong to cluster \(k\), and the sum is over all items in cluster \(k\).

➩ One issue with some optimization algorithms like gradient descent is that they sometimes converge to local minima that are not the global minimum. This is also the case for K means clustering: Wikipedia.
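📗 The update formula can be derived step by step as follows, holding the assignment of items to clusters fixed. The symbol \(J\) for the total distortion is a name introduced here for illustration.

```latex
\begin{aligned}
J\left(c_{1}, ..., c_{K}\right) &= \sum_{k=1}^{K} \sum_{x : c\left(x\right) = c_{k}} \left\|x - c_{k}\right\|_{2}^{2}, \\
\frac{\partial J}{\partial c_{k}} &= -2 \sum_{x : c\left(x\right) = c_{k}} \left(x - c_{k}\right)
  = -2 \left( \sum_{x : c\left(x\right) = c_{k}} x - n_{k} c_{k} \right), \\
\frac{\partial J}{\partial c_{k}} = 0 &\iff c_{k} = \frac{1}{n_{k}} \sum_{x : c\left(x\right) = c_{k}} x .
\end{aligned}
```

Only the terms of cluster \(k\) depend on \(c_{k}\), so each center can be updated independently of the others.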

📗 [1 points] Move the point and change the learning rate to see the derivatives (slope of tangent line) of the function \(x^{2}\). Find an initial point + learning rate combination so that gradient descent will not find the global minimum.

In-class Quiz

📗 [3 points] Perform k-means clustering on six points: \(x_{1}\) = , \(x_{2}\) = , \(x_{3}\) = , \(x_{4}\) = , \(x_{5}\) = , \(x_{6}\) = . Initially the cluster centers are at \(c_{1}\) = , \(c_{2}\) = . Run k-means for one iteration (assign the points, update center once and reassign the points once). Break ties in distances by putting the point in the cluster with the smaller index (i.e. favor cluster 1). What is the reduction in total distortion? Use Euclidean distance and calculate the total distortion by summing the squares of the individual distances to the center.

📗 Note: the red points are the cluster centers and the other points are the training items.

📗 Answer: .

# Number of Clusters

📗 There are a few ways to choose the number of clusters K.

➩ K can be chosen based on prior knowledge about the items.

➩ K cannot be chosen by minimizing total distortion since the total distortion is always minimized at \(0\) when \(K = n\) (number of clusters = number of training items).

➩ K can be chosen by minimizing total distortion plus some regularizer, for example, \(c \cdot m K \log\left(n\right)\) where \(c\) is a fixed constant and \(m\) is the number of features for each item.
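📗 A sketch of choosing K by minimizing total distortion plus the regularizer \(c \cdot m K \log\left(n\right)\). To keep the example deterministic, the small k-means routine below uses the first K items as initial centers. That choice, the function names and the toy data are assumptions made for illustration.

```python
import math

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means_distortion(points, k, max_iter=100):
    """Run k-means from the first k items and return the final total distortion."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def choose_k(points, c=1.0):
    """Return the K in 1..n minimizing distortion + c * m * K * log(n), with all scores."""
    n, m = len(points), len(points[0])
    scores = {k: k_means_distortion(points, k) + c * m * k * math.log(n)
              for k in range(1, n + 1)}
    return min(scores, key=scores.get), scores

points = [(0,), (1,), (10,), (11,)]
print(choose_k(points, c=1.0)[0])  # 2
print(choose_k(points, c=0.0)[0])  # 4 (no penalty: distortion alone is minimized at K = n)
```

On this toy dataset the penalized score prefers the two visible groups, while the unpenalized distortion keeps decreasing until every item is its own cluster.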

In-class Quiz

📗 [1 points] Upload an image and use K-means clustering to group the pixels into \(K\) clusters. Find an appropriate value of \(K\):
. Click on the image to perform the clustering for iterations.

# Initial Clusters

📗 There are a few ways to initialize the clusters: Link.

➩ The initial cluster centers can be randomly chosen in the domain.

➩ The initial cluster centers can be randomly chosen as \(K\) distinct items.

➩ The first cluster center can be a random item, the second cluster center can be the item that is the farthest from the first item, the third cluster center can be the item that is the farthest from the first two items, ...
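📗 The third (farthest-point) initialization can be sketched as below. One assumption is made explicit: "farthest from the first two items" is read as the item whose distance to its nearest already-chosen center is largest. The `first` and `seed` parameters are hypothetical conveniences for reproducibility.

```python
import math
import random

def farthest_point_init(points, k, first=None, seed=None):
    """Pick k initial centers: a starting item, then repeatedly the farthest item."""
    rng = random.Random(seed)
    start = rng.randrange(len(points)) if first is None else first
    centers = [points[start]]
    while len(centers) < k:
        # Distance of an item to the chosen centers = distance to the nearest one.
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

points = [(0, 0), (1, 0), (10, 0), (10, 10)]
print(farthest_point_init(points, 3, first=0))  # [(0, 0), (10, 10), (10, 0)]
```

This initialization spreads the centers out, but it is sensitive to outliers because an outlier is often the farthest item.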

# Graph-Based Clustering

📗 Given a graph \(G = \left(V, E\right)\):

➩ The vertices are items.

➩ The edges encode similarity between items, for example, k-nearest neighbor graph (unweighted, \(w_{ij} = 1\) if \(i\) is one of the k nearest neighbors of \(j\) and \(0\) otherwise), or fully connected similarity graph (weighted, \(w_{ij} = \exp\left(- \dfrac{\left\|x_{i} - x_{j}\right\|^{2}}{2 \sigma^{2}}\right)\)).
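📗 A sketch of both graph constructions using NumPy. The function names are illustrative. Two choices below are assumptions not stated above: the Gaussian weights are given a zero diagonal (no self-loops), and the k-nearest-neighbor matrix is optionally symmetrized so the graph is undirected.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Matrix of squared Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

def knn_adjacency(X, k):
    """A[i, j] = 1 if item i is one of the k nearest neighbors of item j, else 0."""
    d2 = pairwise_sq_dists(X)
    np.fill_diagonal(d2, np.inf)             # an item is not its own neighbor
    nearest = np.argsort(d2, axis=1)[:, :k]  # row j lists the k nearest neighbors of j
    A = np.zeros_like(d2)
    for j, neighbors in enumerate(nearest):
        A[neighbors, j] = 1.0
    return A

def gaussian_weights(X, sigma):
    """W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), with a zero diagonal (assumption)."""
    W = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

X = [[0.0], [1.0], [3.0]]
A = knn_adjacency(X, k=1)
print(A)                   # rows: [0, 1, 0], [1, 0, 1], [0, 0, 0] (not symmetric)
print(np.maximum(A, A.T))  # symmetrized version: [0, 1, 0], [1, 0, 1], [0, 1, 0]
print(gaussian_weights(X, sigma=1.0))
```

The degree matrix is the diagonal matrix of row sums, for example `np.diag(A.sum(axis=1))`.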

📗 The goal is to cut (partition) the graph (node set) \(V\) into \(C_{1}, C_{2}, ..., C_{K}\) in a way that:

➩ Minimizes the weight of the cut: \(\dfrac{1}{2} \displaystyle\sum_{k=1}^{K} \displaystyle\sum_{i \in C_{k}, j \notin C_{k}} w_{ij}\).

➩ Minimizes the normalized cut: \(\dfrac{1}{2} \displaystyle\sum_{k=1}^{K} \dfrac{1}{\displaystyle\sum_{i \in C_{k}} \text{deg}\left(i\right)} \displaystyle\sum_{i \in C_{k}, j \notin C_{k}} w_{ij}\). The cut cost is normalized by how big the cluster is to avoid cutting off small clusters.
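📗 The two objectives can be computed directly from a weight matrix and a cluster label for each node. This is a sketch with illustrative names, assuming a symmetric weight matrix in which every cluster has positive total degree.

```python
import numpy as np

def cut_weight(W, labels):
    """(1/2) * sum over clusters of the weight of edges leaving the cluster."""
    W, labels = np.asarray(W, dtype=float), np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        inside = labels == k
        total += W[np.ix_(inside, ~inside)].sum()
    return 0.5 * total

def normalized_cut(W, labels):
    """Same sum, with each cluster's term divided by the total degree of the cluster."""
    W, labels = np.asarray(W, dtype=float), np.asarray(labels)
    degree = W.sum(axis=1)
    total = 0.0
    for k in np.unique(labels):
        inside = labels == k
        total += W[np.ix_(inside, ~inside)].sum() / degree[inside].sum()
    return 0.5 * total

# Unweighted path graph 1 - 2 - 3 - 4 (example chosen for illustration).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
balanced, lopsided = [0, 0, 1, 1], [0, 1, 1, 1]
print(cut_weight(W, balanced), cut_weight(W, lopsided))          # 1.0 1.0
print(normalized_cut(W, balanced), normalized_cut(W, lopsided))  # about 0.333 and 0.6
```

Both partitions cut a single edge, so the plain cut cannot tell them apart, while the normalized cut prefers the balanced partition.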

In-class Discussion

📗 [1 points] Write down the adjacency matrix and degree matrix formed by :

➩ For K nearest neighbor, \(k\) = .

➩ For Gaussian weights, \(\sigma\) = .

📗 Answer (matrix with multiple lines, each line is a comma separated vector):

In-class Quiz

📗 [3 points] Compute the cut and normalized cut of the edge between nodes \(1\) and .

📗 Answer (comma separated vector):

# Graph Laplacian

📗 Laplacian in calculus is \(\Delta f = \displaystyle\sum_{i} \dfrac{\partial^2 f}{\partial x_{i}^2}\) that measures how the average value of the function around a point differs from the value at that point.

📗 Laplacian of a graph is \(L = D - A\) that measures how the average value of the neighbors differs from the value at that node.

➩ \(A\) is the adjacency matrix (or edge weight matrix).

➩ \(D\) is the diagonal matrix with node degrees (or \(D_{ii} = \displaystyle\sum_{j} A_{ij}\)) on the diagonal.

➩ An alternative is to use normalized Laplacian \(L = I - D^{- \dfrac{1}{2}} A D^{- \dfrac{1}{2}}\): Link.
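📗 Both Laplacians can be computed from the adjacency matrix in a few lines of NumPy. This is a sketch with illustrative function names. The normalized version assumes every node has positive degree.

```python
import numpy as np

def laplacian(A):
    """Unnormalized graph Laplacian L = D - A, with D the diagonal degree matrix."""
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))
    return D - A

def normalized_laplacian(A):
    """Normalized Laplacian L = I - D^(-1/2) A D^(-1/2)."""
    A = np.asarray(A, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))  # assumes no isolated nodes
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Path graph on three nodes: 1 - 2 - 3 (example chosen for illustration).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
print(laplacian(A))
# [[ 1. -1.  0.]
#  [-1.  2. -1.]
#  [ 0. -1.  1.]]
print(normalized_laplacian(A))
```

Every row of \(D - A\) sums to \(0\), which is a quick check when computing a Laplacian by hand.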

In-class Quiz

📗 [3 points] Compute the graph Laplacian of the following graph.

📗 Answer (matrix with multiple lines, each line is a comma separated vector):

# Spectral Clustering

📗 Compute Laplacian or normalized Laplacian \(L\).

📗 Sort the eigenvalues of \(L\) in descending order and take the eigenvectors of the bottom K non-zero eigenvalues (the K smallest non-zero ones): \(u_{1}, u_{2}, ..., u_{K}\).

📗 Set \(U\) to be \(n \times K\) matrix with columns \(\left[u_{1}, u_{2}, ..., u_{K}\right]\).

📗 Run k-means on the rows of \(U = \begin{bmatrix} x_{1} \\ x_{2} \\ ... \\ x_{n} \end{bmatrix}\).

➩ Spectral clustering is similar to PCA in that it reduces the dimensions of the original points (from \(m\) to \(K\)).

➩ Both use eigenvectors (PCA uses those of the K largest eigenvalues, spectral clustering uses those of the K smallest) but of different matrices (PCA uses the covariance matrix, spectral clustering uses the Laplacian): Link.
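📗 A sketch of the embedding steps (Laplacian, eigenvectors, matrix \(U\)) using NumPy and the unnormalized Laplacian. The function name, the tolerance used to decide which eigenvalues count as zero, and the example graph are assumptions made for illustration. The rows of the returned \(U\) are what k-means would then be run on.

```python
import numpy as np

def spectral_embedding(W, K, tol=1e-8):
    """Return the K smallest non-zero eigenvalues of L = D - W and their eigenvectors."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)     # ascending eigenvalues, eigenvectors as columns
    nonzero = np.flatnonzero(eigvals > tol)  # skip the (numerically) zero eigenvalues
    chosen = nonzero[:K]
    return eigvals[chosen], eigvecs[:, chosen]

# Weighted path 1 - 2 - 3 - 4 with a weak middle edge: two natural groups {1, 2} and {3, 4}.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
vals, U = spectral_embedding(W, K=2)
print(vals)      # approximately [0.095, 2.0]
print(U.shape)   # (4, 2): one K-dimensional row per node
print(U[:, 0])   # first column: one sign on nodes 1 and 2, the opposite sign on nodes 3 and 4
```

The overall sign of each eigenvector is arbitrary, so only the relative signs within a column are meaningful.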

📗 Comparison with other clustering algorithms: Link.

Math Note

📗 Intuitively, minimizing cut is related to minimizing \(u^\top L u\) subject to \(\left\|u\right\| = 1\).
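📗 One way to see the connection, sketched for a symmetric weight matrix \(A\) with \(D_{ii} = \displaystyle\sum_{j} A_{ij}\):

```latex
\begin{aligned}
u^\top L u &= u^\top D u - u^\top A u
  = \sum_{i} D_{ii} u_{i}^{2} - \sum_{i, j} A_{ij} u_{i} u_{j} \\
&= \frac{1}{2} \sum_{i, j} A_{ij} \left(u_{i}^{2} - 2 u_{i} u_{j} + u_{j}^{2}\right)
  = \frac{1}{2} \sum_{i, j} A_{ij} \left(u_{i} - u_{j}\right)^{2} .
\end{aligned}
```

If \(u_{i} = 1\) for nodes in one cluster and \(u_{i} = -1\) for nodes in the other, then \(\left(u_{i} - u_{j}\right)^{2} = 4\) for every edge across the two clusters and \(0\) otherwise, so \(u^\top L u\) equals \(4\) times the weight of the cut. Allowing \(u\) to take any real values with a fixed norm turns the minimization into an eigenvector problem for \(L\).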

# Questions?

📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail and use email only to the course instructor (not TAs) for grading issues.

# In-class Quiz Instructions

📗 To get full points on the in-class quizzes for a lecture:

➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.

➩ Some questions require [notes] to earn the point.

➩ Some questions require special ID (given during the lecture) to earn the point.

➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.

➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.

➩ The grade on Canvas Assignment Q16 is computed as number of points divided by the number of questions asked (out of 1) and updated on Canvas every weekend.

📗 If there are any issues with submission on the website, please use this Google form: Link.

📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).

📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content is generated using Copilot.

Last Updated: July 16, 2026 at 12:17 PM