Prev: L15, Next: L17

Zoom: Link, Piazza: Link, Google Form: Link.

Wisc ID for in-class quiz: (if your wisc email is "test@wisc.edu", please enter "test")
Token: (will be given during the lectures)



# Warning: this is a draft and will be updated one day before the lecture.


Slide:



# Unsupervised Learning

📗 Supervised learning: \(\left(x_{1}, y_{1}\right), \left(x_{2}, y_{2}\right), ..., \left(x_{n}, y_{n}\right)\).
📗 Unsupervised learning: \(\left(x_{1}\right), \left(x_{2}\right), ..., \left(x_{n}\right)\).
➩ Clustering: separates items into groups.
➩ Novelty (outlier) detection: finds items that are different (two groups).
➩ Dimensionality reduction: represents each item by a lower dimensional feature vector while maintaining key characteristics.
📗 Unsupervised learning applications:
➩ Google news.
➩ Google photo.
➩ Image segmentation.
➩ Text processing.
➩ Data visualization.
➩ Efficient storage.
➩ Noise removal.



# High Dimensional Data

📗 Text and image data are usually high dimensional: Link.
➩ The number of features of bag of words representation is the size of the vocabulary.
➩ The number of features of pixel intensity features is the number of pixels of the images.
📗 Dimensionality reduction is a form of unsupervised learning since it does not require labeled training data.



# Principal Component Analysis

📗 Principal component analysis rotates the axes (\(x_{1}, x_{2}, ..., x_{m}\)) so that the first \(K\) new axes (\(u_{1}, u_{2}, ..., u_{K}\)) capture the directions of the greatest variability of the training data. The new axes are called principal components: Link, Wikipedia.
➩ Find the direction of the greatest variability, \(u_{1}\).
➩ Find the direction of the greatest variability that is orthogonal (perpendicular) to \(u_{1}\), say \(u_{2}\).
➩ Repeat until there are \(K\) such directions \(u_{1}, u_{2}, ..., u_{K}\).
In-class Discussion ID:
📗 [1 points] Given the following dataset, find the direction (click on the diagram below to change the direction) in which the variation is the largest.

Projected variance:
Direction:
Projected points:
In-class Discussion ID:
📗 [1 points] Given the following dataset, find the direction (click on the diagram below to change the direction) in which the variation is the largest.

Projected variance:
Direction:
Projected points:



# Geometry

📗 A vector \(u_{k}\) is a unit vector if it has length 1: \(\left\|u_{k}\right\|^{2} = u^\top_{k} u_{k} = u_{k 1}^{2} + u_{k 2}^{2} + ... + u_{k m}^{2} = 1\).
📗 Two vectors \(u_{j}, u_{k}\) are orthogonal (or uncorrelated) if \(u^\top_{j} u_{k} = u_{j 1} u_{k 1} + u_{j 2} u_{k 2} + ... + u_{j m} u_{k m} = 0\).
📗 The projection of \(x_{i}\) onto a unit vector \(u_{k}\) is \(\left(u^\top_{k} x_{i}\right) u_{k} = \left(u_{k 1} x_{i 1} + u_{k 2} x_{i 2} + ... + u_{k m} x_{i m}\right) u_{k}\) (it is a number \(u^\top_{k} x_{i}\) multiplied by a vector \(u_{k}\)). Since \(u_{k}\) is a unit vector, the length of the projection is \(u^\top_{k} x_{i}\).
Math Note
📗 The dot product between two vectors \(a = \left(a_{1}, a_{2}, ..., a_{m}\right)\) and \(b = \left(b_{1}, b_{2}, ..., b_{m}\right)\) is usually written as \(a \cdot b = a^\top b = \begin{bmatrix} a_{1} & a_{2} & ... & a_{m} \end{bmatrix} \begin{bmatrix} b_{1} \\ b_{2} \\ ... \\ b_{m} \end{bmatrix} = a_{1} b_{1} + a_{2} b_{2} + ... + a_{m} b_{m}\). In this course, to avoid confusion with scalar multiplication, the notation \(a^\top b\) will be used instead of \(a \cdot b\).
📗 If \(x_{i}\) is projected onto some vector \(u_{k}\) that is not a unit vector, then the formula for projection is \(\left(\dfrac{u^\top_{k} x_{i}}{u^\top_{k} u_{k}}\right) u_{k}\). Since for unit vector \(u_{k}\), \(u^\top_{k} u_{k} = 1\), the two formulas are equivalent.
In-class Discussion
📗 [1 points] Compute the projection of the red vector onto the blue vector (drag the tips of the red or blue arrow, the green arrow represents the projection).

Red vector: , blue vector: .
Unit red vector: , unit blue vector: .
Projection: , length of projection: .



# Statistics

📗 The (unbiased) estimate of the variance of \(x_{1}, x_{2}, ..., x_{n}\) in one dimensional space (\(m = 1\)) is \(\dfrac{1}{n - 1} \left(\left(x_{1} - \mu\right)^{2} + \left(x_{2} - \mu\right)^{2} + ... + \left(x_{n} - \mu\right)^{2}\right)\), where \(\mu\) is the estimate of the mean (average) or \(\mu = \dfrac{1}{n} \left(x_{1} + x_{2} + ... + x_{n}\right)\). The maximum likelihood estimate has \(\dfrac{1}{n}\) instead of \(\dfrac{1}{n-1}\).
📗 In higher dimensional space, the estimate of the variance is \(\dfrac{1}{n - 1} \left(\left(x_{1} - \mu\right)\left(x_{1} - \mu\right)^\top + \left(x_{2} - \mu\right)\left(x_{2} - \mu\right)^\top + ... + \left(x_{n} - \mu\right)\left(x_{n} - \mu\right)^\top\right)\). Note that \(\mu\) is an \(m\) dimensional vector, and each of the \(\left(x_{i} - \mu\right)\left(x_{i} - \mu\right)^\top\) is an \(m\) by \(m\) matrix, so the resulting variance estimate is a matrix called variance-covariance matrix.
📗 If \(\mu = 0\), then the projected variance of \(x_{1}, x_{2}, ..., x_{n}\) in the direction \(u_{k}\) can be computed by \(u^\top_{k} \Sigma u_{k}\) where \(\Sigma = \dfrac{1}{n - 1} X^\top X\), and \(X\) is the data matrix where row \(i\) is \(x_{i}\).
➩ If \(\mu \neq 0\), then \(X\) should be centered, that is, the mean of each column should be subtracted from each column.
Math Note
📗 The projected variance formula can be derived by \(u^\top_{k} \Sigma u_{k} = \dfrac{1}{n - 1} u^\top_{k} X^\top X u_{k} = \dfrac{1}{n - 1} \left(\left(u^\top_{k} x_{1}\right)^{2} + \left(u^\top_{k} x_{2}\right)^{2} + ... + \left(u^\top_{k} x_{n}\right)^{2}\right)\) which is the estimate of the variance of the projection of the data in the \(u_{k}\) direction.



# Principal Component Analysis

📗 The goal is to find the direction that maximizes the projected variance: \(\displaystyle\max_{u_{k}} u^\top_{k} \Sigma u_{k}\) subject to \(u^\top_{k} u_{k} = 1\).
➩ This constrained maximization problem has solution (local maxima) \(u_{k}\) that satisfies \(\Sigma u_{k} = \lambda u_{k}\), and by definition of eigenvalues, \(u_{k}\) is the eigenvector corresponding to the eigenvalue \(\lambda\) for the matrix \(\Sigma\): Wikipedia.
➩ At a solution, \(u^\top_{k} \Sigma u_{k} = u^\top_{k} \lambda u_{k} = \lambda u^\top_{k} u_{k} = \lambda\), which means, the larger the \(\lambda\), the larger the variability in the direction of \(u_{k}\).
➩ Therefore, if all eigenvalues of \(\Sigma\) are computed and sorted \(\lambda_{1} \geq \lambda_{2} \geq ... \geq \lambda_{m}\), then the corresponding eigenvectors are the principal components: \(u_{1}\) is the first principal component corresponding to the direction of the largest variability; \(u_{2}\) is the second principal component corresponding to the direction of the second largest variability orthogonal to \(u_{1}\), ...
In-class Quiz (Past Exam Question) ID:
📗 [3 points] Given the variance matrix \(\hat{\Sigma}\) = , what is the first principal component? Enter a unit vector.
📗 Answer (comma separated vector): .




# Number of Dimensions

📗 There are a few ways to choose the number of principal components \(K\).
➩ \(K\) can be selected based on prior knowledge or requirement (for example, \(K = 2, 3\) for visualization tasks).
➩ \(K\) can be the number of non-zero eigenvalues.
➩ \(K\) can be the number of eigenvalues that are larger than some threshold.



# Reduced Feature Space

📗 An original item is in the \(m\) dimensional feature space: \(x_{i} = \left(x_{i 1}, x_{i 2}, ..., x_{i m}\right)\).
📗 The new item is in the \(K\) dimensional space with basis \(u_{1}, u_{2}, ..., u_{k}\) has coordinates equal to the projected lengths of the original item: \(\left(u^\top_{1} x_{i}, u^\top_{2} x_{i}, ..., u^\top_{k} x_{i}\right)\).
📗 Other supervised learning algorithms can be applied on the new features.
In-class Quiz (Past Exam Question) ID:
📗 [2 points] You performed PCA (Principal Component Analysis) in \(\mathbb{R}^{3}\). If the first principal component is \(u_{1}\) = \(\approx\) and the second principal component is \(u_{2}\) = \(\approx\) . What is the new 2D coordinates (new features created by PCA) for the point \(x\) = ?

📗 In the diagram, the black axes are the original axes, the green axes are the PCA axes, the red vector is \(x\), the red point is the reconstruction \(\hat{x}\) using the PCA axes.
📗 Answer (comma separated vector): .




# Reconstruction

📗 The original item can be reconstructed using the principal components. If all \(m\) principal components are used, then the original item can be perfectly reconstructed: \(x_{i} = \left(u^\top_{1} x_{i}\right) u_{1} + \left(u^\top_{2} x_{i}\right) u_{2} + ... + \left(u^\top_{m} x_{i}\right) u_{m}\).
📗 The original item can be approximated by the first \(K\) principal components: \(x_{i} \approx \left(u^\top_{1} x_{i}\right) u_{1} + \left(u^\top_{2} x_{i}\right) u_{2} + ... + \left(u^\top_{K} x_{i}\right) u_{K}\).
➩ Eigenfaces are eigenvectors of face images: every face can be written as a linear combination of eigenfaces. The first \(K\) eigenfaces and their coefficients can be used to determine and reconstruct specific faces: Link, Wikipedia.
Example
📗 An example of 111 eigenfaces:

📗 Reconstruction examples:
K Face 1 Face 2
1
Face
Face
20
Face
Face
40
Face
Face
50
Face
Face
80
Face
Face

➩ Images by wellecks



# Non-linear PCA

📗 PCA features are linear in the original features.
📗 The kernel trick (used in support vector machines) can be used to first produce new features that are non-linear in the original features, then PCA can be applied to the new features: Wikipedia
➩ Dual optimization techniques can be used for kernels such as the RBF kernel which produces infinite number of new features.
📗 A neural network with both input and output being the features (no soft-max layer) is can be used for non-linear dimensionality reduction, called an auto-encoder: Wikipedia.
➩ The hidden units (\(K < m\) units, fewer than the number of input features) can be considered a lower dimensional representation of the original features.



# Graph-Based Clustering

📗 Given a graph \(G = \left(V, E\right)\):
➩ The vertices are items.
➩ The edges encode similarity between items, for example, k-nearest neighbor graph (unweighted, \(w_{ij} = 1\) if \(i\) is one of the k nearest neighbors of \(j\) and \(0\) otherwise), or fully connected similarity graph (weighted, \(w_{ij} = \exp\left(- \dfrac{\left\|x_{i} - x_{j}\right\|^{2}}{2 \sigma^{2}}\right)\)).
📗 The goal is to cut (partition) the graph (node set) \(V\) into \(C_{1}, C_{2}, ..., C_{K}\) in a way that:
➩ Minimizes the weight of the cut: \(\dfrac{1}{2} \displaystyle\sum_{k=1}^{K} \displaystyle\sum_{i \in C_{k}, j \notin C_{k}} w_{ij}\).
➩ Minimizes the normalized cut: \(\dfrac{1}{2} \displaystyle\sum_{k=1}^{K} \dfrac{1}{\displaystyle\sum_{i \in C_{k}} \text{deg}\left(i\right)} \displaystyle\sum_{i \in C_{k}, j \notin C_{k}} w_{ij}\). The cut cost is normalized by how big the cluster is to avoid cutting off small clusters.
In-class Discussion ID:
📗 [1 points] Write down the adjacency matrix and degree matrix formed by :

➩ For K nearest neighbor, \(k\) = .
➩ For Gaussian weights, \(\sigma\) = .
In-class Quiz ID:
📗 [3 points] Compute the cut and normalized cut of the edge between nodes \(1\) and .

📗 Answer (comma separated vector):




# Graph Laplacian

📗 Laplacian in calculus is \(\Delta f = \displaystyle\sum_{i} \dfrac{\partial^2 f}{\partial x_{i}^2}\) that measures how the average value of the function around a point differs from the value at that point.
📗 Laplacian of a graph is \(L = D - A\) that measures how the average value of the neighbors differs from the value at that node.
➩ \(A\) is the adjacency matrix (or edge weight matrix).
➩ \(D\) is the diagonal matrix with node degrees (or \(D_{ii} = \displaystyle\sum_{j} A_{ij}\)) on the diagonal.
➩ An alternative is to use normalized Laplacian \(L = I - D^{- \dfrac{1}{2}} A D^{- \dfrac{1}{2}}\): Link.
In-class Quiz ID:
📗 [3 points] Compute the graph Laplacian of the following graph?

📗 Answer (matrix with multiple lines, each line is a comma separated vector):




# Spectral Clustering

📗 Compute Laplacian or normalized Laplacian \(L\).
📗 Sort eigenvalues in descending order and bottom K non-zero eigenvectors of \(L\): \(u_{1}, u_{2}, ..., u_{K}\).
📗 Set \(U\) to be \(n \times K\) matrix with columns \(\left[u_{1}, u_{2}, ..., u_{K}\right]\).
📗 Run k-means on the rows of \(U = \begin{bmatrix} x_{1} \\ x_{2} \\ ... \\ x_{n} \end{bmatrix}\).
➩ Spectral clustering is similar to PCA in that it reduces the dimensions of the original points (from \(m\) to \(K\)).
➩ Both use eigenvalues (PCA uses the largest K, Spectral Clustering uses the smallest k) but on different matrices (PCA uses covariance matrix, Spectral Clustering uses Laplacian): Link.
📗 Comparison with other clustering algorithms: Link.
Math Note
📗 Intuitively, minimizing cut is related to minimizing \(u^\top L u\) subject to \(\left\|u\right\| = 1\).



# T-distributed Stochastic Neighbor Embedding

📗 A common application of dimensionality reduction is for visualization into \(K = 2, 3\) dimensions.
📗 T-SNE (T-distributed Stochastic Neighbor Embedding) converts high dimensional vectors into \(K = 2, 3\) dimensonal vectors by:
➩ Turn vectors into probability pairs.
➩ Turn pairs back into low dimensional vectors.
📗 Examples: Link, Link, Link.



# Probability of Neighbor

📗 The similarity between \(x_{i}\) and \(x_{j}\) is measured by the probability that \(x_{i}\) would pick \(x_{j}\) as its neighbor under a Gaussian probability.
➩ \(p_{ij} = \dfrac{1}{2 n} \left(p_{j | i} + p_{i | j}\right)\), where,
➩ \(p_{i | j} = \dfrac{\exp\left(- \dfrac{\left\|x_{i} - x_{j}\right\|^{2}}{2 \sigma_{j}^{2}}\right)}{\displaystyle\sum_{k \neq j} \exp\left(- \dfrac{\left\|x_{i} - x_{k}\right\|^{2}}{2 \sigma_{j}^{2}}\right)}\) and \(p_{j | i} = \dfrac{\exp\left(- \dfrac{\left\|x_{i} - x_{j}\right\|^{2}}{2 \sigma_{i}^{2}}\right)}{\displaystyle\sum_{k \neq i} \exp\left(- \dfrac{\left\|x_{i} - x_{k}\right\|^{2}}{2 \sigma_{i}^{2}}\right)}\).
📗 A parameter called perplexity is used to determine \(\sigma_{i}\).
➩ Small perplexity = small \(\sigma_{i}\) = local neighbors.
➩ Large perplexity = large \(\sigma_{i}\) = global neighbors.
➩ Perplexity measures the "effective" number of neighbors.



# Low Dimensional Space

📗 Place points randomly in \(K = 2, 3\) dimensional space.
📗 Use T distribution (1 degree of freedom) to measure similarity:
➩ \(q_{ij} = \dfrac{\left(1 + \left\|y_{i} - y_{j}\right\|^{2}\right)^{-1}}{\displaystyle\sum_{k \neq l} \left(1 + \left\|y_{i} - y_{j}\right\|^{2}\right)^{-1}}\).



# Kullback Leibler Divergence

📗 Gaussian distribution is used in the original space (light tail).
📗 T distribution is used in the low dimensional space (heavy tail, so that points are not crowded).
📗 The objective is to choose the new points \(y_{i}\) so that \(D = \displaystyle\sum_{ij} p_{ij} \log \left(\dfrac{p_{ij}}{q_{ij}}\right)\) is minimized.
➩ \(D\) is called the KL (Kullback Leibler) Divergence and measures the similarity between two distributions (similar to cross entropy).
📗 Gradient descent can be used for minimization.




# Questions?

📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail and use email only to the course instructor (not TAs) for grading issues.
Additional In-class Discussion
📗 Sometimes a question not in the notes will be asked during the lecture, you can submit your answer here:

Notes (not visible to other students):


Submit your answer to see other students answers (click the submit button to refresh): 

Additional In-class Quiz
📗 Sometimes a question not in the notes will be asked during the lecture, you can submit your answer here:
A.
B.
C.
D.
E.
Notes (not visible to other students):


Submit your answer to see other students answers (click the submit button to refresh): 






# In-class Quiz Instructions

📗 To get full points on the in-class quizzes for a lecture:
➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.
➩ Some questions require [notes] to earn the point.
➩ Some questions require special ID (given during the lecture) to earn the point.
➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.
➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.
➩ The grade on Canvas Assignment Q16 is computed as number of points divided by the number of questions asked (out of 1) and updated on Canvas every weekend.
📗 If there are any issues with submission on the website, please use this Google form: Link.
📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).
📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content are generated using Copilot .

Prev: L15, Next: L17





Last Updated: June 25, 2026 at 12:40 AM