📗 If the groups are continuous (lower dimensional representation): dimensionality reduction
➩ The output of unsupervised learning can be used as input for supervised learning too (discrete groups as categorical features and continuous groups as continuous features).
📗 The distance between points can be measured by norms; for example, the distance between items \(x_{1} = \left(x_{11}, x_{12}, ..., x_{1m}\right)\) and \(x_{2} = \left(x_{21}, x_{22}, ..., x_{2m}\right)\) can be:
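As a sketch, a few common norm-based distances can be computed with NumPy (the two points are hypothetical examples):

```python
import numpy as np

# Hypothetical points with m = 3 features each.
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 3.0])

# Euclidean (L2) distance: square root of the sum of squared differences.
d2 = np.linalg.norm(x1 - x2)

# Manhattan (L1) distance: sum of absolute differences.
d1 = np.linalg.norm(x1 - x2, ord=1)

# Chebyshev (L-infinity) distance: largest absolute difference.
dinf = np.linalg.norm(x1 - x2, ord=np.inf)

print(d2, d1, dinf)
```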
📗 If average linkage distance (linkage = "average") is used, then the distance between two clusters is defined as the average distance over every pair of points, one from each cluster.
➩ This requires recomputing the distances between clusters in every iteration and can be very slow.
📗 If single linkage distance (linkage = "single") is used, then the distance between two clusters is defined as the smallest distance over any pair of points, one from each cluster.
📗 If complete linkage distance (linkage = "complete") is used, then the distance between two clusters is defined as the largest distance over any pair of points, one from each cluster.
➩ With single or complete linkage distances, pairwise distances between points only have to be computed once at the beginning, so clustering is typically faster.
📗 Since single linkage distance finds the nearest neighbors, it tends to produce chain-like clusters in which consecutive points are close to each other.
📗 Since complete linkage distance finds the farthest neighbors, it tends to produce compact, blob-like clusters (for example, circles) in which all points are close to a common center.
➩ The choice usually depends on the application.
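As a minimal sketch, the three linkage distances between two small clusters can be read off the pairwise distance matrix (the two clusters below are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters, one point per row.
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[3.0, 0.0], [5.0, 0.0]])

# Pairwise Euclidean distances, shape (2, 2): d[i, j] = ||a[i] - b[j]||.
d = cdist(a, b)

single = d.min()    # single linkage: smallest pairwise distance
complete = d.max()  # complete linkage: largest pairwise distance
average = d.mean()  # average linkage: mean over all pairs

print(single, complete, average)
```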
Comparison Example
➩ Compare single and complete linkage clustering on the circles and moons datasets.
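A sketch of this comparison on the circles dataset, using sklearn's `make_circles` and `AgglomerativeClustering` (dataset parameters are illustrative): single linkage can follow each ring as a chain, while complete linkage tends to cut the data into blob-shaped halves.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering

# Two concentric circles with a large gap between the rings.
x, y = make_circles(n_samples=200, factor=0.3, noise=0.02, random_state=0)

single = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(x)
complete = AgglomerativeClustering(n_clusters=2, linkage="complete").fit_predict(x)

def agreement(pred, truth):
    # Fraction of points matching the true rings; cluster labels may be
    # swapped relative to the truth, so check both orientations.
    return max(np.mean(pred == truth), np.mean(pred != truth))

print("single:", agreement(single, y), "complete:", agreement(complete, y))
```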
📗 Another clustering method is K-means clustering: Link.
(0) Start with \(K\) random centers (also called centroids) \(\mu_{1}, \mu_{2}, ..., \mu_{K}\).
(1) Assign step: assign each point (item) to its closest center \(k\), and label it \(k\).
(2) Center step: update center \(\mu_{k}\) to be the center of the points labeled \(k\).
(3) Repeat until cluster centers do not change.
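The steps above can be sketched in NumPy (function and variable names are illustrative; the sketch assumes no cluster becomes empty during the iterations):

```python
import numpy as np

def kmeans(x, k, iters=100, seed=0):
    """Minimal K-means sketch: x has one point per row."""
    rng = np.random.default_rng(seed)
    # (0) start with k distinct data points as initial centers
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # (1) assign step: label each point with its nearest center
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # (2) center step: move each center to the mean of its points
        # (assumes every cluster keeps at least one point)
        new_centers = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        # (3) stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Usage on a tiny hypothetical dataset with two obvious clusters.
x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(x, 2)
```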
📗 The objective of K means clustering is minimizing the total distortion, also called inertia, the sum of distances (usually squared Euclidean distances) from the points to their centers, or \(\displaystyle\sum_{i=1}^{n} \left\|x_{i} - \mu_{k\left(x_{i}\right)}\right\|^{2}\) = \(\displaystyle\sum_{i=1}^{n} \displaystyle\sum_{j=1}^{m} \left(x_{ij} - \mu_{k\left(x_{i}\right)j}\right)^{2}\), where \(k\left(x_{i}\right)\) is the cluster index of the cluster closest to \(x_{i}\), or \(k\left(x_{i}\right) = \mathop{\mathrm{argmin}}_{k} \left\|x_{i} - \mu_{k}\right\|\).
➩ K means is initialized at a random clustering, and each assign-center step acts as a gradient descent step for minimizing the total distortion over the choice of cluster centers.
📗 The number of clusters is usually chosen based on application requirements, since there is no single optimal number of clusters.
➩ If the number of clusters is \(n\) (each point is in its own cluster), then the total distortion is 0, so minimizing the total distortion is not a good way to select the number of clusters.
➩ The elbow method is sometimes used to choose the number of clusters based on the total distortion, but it is not a precisely defined algorithm: Link.
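A sketch of the elbow method using sklearn's `KMeans` and its `inertia_` attribute (the synthetic dataset and its parameters are illustrative): plot the total distortion for a range of K and look for the value where the curve stops dropping sharply.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with 4 well-separated blobs, so the "elbow"
# in the distortion curve is expected near K = 4.
x, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=0)

# Total distortion (inertia) for K = 1..8.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(x).inertia_
            for k in range(1, 9)]
print([round(v, 1) for v in inertias])
```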
Economic Data Example Again
➩ Apply 5-means clustering on the economic data for the US states.
➩ Compare K-means with different values of K: the "elbow method" seems to suggest around 4 to 6 clusters.
📗 Notes and code adapted from the course taught by Professors Gurmail Singh, Yiyin Shen, Tyler Caraza-Harter.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link