
# Lecture

📗 The lecture is in person, but you can also join on Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekend.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code: 741565), but you can also submit your answers through the Form at the end of the lectures.
📗 The Python notebooks used during the lectures can also be found on GitHub. They will be updated weekly.


# Lecture Notes

📗 Unsupervised Learning
➭ If the groups are discrete: clustering
➭ If the groups are continuous (lower dimensional representation): dimensionality reduction
➭ The output of unsupervised learning can be used as input for supervised learning too (discrete groups as categorical features and continuous groups as continuous features).

| Item | Input (Features) | Output (Label) |
| --- | --- | --- |
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | no label |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | - |
| ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | similar \(x\) in the same or similar groups |


US States Economic Data Example ➭ US economics data can be found on Link.
➭ Map data can be found on Link.
➭ Use the features "real per capita personal income", "real per capita personal consumption expenditures", and "regional price parities".
➭ Code for clustering: Notebook.
➭ Note: see pivot for the correct way of working with panel data: Doc.
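A minimal sketch of the pivot step, assuming the economic data has been downloaded to a hypothetical local file state_economics.csv in long format with columns State, Measure, and Value (the actual BEA download may use different column names):

```python
# Minimal sketch: reshape long-format panel data so each state is one row
# and each economic measure is one feature column, then keep the three features.
# The file name and column names below are hypothetical stand-ins.
import pandas as pd

df = pd.read_csv("state_economics.csv")  # long format: one row per (State, Measure)
X = df.pivot(index="State", columns="Measure", values="Value")
features = ["real per capita personal income",
            "real per capita personal consumption expenditures",
            "regional price parities"]
X = X[features]
print(X.head())
```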



📗 Hierarchical Clustering
➭ Hierarchical clustering starts with \(n\) clusters and iteratively merges the two closest clusters: Link.
➭ It is also called agglomerative clustering, and can be performed using sklearn.cluster.AgglomerativeClustering: Doc.
➭ Different ways of defining the distance between two clusters are called different linkages: scipy.cluster.hierarchy.linkage: Doc.
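A minimal sketch of agglomerative clustering with sklearn; the small 2D dataset is made up for illustration:

```python
# Minimal sketch of hierarchical (agglomerative) clustering in sklearn
# on a small made-up 2D dataset.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [6.0, 8.5], [1.2, 0.5], [9.0, 11.0]])

# n_clusters: stop merging once this many clusters remain;
# linkage: how the distance between two clusters is defined (see the linkage notes below);
# the point-to-point distance can be changed with the metric argument
# (called affinity in older sklearn versions).
model = AgglomerativeClustering(n_clusters=2, linkage="complete")
labels = model.fit_predict(X)
print(labels)  # one cluster label (0 or 1) per point
```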

📗 Distance Measure
➭ The distance between points can be measured by norms. The distance between items \(x_{1} = \left(x_{11}, x_{12}, ..., x_{1m}\right)\) and \(x_{2} = \left(x_{21}, x_{22}, ..., x_{2m}\right)\) can be:
(1) Manhattan distance (metric = "manhattan"): \(\left| x_{11} - x_{21} \right| + \left| x_{12} - x_{22} \right| + ... + \left| x_{1m} - x_{2m} \right|\), Link,
(2) Euclidean distance (metric = "euclidean"): \(\sqrt{\left(x_{11} - x_{21}\right)^{2} + \left(x_{12} - x_{22}\right)^{2} + ... + \left(x_{1m} - x_{2m}\right)^{2}}\),
(3) Cosine similarity distance (metric = "cosine"): \(1 - \dfrac{x^\top_{1} x_{2}}{\sqrt{x^\top_{1} x_{1}} \sqrt{x^\top_{2} x_{2}}}\).
...
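A minimal sketch of the three distances above, computed by hand with numpy and checked against sklearn.metrics.pairwise_distances (the two example points are made up):

```python
# Minimal sketch: Manhattan, Euclidean, and cosine distances between two points,
# by hand and via sklearn.metrics.pairwise_distances.
import numpy as np
from sklearn.metrics import pairwise_distances

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 4.0])

manhattan = np.sum(np.abs(x1 - x2))                              # sum of absolute differences
euclidean = np.sqrt(np.sum((x1 - x2) ** 2))                      # square root of sum of squares
cosine = 1 - (x1 @ x2) / (np.sqrt(x1 @ x1) * np.sqrt(x2 @ x2))   # 1 minus cosine similarity

X = np.vstack([x1, x2])
for name, by_hand in [("manhattan", manhattan), ("euclidean", euclidean), ("cosine", cosine)]:
    print(name, by_hand, pairwise_distances(X, metric=name)[0, 1])  # the two values should match
```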



📗 Average Linkage Distance
➭ If average linkage distance (linkage = "average") is used, then the distance between two clusters is defined as the average of the distances between all pairs of points, one from each cluster.
➭ This requires combining many pairwise distances for every pair of clusters in every iteration and can be slow.

📗 Single and Complete Linkage Distance
➭ If single linkage distance (linkage = "single") is used, then the distance between two clusters is defined as the smallest distance between any pair of points, one from each cluster.
➭ If complete linkage distance (linkage = "complete") is used, then the distance between two clusters is defined as the largest distance between any pair of points, one from each cluster.
➭ With single or complete linkage distances, the pairwise distances between points only have to be computed once at the beginning, so clustering is typically faster.

📗 Single vs Complete Linkage
➭ Since single linkage distance merges clusters based on nearest neighbors, it tends to produce clusters that look like chains, in which consecutive points are close to each other but the endpoints may be far apart.
➭ Since complete linkage distance merges clusters based on farthest neighbors, it tends to produce compact clusters that look like blobs (for example, circles) in which all points are close to a center.
➭ The choice usually depends on the application. 

Comparison Example ➭ Compare single and complete linkage clustering on the circles and moons datasets.
➭ Code for clustering: Notebook.
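A minimal sketch of the comparison, using sklearn's synthetic circles and moons generators (the dataset sizes and noise levels are illustrative choices, not necessarily those used in the lecture notebook):

```python
# Minimal sketch: single vs complete linkage on the circles and moons datasets,
# scored against the true groups with the adjusted Rand index.
from sklearn.datasets import make_circles, make_moons
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

datasets = {
    "circles": make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0),
    "moons": make_moons(n_samples=500, noise=0.05, random_state=0),
}

for name, (X, y) in datasets.items():
    for linkage in ["single", "complete"]:
        labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
        print(name, linkage, round(adjusted_rand_score(y, labels), 3))  # 1 = perfect recovery
```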



📗 Number of Clusters
➭ The number of clusters is usually chosen based on application requirements, since there is no universally optimal number of clusters.
➭ If the number of clusters is not specified, the algorithm can output the full clustering tree, called a dendrogram, which can be plotted using scipy.cluster.hierarchy.dendrogram: Doc.
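A minimal sketch of building the full tree and plotting the dendrogram with scipy (the small dataset and labels are made up):

```python
# Minimal sketch: build the full merge tree with scipy and plot the dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0]])

Z = linkage(X, method="complete", metric="euclidean")  # the merge history (n - 1 merges)
dendrogram(Z, labels=["a", "b", "c", "d", "e"])        # heights show the distance at each merge
plt.ylabel("cluster distance")
plt.show()
```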

📗 Comparison
➭ Since the labeling of clusters is arbitrary, two clusterings whose labels are permutations of each other should be considered the same clustering.
➭ Rand index is one measure of similarity between clusterings.
sklearn.metrics.rand_score(y1, y2) computes the similarity between clustering y1 and clustering y2, given by two lists of labels: Doc.
➭ To compute the Rand index, loop through all pairs of items and count the number of pairs on which the two clusterings agree (both put the pair in the same cluster, or both put the pair in different clusters), then divide by the total number of pairs.
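A minimal sketch of this pairwise computation, checked against sklearn.metrics.rand_score on two made-up label lists:

```python
# Minimal sketch: Rand index as the fraction of pairs on which two clusterings agree,
# compared with sklearn.metrics.rand_score.
from itertools import combinations
from sklearn.metrics import rand_score

y1 = [0, 0, 1, 1, 2, 2]  # one clustering
y2 = [1, 1, 0, 0, 0, 2]  # another clustering (the label values themselves do not matter)

agree, total = 0, 0
for i, j in combinations(range(len(y1)), 2):
    same1 = y1[i] == y1[j]        # same cluster under clustering 1?
    same2 = y2[i] == y2[j]        # same cluster under clustering 2?
    agree += int(same1 == same2)  # agree if both "same" or both "different"
    total += 1

print(agree / total, rand_score(y1, y2))  # the two numbers should match
```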

📗 Adjusted Rand Index
➭ Rand index is a similarity score between 0 and 1, where 1 represents a perfect match: the clustering labels are permutations of each other.
➭ The meaning of 0 is not clear, and even unrelated random labelings typically score well above 0.
➭ Adjusted Rand index corrects for this so that a score around 0 represents random labeling.
sklearn.metrics.adjusted_rand_score(y1, y2) computes the similarity between clustering y1 and clustering y2, given by two lists of labels: Doc.
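A minimal sketch contrasting the two scores on independent random labelings (the sample size and number of labels are arbitrary choices):

```python
# Minimal sketch: the plain Rand index is well above 0 even for unrelated random
# clusterings, while the adjusted Rand index is close to 0.
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)
y1 = rng.integers(0, 3, size=1000)  # random clustering with 3 labels
y2 = rng.integers(0, 3, size=1000)  # an independent random clustering

print(rand_score(y1, y2), adjusted_rand_score(y1, y2))
```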




📗 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: April 29, 2024 at 1:10 AM