
# Lecture

📗 The lecture is in person, but you can also join on Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekend.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code: 741565), but you can also submit your answers through the Form at the end of the lectures.
📗 The Python notebooks used during the lectures can also be found on GitHub. They will be updated weekly.


# Lecture Notes

📗 Unsupervised Learning
➭ If the groups are discrete: clustering
➭ If the groups are continuous (lower dimensional representation): dimensionality reduction
➭ The output of unsupervised learning can be used as input for supervised learning too (discrete groups as categorical features and continuous groups as continuous features).

| Item | Input (Features) | Output (Label) |
| --- | --- | --- |
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | no label |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | - |
| ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | similar \(x\) in the same or similar groups |


US States Economic Data Example ➭ US economics data can be found on Link.
➭ Map data can be found on Link.
➭ Use the features "real per capita personal income", "real per capita personal consumption expenditures", and "regional price parities".
➭ Code for clustering: Notebook.
➭ Note: see pivot for the correct way of working with panel data: Doc.
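A minimal sketch of the pivot step, assuming the economic data has been downloaded to a hypothetical local file state_economics.csv in long format with columns State, Measure, and Value (the actual BEA download may use different column names):

```python
# Minimal sketch: reshape long-format panel data so each state is one row
# and each economic measure is one feature column, then keep the three features.
# The file name and column names below are hypothetical stand-ins.
import pandas as pd

df = pd.read_csv("state_economics.csv")  # long format: one row per (State, Measure)
X = df.pivot(index="State", columns="Measure", values="Value")
features = ["real per capita personal income",
            "real per capita personal consumption expenditures",
            "regional price parities"]
X = X[features]
print(X.head())
```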



📗 Hierarchical Clustering
➭ Hierarchical clustering starts with \(n\) clusters and iteratively merges the two closest clusters: Link.
➭ It is also called agglomerative clustering, and can be performed using sklearn.cluster.AgglomerativeClustering: Doc.
➭ Different ways of defining the distance between two clusters are called different linkages: scipy.cluster.hierarchy.linkage: Doc.
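A minimal sketch of agglomerative clustering with sklearn; the small 2D dataset is made up for illustration:

```python
# Minimal sketch of hierarchical (agglomerative) clustering in sklearn
# on a small made-up 2D dataset.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [6.0, 8.5], [1.2, 0.5], [9.0, 11.0]])

# n_clusters: stop merging once this many clusters remain;
# linkage: how the distance between two clusters is defined (see the linkage notes below);
# the point-to-point distance can be changed with the metric argument
# (called affinity in older sklearn versions).
model = AgglomerativeClustering(n_clusters=2, linkage="complete")
labels = model.fit_predict(X)
print(labels)  # one cluster label (0 or 1) per point
```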

📗 Distance Measure
➭ The distance between points can be measured by norms. The distance between items \(x_{1} = \left(x_{11}, x_{12}, ..., x_{1m}\right)\) and \(x_{2} = \left(x_{21}, x_{22}, ..., x_{2m}\right)\) can be:
(1) Manhattan distance (metric = "manhattan"): \(\left| x_{11} - x_{21} \right| + \left| x_{12} - x_{22} \right| + ... + \left| x_{1m} - x_{2m} \right|\), Link,
(2) Euclidean distance (metric = "euclidean"): \(\sqrt{\left(x_{11} - x_{21}\right)^{2} + \left(x_{12} - x_{22}\right)^{2} + ... + \left(x_{1m} - x_{2m}\right)^{2}}\),
(3) Cosine similarity distance (metric = "cosine"): \(1 - \dfrac{x^\top_{1} x_{2}}{\sqrt{x^\top_{1} x_{1}} \sqrt{x^\top_{2} x_{2}}}\).
...
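A minimal sketch of the three distances above, computed by hand with numpy and checked against sklearn.metrics.pairwise_distances (the two example points are made up):

```python
# Minimal sketch: Manhattan, Euclidean, and cosine distances between two points,
# by hand and via sklearn.metrics.pairwise_distances.
import numpy as np
from sklearn.metrics import pairwise_distances

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 4.0])

manhattan = np.sum(np.abs(x1 - x2))                              # sum of absolute differences
euclidean = np.sqrt(np.sum((x1 - x2) ** 2))                      # square root of sum of squares
cosine = 1 - (x1 @ x2) / (np.sqrt(x1 @ x1) * np.sqrt(x2 @ x2))   # 1 minus cosine similarity

X = np.vstack([x1, x2])
for name, by_hand in [("manhattan", manhattan), ("euclidean", euclidean), ("cosine", cosine)]:
    print(name, by_hand, pairwise_distances(X, metric=name)[0, 1])  # the two values should match
```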



📗 Average Linkage Distance
➭ If average linkage distance (linkage = "average") is used, then the distance between two clusters is defined as the average of the distances between all pairs of points, one from each cluster.
➭ This requires combining many pairwise distances for every pair of clusters in every iteration and can be slow.

📗 Single and Complete Linkage Distance
➭ If single linkage distance (linkage = "single") is used, then the distance between two clusters is defined as the smallest distance between any pair of points, one from each cluster.
➭ If complete linkage distance (linkage = "complete") is used, then the distance between two clusters is defined as the largest distance between any pair of points, one from each cluster.
➭ With single or complete linkage distances, the pairwise distances between points only have to be computed once at the beginning, so clustering is typically faster.

📗 Single vs Complete Linkage
➭ Since single linkage distance merges clusters based on nearest neighbors, it tends to produce clusters that look like chains, in which consecutive points are close to each other but the endpoints may be far apart.
➭ Since complete linkage distance merges clusters based on farthest neighbors, it tends to produce compact clusters that look like blobs (for example, circles) in which all points are close to a center.
➭ The choice usually depends on the application. 

Comparison Example ➭ Compare single and complete linkage clustering on the circles and moons datasets.
➭ Code for clustering: Notebook.
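A minimal sketch of the comparison, using sklearn's synthetic circles and moons generators (the dataset sizes and noise levels are illustrative choices, not necessarily those used in the lecture notebook):

```python
# Minimal sketch: single vs complete linkage on the circles and moons datasets,
# scored against the true groups with the adjusted Rand index.
from sklearn.datasets import make_circles, make_moons
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

datasets = {
    "circles": make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0),
    "moons": make_moons(n_samples=500, noise=0.05, random_state=0),
}

for name, (X, y) in datasets.items():
    for linkage in ["single", "complete"]:
        labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
        print(name, linkage, round(adjusted_rand_score(y, labels), 3))  # 1 = perfect recovery
```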



📗 Number of Clusters
➭ The number of clusters is usually chosen based on application requirements, since there is no universally optimal number of clusters.
➭ If the number of clusters is not specified, the algorithm can output the full clustering tree, called a dendrogram, which can be plotted using scipy.cluster.hierarchy.dendrogram: Doc.
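A minimal sketch of building the full tree and plotting the dendrogram with scipy (the small dataset and labels are made up):

```python
# Minimal sketch: build the full merge tree with scipy and plot the dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0]])

Z = linkage(X, method="complete", metric="euclidean")  # the merge history (n - 1 merges)
dendrogram(Z, labels=["a", "b", "c", "d", "e"])        # heights show the distance at each merge
plt.ylabel("cluster distance")
plt.show()
```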

📗 Comparison
➭ Since the labeling of clusters is arbitrary, two clusterings whose labels are permutations of each other should be considered the same clustering.
➭ Rand index is one measure of similarity between clusterings.
sklearn.metrics.rand_score(y1, y2) computes the similarity between clustering y1 and clustering y2, given by two lists of labels: Doc.
➭ To compute the Rand index, loop through all pairs of items and count the number of pairs on which the two clusterings agree (both put the pair in the same cluster, or both put the pair in different clusters), then divide by the total number of pairs.
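A minimal sketch of this pairwise computation, checked against sklearn.metrics.rand_score on two made-up label lists:

```python
# Minimal sketch: Rand index as the fraction of pairs on which two clusterings agree,
# compared with sklearn.metrics.rand_score.
from itertools import combinations
from sklearn.metrics import rand_score

y1 = [0, 0, 1, 1, 2, 2]  # one clustering
y2 = [1, 1, 0, 0, 0, 2]  # another clustering (the label values themselves do not matter)

agree, total = 0, 0
for i, j in combinations(range(len(y1)), 2):
    same1 = y1[i] == y1[j]        # same cluster under clustering 1?
    same2 = y2[i] == y2[j]        # same cluster under clustering 2?
    agree += int(same1 == same2)  # agree if both "same" or both "different"
    total += 1

print(agree / total, rand_score(y1, y2))  # the two numbers should match
```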

📗 Adjusted Rand Index
➭ Rand index is a similarity score between 0 and 1, where 1 represents a perfect match: the clustering labels are permutations of each other.
➭ The meaning of 0 is not clear, and even unrelated random labelings typically score well above 0.
➭ Adjusted Rand index corrects for this so that a score around 0 represents random labeling.
sklearn.metrics.adjusted_rand_score(y1, y2) computes the similarity between clustering y1 and clustering y2, given by two lists of labels: Doc.
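A minimal sketch contrasting the two scores on independent random labelings (the sample size and number of labels are arbitrary choices):

```python
# Minimal sketch: the plain Rand index is well above 0 even for unrelated random
# clusterings, while the adjusted Rand index is close to 0.
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)
y1 = rng.integers(0, 3, size=1000)  # random clustering with 3 labels
y2 = rng.integers(0, 3, size=1000)  # an independent random clustering

print(rand_score(y1, y2), adjusted_rand_score(y1, y2))
```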




📗 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: April 29, 2024 at 1:10 AM