
# Lecture

📗 The lecture is in person, but you can also join on Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings; they will be moved to Kaltura over the weekend.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code: 741565), but you can also submit your answers through the Form at the end of the lectures.
📗 The Python notebooks used during the lectures can also be found on GitHub. They will be updated weekly.


# Lecture Notes

📗 Unsupervised Learning
➩ If the groups are discrete: clustering
➩ If the groups are continuous (lower dimensional representation): dimensionality reduction
➩ The output of unsupervised learning can be used as input for supervised learning too (discrete groups as categorical features and continuous groups as continuous features).

| Item | Input (Features) | Output (Label) |
|---|---|---|
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | no label |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | - |
| ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | similar \(x\) in the same or similar groups |


📗 US States Economic Data Example
➩ US economic data can be found on Link.
➩ Map data can be found on Link.
➩ Use the features "real per capita personal income", "real per capita personal consumption expenditures", and "regional price parities".
➩ Code for clustering: Notebook.
➩ Note: see pivot for the correct way of working with panel data: Doc.
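➩ A minimal sketch of the pivot step (not the course notebook): the file name and the column names "State", "Variable", "Year", and "Value" below are hypothetical and should be adjusted to match the actual download.

```python
# Hypothetical long-format panel data: one row per (state, variable, year) observation.
import pandas as pd

df = pd.read_csv("state_economic_data.csv")  # hypothetical file name

# Keep one year and pivot so each state is a row and each variable is a column,
# which is the shape (items x features) that clustering algorithms expect.
latest = df[df["Year"] == df["Year"].max()]
features = latest.pivot(index="State", columns="Variable", values="Value")

print(features.head())
```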

📗 Hierarchical Clustering
➩ Hierarchical clustering starts with \(n\) clusters and iteratively merges the two closest clusters: Link.
➩ It is also called agglomerative clustering, and can be performed using sklearn.cluster.AgglomerativeClustering: Doc.
➩ Different ways of defining the distance between two clusters are called different linkages: scipy.cluster.hierarchy.linkage: Doc.
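➩ A minimal sketch of agglomerative clustering on synthetic data, assuming a recent scikit-learn (1.2 or newer) where the distance argument is named metric:

```python
# Agglomerative clustering on a synthetic feature matrix (rows = items, columns = features).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # 50 items, 3 features

# Merge the closest clusters until 4 clusters remain.
model = AgglomerativeClustering(n_clusters=4, metric="euclidean", linkage="average")
labels = model.fit_predict(X)
print(labels[:10])  # cluster label (0-3) of the first 10 items
```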

📗 Distance Measure
➩ The distance between points can be measured by norms; for example, the distance between items \(x_{1} = \left(x_{11}, x_{12}, ..., x_{1m}\right)\) and \(x_{2} = \left(x_{21}, x_{22}, ..., x_{2m}\right)\) can be:
(1) Manhattan distance (metric = "manhattan"): \(\left| x_{11} - x_{21} \right| + \left| x_{12} - x_{22} \right| + ... + \left| x_{1m} - x_{2m} \right|\), Link,
(2) Euclidean distance (metric = "euclidean"): \(\sqrt{\left(x_{11} - x_{21}\right)^{2} + \left(x_{12} - x_{22}\right)^{2} + ... + \left(x_{1m} - x_{2m}\right)^{2}}\),
(3) Cosine distance (metric = "cosine"), which is one minus the cosine similarity: \(1 - \dfrac{x^\top_{1} x_{2}}{\sqrt{x^\top_{1} x_{1}} \sqrt{x^\top_{2} x_{2}}}\).
...
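➩ A small check of the three formulas against sklearn.metrics.pairwise_distances, using two arbitrary example points:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 4.0])

manhattan = np.sum(np.abs(x1 - x2))
euclidean = np.sqrt(np.sum((x1 - x2) ** 2))
cosine = 1 - (x1 @ x2) / (np.sqrt(x1 @ x1) * np.sqrt(x2 @ x2))

# Each hand-computed distance should match the library value.
for name, value in [("manhattan", manhattan), ("euclidean", euclidean), ("cosine", cosine)]:
    library = pairwise_distances([x1], [x2], metric=name)[0, 0]
    print(name, value, library)
```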

📗 Average Linkage Distance
➩ If average linkage distance (linkage = "average") is used, then the distance between two clusters is defined as the average of the distances between all pairs of points, one from each cluster.
➩ Computed naively, this requires averaging many pairwise distances after every merge, so average linkage can be slower than single or complete linkage.
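➩ A minimal sketch of the average linkage distance between two small made-up clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [5.0, 1.0]])

# All pairwise distances with one point from each cluster, then their average.
pairwise = cdist(cluster_a, cluster_b)
print("average linkage distance:", pairwise.mean())
```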

📗 Single and Complete Linkage Distance
➩ If single linkage distance (linkage = "single") is used, then the distance between two clusters is defined as the smallest distance between any pair of points, one from each cluster.
➩ If complete linkage distance (linkage = "complete") is used, then the distance between two clusters is defined as the largest distance between any pair of points, one from each cluster.
➩ With single or complete linkage, the pairwise distances between points only have to be computed once at the beginning, so clustering is typically faster.
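➩ Using the same kind of pairwise distance matrix, single and complete linkage are just the minimum and maximum entries (made-up clusters again):

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [5.0, 1.0]])

pairwise = cdist(cluster_a, cluster_b)  # distances between points, one from each cluster
print("single linkage distance:", pairwise.min())    # smallest pairwise distance
print("complete linkage distance:", pairwise.max())  # largest pairwise distance
```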

📗 Single vs Complete Linkage
➩ Since single linkage merges clusters based on their nearest points, it tends to produce chain-like clusters, in which consecutive points are close to each other even if the cluster as a whole is long and stretched.
➩ Since complete linkage merges clusters based on their farthest points, it tends to produce compact, blob-like clusters (for example, roughly circular ones), in which all points are close to a common center.
➩ The choice usually depends on the application. 

📗 Comparison Example
➩ Compare single and complete linkage clustering on the circles and moons datasets.
➩ Code for clustering: Notebook.
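➩ A hedged sketch (not the course notebook) of the comparison on the two-circles dataset, scored with the adjusted Rand index discussed below:

```python
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Two concentric rings; y marks which ring each point belongs to.
X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

for linkage in ["single", "complete"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(linkage, adjusted_rand_score(y, labels))

# Single linkage typically recovers the two rings (chain-like clusters),
# while complete linkage tends to split the data into two blob-like halves.
```

➩ The same comparison can be repeated with sklearn.datasets.make_moons.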

📗 Number of Clusters
➩ The number of clusters is usually chosen based on application requirements, since there is no universally optimal number of clusters.
➩ If the number of clusters is not specified, the algorithm can output a clustering tree, called a dendrogram: scipy.cluster.hierarchy.dendrogram: Doc.
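➩ A minimal sketch of building and plotting a dendrogram with scipy on a small synthetic dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))  # 12 items, 2 features

# linkage records the full merge history; no number of clusters is needed.
Z = linkage(X, method="complete")
dendrogram(Z)
plt.show()
```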

📗 Comparison
➩ Since the labeling of clusters is arbitrary, two clusterings whose labels are permutations of each other should be considered the same clustering.
➩ The Rand index is one measure of similarity between clusterings: sklearn.metrics.rand_score(y1, y2) computes the similarity between clusterings y1 and y2, given as two lists of labels: Doc.
➩ To compute the Rand index, loop through all pairs of items and count the number of pairs the clusterings agree on (both clusterings put the pair in the same cluster, or both put the pair in different clusters), then divide by the total number of pairs.
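➩ A direct (but O(n²)) implementation of this pair-counting definition, checked against sklearn.metrics.rand_score:

```python
from itertools import combinations
from sklearn.metrics import rand_score

def rand_index(y1, y2):
    pairs = list(combinations(range(len(y1)), 2))
    agree = 0
    for i, j in pairs:
        same1 = y1[i] == y1[j]  # does clustering 1 put i and j together?
        same2 = y2[i] == y2[j]  # does clustering 2 put i and j together?
        if same1 == same2:      # the two clusterings agree on this pair
            agree += 1
    return agree / len(pairs)

y1 = [0, 0, 1, 1, 2, 2]
y2 = [1, 1, 0, 0, 0, 2]  # labels permuted and one item moved
print(rand_index(y1, y2), rand_score(y1, y2))  # the two values should match
```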

📗 Adjusted Rand Index
➩ The Rand index is a similarity score between 0 and 1, where 1 represents a perfect match: the clustering labels are permutations of each other.
➩ A score of 0 does not have a clear interpretation, and even unrelated random labelings usually score well above 0.
➩ The adjusted Rand index rescales the score so that 0 corresponds to the expected score of a random labeling: sklearn.metrics.adjusted_rand_score(y1, y2) computes the similarity between clusterings y1 and y2, given as two lists of labels: Doc.
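➩ A quick illustration of the difference: on random labelings the plain Rand index is well above 0, while the adjusted Rand index is close to 0.

```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=1000)    # one random labeling with 3 clusters
y_random = rng.integers(0, 3, size=1000)  # an unrelated random labeling

print("rand index:", rand_score(y_true, y_random))
print("adjusted rand index:", adjusted_rand_score(y_true, y_random))
```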


📗 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link.

Last Updated: June 19, 2024 at 11:27 PM