# Lecture Notes
📗 Unsupervised Learning
➩ If the groups are discrete: clustering
➩ If the groups are continuous (lower dimensional representation): dimensionality reduction
➩ The output of unsupervised learning can be used as input for supervised learning too (discrete groups as categorical features and continuous groups as continuous features).
| Item | Input (Features) | Output (Labels) |
|---|---|---|
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | no label |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | - |
| ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | - |

➩ The goal is to put items with similar \(x\) in the same or similar groups.
📗 US States Economic Data Example
➩ US economics data can be found on Link.
➩ Map data can be found on Link.
➩ Use the features "real per capita personal income", "real per capita personal consumption expenditures", and "regional price parities".
➩ Note: see `pivot` for the correct way of working with panel data: Doc.
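A minimal pandas sketch of the pivot step, assuming the economic data has been downloaded as a CSV in "long" format; the file name and the column names `GeoName`, `Description`, and `2022` are assumptions about the download, not part of the notes.

```python
import pandas as pd

# Hypothetical file and column names: assume one row per (state, series) pair,
# with the series name in "Description" and one column per year.
df = pd.read_csv("us_states_economic_data.csv")

features = [
    "Real per capita personal income",
    "Real per capita personal consumption expenditures",
    "Regional price parities",
]
df = df[df["Description"].isin(features)]

# pivot turns the long panel data into one row per state and one column per
# feature, using the values from a single year.
X = df.pivot(index="GeoName", columns="Description", values="2022")
print(X.head())
```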
📗 Hierarchical Clustering
➩ Hierarchical clustering starts with \(n\) clusters and iteratively merges the closest pair of clusters: Link.
➩ It is also called agglomerative clustering, and can be performed using `sklearn.cluster.AgglomerativeClustering`: Doc (a short sketch is given below).
➩ Different ways of defining the distance between two clusters are called different linkages; see `scipy.cluster.hierarchy.linkage`: Doc.
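A minimal sketch of both interfaces on made-up 2D points: `AgglomerativeClustering` returns flat cluster labels, while `scipy.cluster.hierarchy.linkage` returns the merge history.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage

# Toy 2D data: two well-separated groups of three points each (values made up).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# scikit-learn: merge clusters until 2 remain, using complete linkage.
labels = AgglomerativeClustering(n_clusters=2, linkage="complete").fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1]; the label values themselves are arbitrary

# scipy: each row of Z records one merge (the two clusters merged,
# the linkage distance between them, and the size of the new cluster).
Z = linkage(X, method="complete", metric="euclidean")
print(Z)
```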
📗 Distance Measure
➩ The distance between points can be measured by norms. The distance between items \(x_{1} = \left(x_{11}, x_{12}, ..., x_{1m}\right)\) and \(x_{2} = \left(x_{21}, x_{22}, ..., x_{2m}\right)\) can be:
(1) Manhattan distance (`metric = "manhattan"`): \(\left| x_{11} - x_{21} \right| + \left| x_{12} - x_{22} \right| + ... + \left| x_{1m} - x_{2m} \right|\), Link,
(2) Euclidean distance (`metric = "euclidean"`): \(\sqrt{\left(x_{11} - x_{21}\right)^{2} + \left(x_{12} - x_{22}\right)^{2} + ... + \left(x_{1m} - x_{2m}\right)^{2}}\),
(3) Cosine similarity distance (`metric = "cosine"`): \(1 - \dfrac{x^\top_{1} x_{2}}{\sqrt{x^\top_{1} x_{1}} \sqrt{x^\top_{2} x_{2}}}\).
...
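A small check of the three distances using scipy (the vectors are made up for illustration):

```python
import numpy as np
from scipy.spatial import distance

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 4.0])

print(distance.cityblock(x1, x2))  # Manhattan: |1 - 2| + |2 - 0| + |3 - 4| = 4
print(distance.euclidean(x1, x2))  # Euclidean: sqrt(1 + 4 + 1)
print(distance.cosine(x1, x2))     # Cosine distance: 1 - (x1 . x2) / (||x1|| ||x2||)
```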
📗 Average Linkage Distance
➩ If average linkage distance (`linkage = "average"`) is used, then the distance between two clusters is defined as the average of the pairwise distances between points, one from each cluster (see the sketch below).
➩ These cluster distances have to be updated after every merge, and a naive implementation that recomputes the averages over all pairs of points can be very slow.
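A tiny sketch of the average linkage distance between two made-up clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters on a line; cdist gives all pairwise distances between them.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
print(cdist(A, B).mean())  # average linkage: (4 + 5 + 3 + 4) / 4 = 4.0
```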
📗 Single and Complete Linkage Distance
➩ If single linkage distance (`linkage = "single"`) is used, then the distance between two clusters is defined as the smallest distance between any pair of points, one from each cluster.
➩ If complete linkage distance (`linkage = "complete"`) is used, then the distance between two clusters is defined as the largest distance between any pair of points, one from each cluster (see the sketch below).
➩ With single or complete linkage distances, pairwise distances between points only have to be computed once at the beginning, so clustering is typically faster.
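Using the same toy clusters as above, single and complete linkage are the minimum and maximum of the pairwise distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
D = cdist(A, B)  # all pairwise distances, computed once
print(D.min())   # single linkage: closest pair, 3.0
print(D.max())   # complete linkage: farthest pair, 5.0
```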
📗 Single vs Complete Linkage
➩ Since single linkage distance looks at the nearest pair of points, it is more likely to produce clusters that look like chains, in which each point is close to its neighbors but not necessarily to all other points in the cluster.
➩ Since complete linkage distance looks at the farthest pair of points, it is more likely to produce clusters that look like compact blobs (for example, circles), in which all points are close to a center.
➩ The choice usually depends on the application.
📗 Comparison Example
➩ Compare single and complete linkage clustering on the circles and moons datasets.
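A sketch of the comparison, assuming the scikit-learn toy dataset generators `make_circles` and `make_moons`:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_circles, make_moons

datasets = {
    "circles": make_circles(noise=0.05, factor=0.5, random_state=0),
    "moons": make_moons(noise=0.05, random_state=0),
}
for name, (X, y) in datasets.items():
    for link in ["single", "complete"]:
        # Single linkage tends to recover the rings/moons by following chains of
        # nearby points; complete linkage tends to cut them into blobs.
        labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
        print(name, link, labels[:10])
```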
📗 Number of Clusters
➩ The number of clusters is usually chosen based on application requirements, since there is no single optimal number of clusters.
➩ If the number of clusters is not specified, the algorithm can output the full clustering tree, called a dendrogram.
➩ `scipy.cluster.hierarchy.dendrogram`: Doc.
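A minimal dendrogram sketch with made-up data (matplotlib is assumed for the plot):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Random toy data; the dendrogram shows every merge, so the number of clusters
# can be chosen afterwards by cutting the tree at some height.
X = np.random.RandomState(0).rand(10, 2)
Z = linkage(X, method="complete")
dendrogram(Z)
plt.show()
```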
📗 Comparison
➩ Since the labeling of clusters is arbitrary, two clusterings whose labels are permutations of each other should be considered the same clustering.
➩ The Rand index is one measure of similarity between clusterings.
➩ `sklearn.metrics.rand_score(y1, y2)` computes the similarity between clustering `y1` and clustering `y2`, given as two lists of labels: Doc.
➩ To compute the Rand index, loop through all pairs of items and count the number of pairs on which the two clusterings agree (both put the pair in the same cluster, or both put the pair in different clusters), then divide by the total number of pairs.
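A small sketch comparing `rand_score` with the pair-counting definition above (the labels are made up):

```python
from itertools import combinations
from sklearn.metrics import rand_score

y1 = [0, 0, 1, 1]
y2 = [0, 0, 1, 2]

# A pair "agrees" if both clusterings put it in the same cluster,
# or both put it in different clusters.
n = len(y1)
agree = sum((y1[i] == y1[j]) == (y2[i] == y2[j]) for i, j in combinations(range(n), 2))
total = n * (n - 1) // 2
print(agree / total, rand_score(y1, y2))  # both should be 5/6
```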
📗 Adjusted Rand Index
➩ The Rand index is a similarity score between 0 and 1, where 1 represents a perfect match: the clustering labels are permutations of each other.
➩ The meaning of 0 is not clear, since even a random labeling usually gets a Rand index well above 0.
➩ The adjusted Rand index is used instead so that a score of 0 represents random labeling.
➩ `sklearn.metrics.adjusted_rand_score(y1, y2)` computes the adjusted similarity between clustering `y1` and clustering `y2`, given as two lists of labels: Doc.
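A small sketch contrasting the two scores on an unrelated labeling (the labels are made up):

```python
from sklearn.metrics import rand_score, adjusted_rand_score

y_true = [0, 0, 0, 1, 1, 1]
y_unrelated = [0, 1, 0, 1, 0, 1]  # a labeling unrelated to y_true

# The plain Rand index of an unrelated labeling is usually well above 0, while
# the adjusted Rand index is close to 0 (it can even be slightly negative).
print(rand_score(y_true, y_unrelated))
print(adjusted_rand_score(y_true, y_unrelated))
```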
Notes and code adapted from the course taught by Yiyin Shen (Link) and Tyler Caraza-Harter (Link).