
# Lecture

📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code:741565), but you can submit your answers through Form at the end of the lectures too.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.


# Lecture Notes

TopHat Game
➩ There will be 20 questions on the exam: 10 from past exams and quizzes, and 10 new questions (see Link for details). I will post \(n\) more questions next Monday that are identical or similar to \(n\) of the new questions on the exam.
➩ A: \(n = 0\)
➩ B: \(n = 1\) if more than 50 percent of you choose B.
➩ C: \(n = 2\) if more than 75 percent of you choose C.
➩ D: \(n = 3\) if more than 95 percent of you choose D.
➩ E: \(n = 0\)

📗 K Means Clustering
➩ Another clustering method is K means clustering: Link.
(0) Start with \(K\) random centers (also called centroids) \(\mu_{1}, \mu_{2}, ..., \mu_{K}\).
(1) Assign step: for each point (item), find the center \(k\) that is closest to it, and label the point as \(k\).
(2) Center step: update center \(\mu_{k}\) to be the center of the points labeled \(k\).
(3) Repeat steps (1) and (2) until the cluster centers do not change.
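
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the course notebook: the function name `k_means`, the `max_iter` and `seed` parameters, the choice of initial centers (K distinct data points), and the handling of empty clusters are all assumptions of this sketch.

```python
import numpy as np

def k_means(x, k, max_iter=100, seed=0):
    """Plain K means on an (n, m) array x; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step (0): use K distinct data points as the initial centers (one common choice).
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step (1), assign: label each point with the index of its closest center.
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step (2), center: move each center to the mean of the points labeled with it.
        # An empty cluster keeps its old center (an assumption of this sketch).
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step (3): stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical toy data: two well-separated groups of three points each.
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = k_means(x, k=2)
print(labels)
print(centers)
```

On this toy data the two groups should end up with different labels; which group is labeled 0 depends on the random initial centers.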

 Total Distortion
➩ The objective of K means clustering is to minimize the total distortion, also called inertia: the sum of distances (usually squared Euclidean distances) from the points to their centers, or \(\displaystyle\sum_{i=1}^{n} \left\|x_{i} - \mu_{k\left(x_{i}\right)}\right\|^{2} = \displaystyle\sum_{i=1}^{n} \displaystyle\sum_{j=1}^{m} \left(x_{ij} - \mu_{k\left(x_{i}\right)j}\right)^{2}\), where \(m\) is the number of features and \(k\left(x_{i}\right)\) is the index of the cluster center closest to \(x_{i}\), or \(k\left(x_{i}\right) = \mathop{\mathrm{argmin}}_{k} \left\|x_{i} - \mu_{k}\right\|\).
➩ K means is initialized at a random clustering, and each assign-center step acts as a gradient descent step for minimizing the total distortion by choosing the cluster centers: neither the assign step nor the center step can increase the total distortion.
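
As a small worked example of the total distortion, the sketch below computes it for four points and two centers, then applies one center step and computes it again. The helper name `total_distortion` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def total_distortion(x, centers):
    """Sum of squared Euclidean distances from each point to its closest center."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # shape (n, K)
    return sq_dist.min(axis=1).sum()

# Hypothetical toy data: four corners of a 4-by-2 rectangle, two starting centers.
x = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
centers = np.array([[0.0, 0.0], [4.0, 0.0]])
before = total_distortion(x, centers)  # 0 + 4 + 0 + 4 = 8

# One center step: move each center to the mean of the points closest to it.
labels = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
centers = np.array([x[labels == j].mean(axis=0) for j in range(len(centers))])
after = total_distortion(x, centers)  # 1 + 1 + 1 + 1 = 4

print(before, after)
```

Here the center step moves the centers to \((0, 1)\) and \((4, 1)\), which lowers the total distortion from 8 to 4.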

📗 Number of Clusters
➩ The number of clusters is usually chosen based on application requirements, since there is no optimal number of clusters.
➩ If the number of clusters is \(n\) (each point is in a different cluster), then the total distortion is 0. This means minimizing the total distortion is not a good way to select the number of clusters.
➩ The elbow method is sometimes used to determine the number of clusters based on the total distortion, but it is not a clearly defined algorithm: Link.
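
To see how the total distortion behaves as the number of clusters grows, the sketch below fits K means for every \(K\) from 1 to \(n\) on a small made-up data set. It assumes scikit-learn is installed and uses its `KMeans` class, whose `inertia_` attribute is the total distortion at the fitted centers. The data and the variable name `distortion` are hypothetical, and this is not necessarily how the course notebook does it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three tight groups of three points each.
x = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 0.0], [10.0, 1.0], [11.0, 0.0],
    [5.0, 9.0], [5.0, 10.0], [6.0, 9.0],
])

distortion = {}
for k in range(1, len(x) + 1):
    # n_init=10 reruns K means from 10 random starts and keeps the best fit.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    distortion[k] = model.inertia_  # total distortion at the fitted centers

for k, value in distortion.items():
    print(f"K = {k}: total distortion = {value:.3f}")
```

For this toy data the total distortion should fall sharply up to \(K = 3\) and flatten afterwards, reaching 0 at \(K = n\). The bend at \(K = 3\) is the "elbow", although on real data the bend is often much less clear.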

Economic Data Example Again
➩ Apply 5-means clustering to the economic data for the US states.
➩ Code for clustering: Notebook.
➩ Compare K-means with different values of K: the "elbow method" seems to suggest around 4 to 6 clusters.


 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: June 19, 2024 at 11:27 PM