
# Lecture

📗 The lecture is in person, but you can join on Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekend.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code: 741565), but you can also submit your answers through the Form at the end of the lectures.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.


# Lecture Notes

TopHat Game ➭ There will be 20 questions on the exam, 10 of them from past exams and quizzes, and 10 of them new questions (see Link for details). I will post \(n\) more questions next Monday that are identical or similar to \(n\) of the new questions on the exam.
➭ A: \(n = 0\)
➭ B: \(n = 1\) if more than 50 percent of you choose B.
➭ C: \(n = 2\) if more than 75 percent of you choose C.
➭ D: \(n = 3\) if more than 95 percent of you choose D.
➭ E: \(n = 0\)

📗 K Means Clustering
➭ Another clustering method is K means clustering: Link. A minimal Python sketch follows the steps below.
(0) Start with \(K\) random centers (also called centroids) \(\mu_{1}, \mu_{2}, ..., \mu_{K}\).
(1) Assign step: assign each point (item) to the closest center \(k\), and label the point as \(k\).
(2) Center step: update center \(\mu_{k}\) to be the center of the points labeled \(k\).
(3) Repeat until cluster centers do not change.
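➭ A minimal NumPy sketch of these steps (the names `k_means`, `X`, and `K` are illustrative, not from the lecture notebook, and it assumes no cluster ever becomes empty):

```python
import numpy as np

def k_means(X, K, seed=0):
    rng = np.random.default_rng(seed)
    # (0) start with K random centers chosen among the data points
    centers = X[rng.choice(len(X), size=K, replace=False)]
    while True:
        # (1) assign step: label each point with the index of its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (2) center step: move each center to the mean of its labeled points
        # (assumes every cluster keeps at least one point)
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # (3) repeat until the cluster centers do not change
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers
```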



📗 Total Distortion
➭ The objective of K means clustering is minimizing the total distortion, also called inertia: the sum of distances (usually squared Euclidean distances) from the points to their centers, or \(\displaystyle\sum_{i=1}^{n} \left\|x_{i} - \mu_{k\left(x_{i}\right)}\right\|^{2} = \displaystyle\sum_{i=1}^{n} \displaystyle\sum_{j=1}^{m} \left(x_{ij} - \mu_{k\left(x_{i}\right)j}\right)^{2}\), where \(k\left(x_{i}\right)\) is the index of the cluster closest to \(x_{i}\), or \(k\left(x_{i}\right) = \mathop{\mathrm{argmin}}_{k} \left\|x_{i} - \mu_{k}\right\|\).
➭ K means is initialized at a random clustering, and each assign-center step is a gradient descent step for minimizing the total distortion by choosing the cluster centers.
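➭ A sketch of computing the total distortion directly, using the `labels` and `centers` from the `k_means` sketch above (scikit-learn's `KMeans` exposes the same quantity as its `inertia_` attribute):

```python
import numpy as np

def total_distortion(X, labels, centers):
    # sum over all points of the squared Euclidean distance
    # from the point to its assigned cluster center
    return float(((X - centers[labels]) ** 2).sum())
```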

📗 Number of Clusters
➭ The number of clusters is usually chosen based on application requirements, since there is no optimal number of clusters.
➭ If the number of clusters is \(n\) (each point is in a different cluster), then the total distortion is 0. This means minimizing the total distortion is not a good way to select the number of clusters.
➭ The elbow method is sometimes used to determine the number of clusters based on the total distortion, but it is not a clearly defined algorithm: Link.
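➭ A sketch of the elbow heuristic with scikit-learn, plotting the total distortion (`inertia_`) against \(K\) and looking for the bend (the function name `elbow_plot` is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, max_k=10):
    ks = range(1, max_k + 1)
    # total distortion of the fitted clustering for each K
    inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]
    plt.plot(ks, inertias, marker="o")
    plt.xlabel("number of clusters K")
    plt.ylabel("total distortion (inertia)")
    plt.show()
```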

Economic Data Example Again ➭ Apply 5-means clustering to the economic data for the US states (see the sketch below).
➭ Code for clustering: Notebook.
➭ Compare K-means with different values of K: the "elbow method" seems to suggest around 4 to 6 clusters.
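➭ A minimal sketch of the 5-means example; the actual code is in the linked Notebook, and the file name `economic.csv` and column layout here are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# hypothetical file: one row per state, numeric economic features
econ = pd.read_csv("economic.csv", index_col=0)
X = StandardScaler().fit_transform(econ)  # scale so no feature dominates
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)
print(pd.Series(labels, index=econ.index).sort_values())
```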




📗 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: April 29, 2024 at 1:10 AM