Young Wu's Homepage

Official Due Date: August 1

# Programming Problem Instruction

📗 Enter your ID on the page and click the button to generate your parameters.

📗 The same ID always generates the same set of parameters. Your answers are not saved when you close the browser. You can either copy and paste your console output into the text boxes, or print your output to text files (.txt) and load them using the button above the text boxes.

📗 Please report any bugs on Piazza.

# Warning: please enter your ID before you start!

📗 (Introduction) In this programming homework, you will cluster countries into groups using a COVID-19 dataset. You will estimate parametric models of virus spread and cluster the countries using the estimated parameters as features.

📗 (Part 1) Download the COVID-19 global deaths dataset from Johns Hopkins University: Link. It is okay to use another similar time-series dataset on COVID-19 deaths that you trust. You could also remove rows corresponding to countries that you believe are reporting false data. Combine the rows (add up the numbers) that belong to the same country.
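The "combine the rows" step can be sketched in plain Python as follows. This is a minimal sketch that assumes the file has four leading metadata columns (province/state, country/region, latitude, longitude) followed by one column per date; the function name `combine_by_country` and the toy numbers are made up for illustration, so check the layout of the file you actually download.

```python
import csv
import io

# Toy rows in the assumed layout; the numbers are invented for illustration.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
Alberta,Canada,53.9,-116.5,0,1,3
Ontario,Canada,51.2,-85.3,0,2,5
,Japan,36.2,138.2,1,1,2
"""

def combine_by_country(csv_text, n_meta=4):
    """Add up the cumulative counts of all rows that share a country name."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    dates = header[n_meta:]          # date labels, one per time step
    totals = {}                      # country name -> summed time series
    for row in reader:
        country = row[1]             # assumed position of the country column
        counts = [int(float(v)) for v in row[n_meta:]]
        if country in totals:
            totals[country] = [a + b for a, b in zip(totals[country], counts)]
        else:
            totals[country] = counts
    return dates, totals

dates, totals = combine_by_country(SAMPLE)
print(totals["Canada"])  # [0, 3, 8]
```

For a real file, the same function can be fed the text read from disk, for example `open(path, newline="").read()`.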

📗 (Part 1) Estimate a parametric model that describes the trend for each country. The model should have at least 3 parameters, but preferably fewer than 10. For a more accurate model, you should divide the number of deaths by the total population of the country, but this is optional for this homework.

(1) One example is to fit a three-parameter logistic curve: Wikipedia. Note that this is not logistic regression. You could use gradient descent to minimize the squared error loss, or look into non-linear least squares using Newton's method if you are interested in general curve-fitting techniques.
(2) Another example is to fit a three-parameter normal distribution (on the time-differenced data): Wikipedia. You could use maximum likelihood estimation or, perhaps more simply, the method of moments if you learned it in a statistics course.
(3) Another example is to directly estimate the growth rates by counting the number of days it takes the count to double, quadruple, and octuple: see Professor Zhu's Ten Hundred Plot for a two-parameter version: Link.
(4) Other reasonable or creative models are acceptable as long as they are not trivial, such as "use the last three numbers in the row" or "count the number of days cases increase more than 10, 100, 1000" or something similar to the TA's solution.
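As one illustration of option (2), the method of moments can be sketched by treating the differenced series as an unnormalized histogram over day indices. The three parameters are then the total count (scale), the weighted mean day, and the weighted standard deviation. This is only a sketch under those assumptions: the function name `moment_parameters` is hypothetical, and clipping negative daily values (which can appear after data corrections) is a choice made here, not something the assignment prescribes.

```python
import math

def moment_parameters(daily):
    """Return (total, mean_day, std_day) for a series of daily counts.

    Day indices are 0, 1, 2, ...; negative entries are clipped to 0.
    """
    weights = [max(0.0, float(d)) for d in daily]
    total = sum(weights)
    if total == 0:
        return 0.0, 0.0, 0.0  # no recorded deaths: degenerate curve
    mean = sum(t * w for t, w in enumerate(weights)) / total
    var = sum(w * (t - mean) ** 2 for t, w in enumerate(weights)) / total
    return total, mean, math.sqrt(var)

# Toy differenced series (invented numbers): a symmetric bump around day 3.
total, mean, std = moment_parameters([0, 1, 2, 4, 2, 1, 0])
print(total, mean, round(std, 4))
```

Because the three parameters live on very different scales, you may want to think about whether to rescale them before computing Euclidean distances.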

📗 (Part 2) Use hierarchical clustering with single linkage and with complete linkage to cluster the countries into \(k\) clusters based on their parameter values (the value of \(k\) is generated from your ID). Use Euclidean distance.
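A naive agglomerative clustering routine that supports both linkages might look like the sketch below. It is written for clarity rather than speed, and it breaks ties by keeping the first closest pair it finds, which is an assumption and may differ from the course's tie-breaking rules. The function name `hierarchical_clusters` is hypothetical.

```python
import math

def hierarchical_clusters(points, k, linkage="single"):
    """Merge the two closest clusters until k remain; return a label per point."""
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    # Single linkage uses the closest pair of members, complete the farthest.
    pick = min if linkage == "single" else max
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    clusters = [[i] for i in range(n)]  # start with one cluster per point
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = pick(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy one-dimensional points where the two linkages disagree.
pts = [[0.0], [1.0], [2.1], [3.3], [5.0]]
print(hierarchical_clusters(pts, 2, "single"))    # [0, 0, 0, 0, 1]
print(hierarchical_clusters(pts, 2, "complete"))  # [0, 0, 1, 1, 1]
```

Single linkage chains the first four points together, while complete linkage prefers two more compact groups.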

📗 (Part 2) Use k-means clustering to cluster the countries into the same number of clusters based on their parameter values. Compute the final cluster centers and the total distortion. Use Euclidean distance.
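The k-means step can be sketched with Lloyd's algorithm in plain Python. Seeding the centers with the first \(k\) points is an assumption made here to keep the example deterministic; random or other initializations are equally valid. The function name `kmeans` and the toy points are made up for illustration.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centers, total_distortion)."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are final
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Total distortion: sum of squared Euclidean distances to assigned centers.
    distortion = sum(math.dist(p, centers[lab]) ** 2
                     for p, lab in zip(points, labels))
    return labels, centers, distortion

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers, distortion = kmeans(pts, 2)
print(labels, centers, distortion)  # [0, 0, 1, 1] [[0.0, 0.5], [10.0, 10.5]] 1.0
```

If the loop stops because `max_iter` is reached rather than because the assignments stabilized, the labels and centers may be slightly out of sync, so it is worth checking convergence on real data.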

# Submission

# Question 1 [1 point]

📗 (original) Enter the cumulative time series for the US and Canada (remember to add up the numbers from each state or province) (two lines, each line containing integers, comma separated).

# Question 2 [2 points]

📗 (difference) Enter the differenced time series for the US and Canada. Compute the difference between consecutive numbers in the previous question; the resulting time series represents the number of additional deaths each day (two lines, each line containing integers, comma separated, with one fewer integer per line than in the previous question).
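Differencing takes one line in Python. The helper name `difference` and the toy numbers below are illustrative only.

```python
def difference(cumulative):
    """Return the day-over-day increases of a cumulative series."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]

diffs = difference([0, 3, 8, 8, 15])
print(",".join(str(d) for d in diffs))  # 3,5,0,7
```

The result has one fewer entry than the input, matching the format requested in this question.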

# Question 3 [5 points]

📗 Briefly explain the method you use to obtain the parameters. (The auto-grader will assign 5/5 for anything you enter, but I will go through the answers manually after the final exam to check whether the method is something trivial.)

# Question 4 [5 points]

📗 (parameters) Input the parameter estimates as a matrix, one row for each country (more than 100 and fewer than 200 lines, each line containing at least 3 numbers, comma separated, rounded to 2 decimal places). Call the number of rows \(n\) and the number of columns \(m\) for later questions. Please do not include an index column, a column with country names, or similar.
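One way to produce the comma-separated, rounded lines is a small formatting helper like the sketch below. The name `format_matrix` and the sample values are hypothetical; the same helper with `decimals=4` would fit the format asked for the cluster centers.

```python
def format_matrix(rows, decimals=2):
    """Format rows of numbers as comma-separated lines with fixed decimals."""
    return "\n".join(
        ",".join(f"{value:.{decimals}f}" for value in row) for row in rows
    )

text = format_matrix([[10.0, 3.0, 1.0954], [250.0, 41.256, 7.5]])
print(text)
# 10.00,3.00,1.10
# 250.00,41.26,7.50
```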

# Question 5 [10 points]

📗 (hacs) Input the clusters from single linkage hierarchical clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated).

# Question 6 [10 points]

📗 (hacc) Input the clusters from complete linkage hierarchical clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated).

# Question 7 [10 points]

📗 (kmeans) Input the clusters from k-means clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated).

# Question 8 [5 points]

📗 (centers) Enter the cluster centers from k-means clustering (\(k\) lines, each line containing \(m\) numbers, comma separated, rounded to 4 decimal places).

# Question 9 [5 points]

📗 Enter the total distortion (use sum of squared distances) of the clustering from the previous two questions.
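The total distortion can be sketched as the sum, over all points, of the squared Euclidean distance to the nearest center. The function below is a minimal illustration with a hypothetical name; it assigns each point to its closest center before summing, which matches the final k-means assignment when the algorithm has converged.

```python
import math

def total_distortion(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

# Toy example: two tight pairs, each 0.5 away from its center.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
ctrs = [(0, 0.5), (10, 10.5)]
print(total_distortion(pts, ctrs))  # 1.0
```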

# Question 10 [1 point]

📗 Please enter any comments and suggestions including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.

# Grade

* * * * *

* * * * *

📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.

📗 Please copy and paste the text between the *s (not including the *s) and submit it on Canvas, P4.

📗 Please submit your code and outputs on Canvas, P4S.

📗 You could also save your output as a single text file using the button and submit this to P4S (with your code).

📗 Warning: the load button does not function properly for all questions, so please recheck everything after you load. You could load your answers using the button from the text field:

📗 Saving and loading may take around 5 to 10 seconds. Please be patient and do not click the buttons multiple times.

# Hints and Solutions (frequently updated)

📗 Tie-breaking rules likely won't matter, but the auto-grader prefers the rules described in M8.

📗 Question 8 is graded based on Question 7, so make sure you recompute the centers based on the clusters in Question 7.

📗 Question 9 depends on Questions 7 and 8. Please make sure that all points are reassigned according to the cluster centers in Question 8.

📗 Questions 5, 6, and 7 are graded based on the adjusted Rand index. As a result, tie-breaking rules and permutations of the cluster indices should not matter. Details about the Rand index: Wikipedia.
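To see why relabeling the clusters does not change the score, here is a sketch of the adjusted Rand index using the standard pair-counting formula. It is not the auto-grader's code; the function name is hypothetical, and it assumes at least two points.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same n >= 2 points."""
    n = len(labels_a)
    # Pairs of points placed together in both clusterings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both put every point in one cluster
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

The first call swaps the two labels and still scores 1.0, while the second splits every original cluster and scores below zero.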

📗 A sample solution in Java and Python is posted below.

Important notes:
(1) The complete linkage and Euclidean distance versions are not included.
(2) The total distortion should be computed by sum of squared Euclidean distances to the centers.
(3) You are allowed to copy and use parts of the TA's solution without attribution. You are allowed to use code from other people and from the Internet, but you must state in the comments clearly where they come from!

Python code by Hugh Liu: Link
Java code by Ainur Ainabekova: Link.

Last Updated: July 14, 2024 at 9:37 PM