


# P4 Programming Problem Instruction

📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit enter key)
📗 The official deadline is August 1, but you can submit or resubmit without penalty until August 15.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could either copy and paste or load your program outputs into the text boxes for individual questions or print all your outputs to a single text file and load it using the button at the bottom of the page.
📗 Please do not refresh the page: your answers will not be saved.
📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.
📗 Please report any bugs on Piazza.

# Warning: please enter your ID before you start!



📗 (Introduction) In this programming homework, you will cluster US states into groups using a COVID-19 dataset. You will estimate parametric models of virus spread and cluster the states using these parameters as features.

📗 (Part 1) Download the US COVID-19 deaths dataset from Johns Hopkins University: Link, the dataset "time_series_covid19_deaths_US.csv". It is okay to use another similar time-series dataset on COVID-19 deaths that you trust.

📗 (Part 1) Estimate a parametric model that describes the trend. There should be at least 5 parameters, but preferably fewer than 10. For a more accurate model, you should divide the number of deaths by the total population of the state; you can use the "Population" column in the Johns Hopkins dataset or another source, for example, Wikipedia. Also remember to rescale the parameters so that they have the same range. The parameters you could estimate include:
(1) Descriptive statistics for the time-differenced data: for example, mean, variance, standard deviation, median, selected percentiles, max, etc. Some statistics, such as mode and min, might produce the same value for all states; please do not use those as parameters.
(2) Maximum likelihood estimates of parametric models, for example, \(\hat{\mu}\) and \(\hat{\sigma}\) for a normal distribution, or \(\hat{\lambda}\) for a Poisson distribution. You can use a combination of the parameters from multiple models.
(3) Direct estimates of the growth rates, obtained by counting the number of days to double, quadruple, octuple, etc.: see Professor Zhu's Ten Hundred Plot for a two-parameter version: Link.
(4) Estimation of trends using curve-fitting algorithms, for example, linear regression coefficients for time, squared time, cubed time, etc.
(5) Other reasonable or creative parameters are acceptable as long as they are NOT something trivial like "use the last three numbers in the row" or "count the number of days cases increase more than 10, 100, 1000" or something similar to the TA's solution.
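As an illustration of item (1), here is a minimal Python sketch that computes a few descriptive statistics of the per-capita daily deaths for one state. The function name, the choice of five statistics, and the simple index-based percentile are assumptions made for this example; any reasonable set of parameters from the list above would work as well.

```python
import statistics


def describe_daily_deaths(daily, population):
    """Return five descriptive statistics of per-capita daily deaths.

    `daily` is the time-differenced series (new deaths per day) and
    `population` is the state population. Both names are illustrative.
    """
    per_capita = [d / population for d in daily]
    ordered = sorted(per_capita)
    # Simple index-based 90th percentile; other conventions are also fine.
    p90 = ordered[min(len(ordered) - 1, int(0.9 * len(ordered)))]
    return [
        statistics.mean(per_capita),
        statistics.pstdev(per_capita),  # population standard deviation
        statistics.median(per_capita),
        p90,
        max(per_capita),
    ]


# Made-up numbers, only to show the call.
features = describe_daily_deaths([0, 2, 4, 6, 8], population=1000)
```

The resulting list is one row of the parameter matrix; the columns still need to be rescaled across states before clustering.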

📗 (Part 2) Use hierarchical clustering with single and complete linkages to cluster the states into \(k\) = clusters based on their parameter values. Use Euclidean distance.

📗 (Part 2) Use k-means clustering to cluster the states into the same number of clusters based on their parameter values. Compute the final cluster centers and the total distortion. Use Euclidean distance.

# Question 1

📗 [1 point] Enter the cumulative time series for Wisconsin and another state of your choice (two lines, each line containing integers, comma separated).
Hint: You should add up the rows for the same state.
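The hint above can be sketched in Python as follows. The sample data is made up, and the column layout (a "Province_State" column followed by one column per date) is an assumption about the Johns Hopkins file that you should check against the file you actually download.

```python
import csv
import io

# Tiny made-up sample in the assumed layout: one row per county,
# a "Province_State" column, then one cumulative count per date.
SAMPLE = """Province_State,Population,1/22/20,1/23/20,1/24/20
Wisconsin,100,0,1,3
Wisconsin,200,1,1,2
Texas,300,0,2,2
"""


def cumulative_by_state(csv_text, first_date_column):
    """Sum the county rows of each state into one cumulative series."""
    reader = csv.DictReader(io.StringIO(csv_text))
    start = reader.fieldnames.index(first_date_column)
    date_columns = reader.fieldnames[start:]
    totals = {}
    for row in reader:
        series = totals.setdefault(row["Province_State"], [0] * len(date_columns))
        for i, column in enumerate(date_columns):
            series[i] += int(row[column])
    return totals


totals = cumulative_by_state(SAMPLE, "1/22/20")
wisconsin_line = ",".join(str(v) for v in totals["Wisconsin"])  # "1,2,5"
```

For the real file you would read the text from disk instead of `SAMPLE` and pass the name of its first date column.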




# Question 2

📗 [2 points] Enter the differenced time series for Wisconsin and the other state you chose (compute the difference between consecutive numbers in the previous question; this time series represents the number of additional deaths each day) (two lines, each line containing integers, one fewer than the number of integers per line in the previous question, comma separated).
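A minimal Python sketch of the differencing step, with a made-up cumulative series; the function name is only illustrative.

```python
def difference(cumulative):
    """Return the day-over-day changes of a cumulative series."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


daily = difference([1, 2, 5, 5, 9])  # [1, 3, 0, 4]
```

The output has one fewer entry than the input, as the question requires.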




# Question 3

📗 [5 points] Enter the list of parameters you estimated. Include a brief description if the parameter is not from items (1) to (4). This question may be regraded manually.


# Question 4

📗 [5 points] Input the parameter estimates as a matrix, one row for each state (~50 lines, each line contains at least 5 numbers, comma separated, rounded to 2 decimal places). Call the number of rows \(n\) and the number of columns \(m\) for later questions. Make sure you rescale the parameters so that they have the same range, and please do not include an index column or a column with state names etc.
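One common way to give every parameter the same range is min-max rescaling of each column to [0, 1]. The sketch below assumes that choice (other rescalings, such as standardization, are equally reasonable) and uses a made-up 3 by 2 matrix.

```python
def rescale_columns(matrix):
    """Min-max rescale each column of a row-major matrix to [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        # A constant column carries no information; map it to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in matrix
    ]


raw = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
scaled = rescale_columns(raw)
# One comma-separated line per state, rounded to 2 decimal places.
lines = [",".join(f"{v:.2f}" for v in row) for row in scaled]
```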




# Question 5

📗 [10 points] Input the clusters from single linkage hierarchical clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated). For example, if your clusters are {1, 3, 5} in cluster 1 and {2, 4} in cluster 2, you should enter "0,1,0,1,0" or "1,0,1,0,1".
Hint
📗 Start with \(n\) clusters.
📗 Compute the distances between each pair of clusters, \(C_{i}\), \(C_{j}\) by,
\(d\left(C_{i}, C_{j}\right) = \displaystyle\min\left\{ \left\|x_{i} - x_{j}\right\| : x_{i} \in C_{i}, x_{j} \in C_{j}\right\}\).
📗 Merge the pair with the smallest distance between them.
📗 Continue until there are only \(k\) clusters.
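The steps in the hint can be sketched in Python as below. This is a straightforward, unoptimized version that recomputes all pairwise cluster distances at every merge; the function name and the one-dimensional sample points are made up for illustration.

```python
import math


def single_linkage(points, k):
    """Agglomerative clustering with single (minimum-distance) linkage."""
    clusters = [[i] for i in range(len(points))]  # start with n clusters

    def distance(a, b):
        # Smallest Euclidean distance between any two members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = [
            (distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        _, a, b = min(pairs)        # closest pair of clusters
        merged = clusters.pop(b)    # b > a, so index a stays valid
        clusters[a].extend(merged)

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


points = [(0.0,), (1.9,), (4.0,), (6.3,), (6.5,)]
labels = single_linkage(points, k=2)  # [0, 0, 0, 1, 1]
```

On this sample the point at 4.0 joins the left group because its nearest neighbor there (1.9) is closer than its nearest neighbor on the right (6.3).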




# Question 6

📗 [10 points] Input the clusters from complete linkage hierarchical clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated). For example, if your clusters are {1, 3, 5} in cluster 1 and {2, 4} in cluster 2, you should enter "0,1,0,1,0" or "1,0,1,0,1".
Hint
📗 Start with \(n\) clusters.
📗 Compute the distances between each pair of clusters, \(C_{i}\), \(C_{j}\) by,
\(d\left(C_{i}, C_{j}\right) = \displaystyle\max\left\{ \left\|x_{i} - x_{j}\right\| : x_{i} \in C_{i}, x_{j} \in C_{j}\right\}\).
📗 Merge the pair with the smallest distance between them.
📗 Continue until there are only \(k\) clusters.
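A possible Python sketch of the complete linkage steps above. The only change from single linkage is that the distance between two clusters is the largest pairwise distance; the merge rule still picks the pair of clusters with the smallest such distance. The function name and sample points are made up.

```python
import math
from itertools import combinations


def complete_linkage(points, k):
    """Agglomerative clustering with complete (maximum-distance) linkage."""
    clusters = [[i] for i in range(len(points))]  # start with n clusters

    def farthest(a, b):
        # Largest Euclidean distance between any two members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # combinations yields index pairs (a, b) with a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: farthest(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a].extend(clusters.pop(b))

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


points = [(0.0,), (1.9,), (4.0,), (6.3,), (6.5,)]
labels = complete_linkage(points, k=2)  # [0, 0, 1, 1, 1]
```

On this sample the point at 4.0 joins the right group: its farthest member there is 2.5 away, while the farthest member of the left group is 4.0 away.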




# Question 7

📗 [10 points] Input the clusters from k-means clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated). For example, if your clusters are {1, 3, 5} in cluster 1 and {2, 4} in cluster 2, you should enter "0,1,0,1,0" or "1,0,1,0,1".
Hint
📗 Start with \(k\) random points in the dataset as the cluster centers \(c_{1}, c_{2}, ..., c_{k}\).
📗 Assign each point \(x_{i}\) to a cluster by finding the index \(j\) such that \(c_{j}\) is the closest (among the \(k\) centers) to \(x_{i}\).
📗 Recompute the cluster centers,
\(c_{j} = \dfrac{1}{\left| C_{j} \right|} \displaystyle\sum_{x_{i} \in C_{j}} x_{i}\), where \(C_{j}\) is the set of points in cluster \(j\).
📗 Repeat until the cluster centers do not change.
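The iteration in the hint can be sketched in Python as follows. The fixed random seed, the iteration cap, and the rule of keeping the old center when a cluster becomes empty are assumptions added for this example; they are not part of the hint.

```python
import math
import random


def k_means(points, k, seed=0, max_iterations=100):
    """Return (labels, centers) from plain k-means with Euclidean distance."""
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(points, k)]  # k random data points
    labels = []
    for _ in range(max_iterations):
        # Assignment step: index of the closest center for each point.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # assumed rule for empty clusters
        if new_centers == centers:  # centers did not change: converged
            break
        centers = new_centers
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = k_means(points, k=2)
```

Different seeds can give different label numberings, and on harder data different clusterings, so it can be worth running several seeds and keeping the result with the smallest total distortion.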




# Question 8

📗 [5 points] Enter the cluster centers from k-means clustering (\(k\) lines, each line containing \(m\) numbers, comma separated, rounded to 4 decimal places).
Hint: See the hints for the previous question.




# Question 9

📗 [5 points] Enter the total distortion (use sum of squared distances) of the clustering from the previous two questions.
Hint
📗 Total distortion is the sum of the squares of distances from the points to the cluster center,
\(D = \displaystyle\sum_{i=1}^{n} \left\|x_{i} - c_{k_{i}}\right\|^{2}\), where \(x_{i}\) belongs to cluster \(k_{i}\) and \(c_{k_{i}}\) is the center of that cluster.
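A short Python sketch of the total distortion as the sum of squared Euclidean distances from each point to its own cluster center. The points, labels, and centers are made up and the function name is illustrative.

```python
import math


def total_distortion(points, labels, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(
        math.dist(point, centers[label]) ** 2
        for point, label in zip(points, labels)
    )


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 0.5), (10.0, 10.5)]
distortion = total_distortion(points, labels, centers)  # 4 * 0.5**2 = 1.0
```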


# Question 10

📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.

# Grade



# Submission


📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.


📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted. 
📗 Please also save the text in the above text box to a file using the button or copy and paste it into a file yourself. Please submit the resulting file with your code on Canvas Assignment P4.
📗 You could load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##p: 4" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.


📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.
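If you print all outputs from your program, a small helper like the Python sketch below can produce text in the "##p: 4", "##id: ...", "##1: ..." layout described above. The function name is made up, and how the page parses multi-line answers is an assumption here (they are written as-is after the tag), so check that the answers load correctly before submitting.

```python
def format_answers(student_id, answers):
    """Build the answer text from a {question number: answer string} dict."""
    lines = ["##p: 4", f"##id: {student_id}"]
    for number in sorted(answers):
        # Assumption: a multi-line answer simply continues on the next lines.
        lines.append(f"##{number}: {answers[number]}")
    return "\n".join(lines) + "\n"


text = format_answers("your_id", {2: "1,3,0,4", 1: "1,2,5,5,9"})
```

Writing `text` to a txt file gives something you can load with the button at the bottom of the page.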

# Solutions

📗 The sample solution in Java and Python will be posted around the deadline. You are allowed to copy and use parts of the solution without attribution. You are allowed to use code from other people and from the Internet, but you must give proper attribution at the beginning of your code. MOSS will be used for code plagiarism check: blocks of copied code without attribution will result in a zero for the whole assignment.
📗 Sample solution from last year: 2020 P4. The homework is slightly different, so please use it with caution.
📗 Sample solution:
Java: File
Python: File
You have to write the code for part 1 yourself.
For part 2, the solution is for single linkage clustering. The way the clusters are printed is also incorrect: for example, if the clusters are {1, 2}, {3, 4, 5}, {6, 7}, your output should be [0, 0, 1, 1, 1, 2, 2] or [1, 1, 0, 0, 0, 2, 2] etc.
📗 You can get help on understanding the algorithm from any of the office hours. To get help with debugging code in Java, please come during the Monday to Friday 2:00 to 3:00 Zoom office hours or Saturday to Sunday 2:00 to 3:00 (I can stay for a few hours after 3:00 by appointment) in-person office hours. To get help with debugging code in Python, please come during the Tuesday 3:00 to 5:00 in-person office hours or the Thursday 3:00 to 5:00 Zoom office hours. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.
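For the output format mentioned with the sample solution above, one way to turn a list of clusters into the expected label list is sketched below in Python. It assumes the items are numbered from 1, as in the example {1, 2}, {3, 4, 5}, {6, 7}; the function name is illustrative.

```python
def clusters_to_labels(clusters):
    """Convert a list of clusters (sets of 1-based items) to a label list."""
    n = sum(len(members) for members in clusters)
    labels = [None] * n
    for label, members in enumerate(clusters):
        for item in members:
            labels[item - 1] = label  # items are numbered from 1
    return labels


labels = clusters_to_labels([{1, 2}, {3, 4, 5}, {6, 7}])  # [0, 0, 1, 1, 1, 2, 2]
```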





Last Updated: November 18, 2024 at 11:43 PM