📗 Enter your ID (the wisc email ID without @wisc.edu) here, then click the button or hit the "Enter" key.
📗 You can also load from your saved file.
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The official deadline is July 24. Late submissions within two weeks will be accepted without penalty, but please submit a regrade request form: Link.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could either copy and paste or load your program outputs into the text boxes for individual questions or print all your outputs to a single text file and load it using the button at the bottom of the page.
📗 Please do not refresh the page: your answers will not be saved.
📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.
📗 (Introduction) In this programming homework, you will cluster US states into groups using a COVID-19 dataset. You will estimate parametric models of virus spread and cluster the states using these parameters as features.
📗 (Part 1) Download the US COVID-19 deaths dataset from Johns Hopkins University: Link, the dataset "time_series_covid19_deaths_US.csv". It is okay if you use another similar time-series dataset on COVID-19 deaths that you trust. Extract the data for the 50 states and remove other rows, for example, the ones for the cruise ships.
📗 (Part 1) Estimate a parametric model that describes the trend. There should be at least 5 parameters; you can add more of your own, but preferably fewer than 10. For a more accurate model, you should divide the number of deaths by the total population of the state; you can use the "Population" column in the Johns Hopkins dataset or another source, for example, Wikipedia. Also remember to rescale the parameters so that they have the same range \(\left[0, 1\right]\). The list of parameters you should include (\(x\) is the time series data with length \(T\) and \(\Delta x\) is the first difference):
(1) Mean of the time-differenced data: \(\hat{\mu} = \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \Delta x_{t}\).
(2) Standard deviation of the time-differenced data: \(\hat{\sigma} = \sqrt{\dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \left(\Delta x_{t} - \hat{\mu}\right)^{2}}\).
(3) Median of the time-differenced data: \(\hat{m} = \Delta x_{\left(\left[\dfrac{T}{2}\right]\right)}\), where \(\left[f\right]\) is the integer part of \(f\) and \(x_{\left(i\right)}\) is the \(i\)th item in the sorted list of \(x\).
(4) Linear trend coefficient of the data: \(\hat{\beta} = \dfrac{\displaystyle\sum_{t=1}^{T} \left(x_{t} - \hat{\mu}\right) \left(t - \dfrac{T + 1}{2}\right)}{\displaystyle\sum_{t=1}^{T} \left(t - \dfrac{T + 1}{2}\right)^{2}}\).
(5) Auto-correlation of the data: \(\hat{\rho} = \dfrac{\displaystyle\sum_{t=2}^{T} \left(x_{t} - \hat{\mu}\right)\left(x_{t-1} - \hat{\mu}\right)}{\displaystyle\sum_{t=1}^{T} \left(x_{t} - \hat{\mu}\right)^{2}}\).
You can include other variables such as the growth rates, say by counting the number of days to double, quadruple, and octuple, etc: see Professor Zhu's Ten Hundred Plot for a two-parameter version: Link. You can also use estimation of trends using curve-fitting algorithms, for example, linear regression coefficients for squared time, cubed time, etc.
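The five required statistics can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `describe` is hypothetical, every formula is applied to the single series passed in and centered with that series' own mean, the median uses 1-based indexing into the sorted list, and a constant series is given an auto-correlation of 0 to avoid dividing by zero. Whether to pass \(x\) or \(\Delta x\) for the trend and auto-correlation is left to your reading of the formulas above.

```python
import math

def describe(series):
    """Return (mean, std, median, trend, autocorrelation) of one series."""
    T = len(series)
    if T < 2:
        raise ValueError("need at least two observations")
    mu = sum(series) / T
    squared = sum((v - mu) ** 2 for v in series)
    sigma = math.sqrt(squared / T)
    # [T / 2]-th item of the sorted list, counting from 1 (assumption).
    median = sorted(series)[T // 2 - 1]
    mid = (T + 1) / 2
    beta = (sum((v - mu) * (t - mid) for t, v in enumerate(series, start=1))
            / sum((t - mid) ** 2 for t in range(1, T + 1)))
    # A constant series has zero variance; return 0 there (assumption).
    rho = 0.0 if squared == 0 else (
        sum((series[t] - mu) * (series[t - 1] - mu) for t in range(1, T))
        / squared)
    return mu, sigma, median, beta, rho

stats = describe([1, 2, 3, 4])
```

Calling `describe` once per state gives one row of the parameter matrix; any extra parameters of your own can be appended to the returned tuple.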
📗 (Part 2) Use hierarchical clustering with single and complete linkages to cluster the states into \(k\) = clusters based on their parameter values. Use Euclidean distance. The clustering is graded by checking the similarity between your clustering and mine using the adjusted Rand index; for details, see: Wikipedia.
📗 (Part 2) Use k-means clustering to cluster the states into the same number of clusters based on their parameter values. Compute the final cluster centers and the total distortion. Use Euclidean distance. The clustering is graded by checking whether the k-means algorithm converged, so it is okay if you stop at a local minimum.
📗 [1 point] Enter the cumulative time series for Wisconsin and another state of your choice (two lines, each line containing integers, comma separated). Do not divide the numbers by the population or otherwise normalize the data.
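For intuition about how two clusterings are compared, here is a minimal sketch of the adjusted Rand index using its standard pair-counting formula. The function name is hypothetical, and the auto-grader's actual implementation is not shown on this page, so treat this only as a way to sanity-check your own outputs. Note that the score does not depend on how the clusters are numbered.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    if n < 2 or n != len(labels_b):
        raise ValueError("need two labelings of the same length >= 2")
    # Pairs of items placed together in both clusterings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (together - expected) / (maximum - expected)

same = adjusted_rand_index([0, 1, 0, 1, 0], [1, 0, 1, 0, 1])
```

Here `same` compares two labelings that differ only by swapping the cluster names, so they describe the same partition.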
Hint
You should add up the rows for the same state.
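A minimal sketch of this aggregation using only the standard library is shown below. The function name, the `Province_State` column name, and the position of the first date column are assumptions about the file layout; check them against the header of the file you actually download. The toy CSV is made-up data for illustration only.

```python
import csv
import io

def cumulative_by_state(csv_text, state_column="Province_State", first_date_index=12):
    """Add up the rows that share a state name into one cumulative series."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    state_index = header.index(state_column)
    totals = {}
    for row in reader:
        counts = [int(float(value)) for value in row[first_date_index:]]
        state = row[state_index]
        if state in totals:
            totals[state] = [a + b for a, b in zip(totals[state], counts)]
        else:
            totals[state] = counts
    return totals

# Toy input (not real data): two Wisconsin rows and one Iowa row.
toy = "Province_State,Population,1/22/20,1/23/20,1/24/20\n" \
      "Wisconsin,100,0,1,3\n" \
      "Wisconsin,200,1,1,2\n" \
      "Iowa,50,0,0,1\n"
series = cumulative_by_state(toy, first_date_index=2)
```

Rows for entries that are not among the 50 states would still need to be dropped afterwards, for example by keeping only the names in your own list of states.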
📗 [2 points] Enter the differenced time series for Wisconsin and the other state you chose: the differences between consecutive numbers in the previous question, which represent the number of additional deaths each day (two lines, each line containing comma-separated integers, one fewer per line than in the previous question).
Hint
If one line of the answer to Question 1 is \(x_{1}, x_{2}, ..., x_{T}\), then the differenced time series would be \(x_{2} - x_{1}, x_{3} - x_{2}, ..., x_{T} - x_{T-1}\).
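As a small illustration of the formula in this hint (the function name is hypothetical):

```python
def difference(series):
    """Return x_2 - x_1, x_3 - x_2, ..., x_T - x_{T-1}."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

daily = difference([0, 1, 3, 6])  # [1, 2, 3]
```

The result always has one fewer entry than the input, which matches the length requirement in the question.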
📗 [5 points] Input the parameter estimates as a matrix, one row for each state (50 lines, each line containing at least 5 numbers, comma separated, rounded to 4 to 8 decimal places). Call the number of rows \(n\) and the number of columns \(m\) for later questions. Make sure you rescale the parameters so that they have the same range \(\left[0, 1\right]\) (or you can use something else if most of the numbers in the same column are 0 or 1), and please do not include an index column or a column with state names, etc.
Hint
📗 Compute \(\hat{\mu}, \hat{\sigma}, \hat{m}, \hat{\beta}, \hat{\rho}\) and your other parameters (a total of \(m\) parameters) for each line. For \(n\) ~ 50 states, you should have \(n\) lines and \(m\) numbers each.
📗 Let \(p_{i j}\) be the number in row \(i\) column \(j\), then the rescaled entry in row \(i\) column \(j\) would be \(x_{i j} = \dfrac{p_{i j} - \displaystyle\min_{i \in \left\{1, 2, ..., n\right\}} p_{i j}}{\displaystyle\max_{i \in \left\{1, 2, ..., n\right\}} p_{i j} - \displaystyle\min_{i \in \left\{1, 2, ..., n\right\}} p_{i j}}\). You can rescale your parameters non-linearly too.
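The linear rescaling above can be sketched as follows. The function name is hypothetical, and mapping a constant column to 0 is an assumption made here to avoid dividing by zero; the page does not say how to treat that case.

```python
def rescale_columns(rows):
    """Min-max rescale each column of a list-of-lists matrix to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in rows
    ]

scaled = rescale_columns([[1, 10], [3, 10], [2, 10]])
# [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]]
```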
📗 [10 points] Input the clusters from single linkage hierarchical clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated). For example, if your clusters are {1, 3, 5} in cluster 1 and {2, 4} in cluster 2, you should enter "0, 1, 0, 1, 0" or "1, 0, 1, 0, 1".
Hint
📗 Start with \(n\) clusters.
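📗 Compute the distances between each pair of clusters, \(C_{i}\), \(C_{j}\) by the single linkage distance, \(d\left(C_{i}, C_{j}\right) = \displaystyle\min_{x \in C_{i}, x' \in C_{j}} \left\|x - x'\right\|_{2}\). Merge the closest pair of clusters and repeat until \(k\) clusters remain.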
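A minimal sketch of agglomerative clustering is given below, assuming the standard procedure of repeatedly merging the two closest clusters until \(k\) remain. The function names and the `linkage` parameter are illustrative: passing `min` gives single linkage, and ties are broken by taking the first closest pair found. The brute-force search is slow in general but adequate for about 50 points.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_clustering(points, k, linkage=min):
    """Merge the closest clusters until k remain; return one label per point."""
    clusters = [[i] for i in range(len(points))]  # start with n clusters
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(euclidean(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

labels = hierarchical_clustering([[0.0], [1.0], [10.0], [12.5], [3.0]], k=2)
# [0, 0, 1, 1, 0]
```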
📗 [10 points] Input the clusters from complete linkage hierarchical clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated). For example, if your clusters are {1, 3, 5} in cluster 1 and {2, 4} in cluster 2, you should enter "0, 1, 0, 1, 0" or "1, 0, 1, 0, 1".
Hint
📗 Start with \(n\) clusters.
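📗 Compute the distances between each pair of clusters, \(C_{i}\), \(C_{j}\) by the complete linkage distance, \(d\left(C_{i}, C_{j}\right) = \displaystyle\max_{x \in C_{i}, x' \in C_{j}} \left\|x - x'\right\|_{2}\). Merge the closest pair of clusters and repeat until \(k\) clusters remain.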
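To make the difference between the two linkages concrete, the sketch below computes both cluster-to-cluster distances for one pair of toy clusters, using the standard definitions: single linkage takes the smallest pairwise distance and complete linkage takes the largest. The function names and the points are illustrative only.

```python
import math

def pairwise_distances(cluster_i, cluster_j):
    """Euclidean distances between every point of one cluster and the other."""
    return [math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
            for p in cluster_i for q in cluster_j]

def single_linkage_distance(cluster_i, cluster_j):
    return min(pairwise_distances(cluster_i, cluster_j))

def complete_linkage_distance(cluster_i, cluster_j):
    return max(pairwise_distances(cluster_i, cluster_j))

left = [[0.0, 0.0], [1.0, 0.0]]
right = [[3.0, 0.0], [5.0, 0.0]]
nearest = single_linkage_distance(left, right)     # 2.0
farthest = complete_linkage_distance(left, right)  # 5.0
```

Only this distance changes between the two questions; the merge loop itself stays the same.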
📗 [10 points] Input the clusters from k means clustering (one line containing \(n\) integers from \(0\) to \(k-1\), comma separated). For example, if your clusters are {1, 3, 5} in cluster 1 and {2, 4} in cluster 2, you should enter "0, 1, 0, 1, 0" or "1, 0, 1, 0, 1".
Hint
📗 Start with k random points in the dataset as the cluster centers \(c_{1}, c_{2}, ..., c_{k}\).
📗 Compute the clusters for each point \(x_{i}\) by finding \(k\) such that \(c_{k}\) is the closest (among the \(k\) centers) to \(x_{i}\).
📗 Recompute the cluster centers,
\(c_{k} = \dfrac{1}{\left| C_{k} \right|} \displaystyle\sum_{x_{i} \in C_{k}} x_{i}\), \(C_{k}\) is the set of points in cluster \(k\).
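The three steps above can be sketched as follows. Several details are assumptions rather than requirements from this page: the function name, the fixed random seed, the iteration cap, keeping the old center when a cluster becomes empty, and defining total distortion as the sum of squared Euclidean distances from each point to its assigned center (check the slides for the definition used in the course). `math.dist` requires Python 3.8 or later.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centers, distortion) for Lloyd's k-means iteration."""
    rng = random.Random(seed)
    # Start with k random points from the dataset as the centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its closest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its old center (assumption)
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    distortion = sum(math.dist(p, centers[label]) ** 2
                     for p, label in zip(points, labels))
    return labels, centers, distortion

toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers, distortion = k_means(toy, k=2)
```

Different seeds can lead to different local minima, which is acceptable here because grading only checks that the algorithm converged.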
📗 [5 points] Enter the cluster centers from k means clustering (\(k\) lines, each line containing \(m\) numbers, comma separated, rounded to 4 decimal places).
📗 [1 point] Please confirm that you are going to submit the code on Canvas under Assignment P4, and make sure you give attribution for all blocks of code you did not write yourself (see bottom of the page for details and examples).
📗 [1 point] Please enter any comments and suggestions including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.
📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 Please also save the text in the above text box to a file using the button, or copy and paste it into a file yourself. You can also include the resulting file with your code on Canvas Assignment P4.
📗 You could load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##p: 4" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
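A small helper for producing text in this format might look like the sketch below. The function name and the dictionary layout are hypothetical, and how answers that span several lines should be written is not specified above, so check the loaded result on the page before submitting.

```python
def format_answers(student_id, answers, project=4):
    """Build the '##p', '##id', '##1', ... text from a {number: answer} dict."""
    lines = [f"##p: {project}", f"##id: {student_id}"]
    for number in sorted(answers):
        lines.append(f"##{number}: {answers[number]}")
    return "\n".join(lines)

text = format_answers("your id", {2: "1,2", 1: "0,1,3"})
```

Writing `text` to a file with `open("answers.txt", "w")` gives something you can load with the button or paste into the text box.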
📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.
📗 The sample solution in Java and Python will be posted around the deadline. You are allowed to copy and use parts of the solution with attribution. You are allowed to use code from other people (with their permission) and from the Internet, but you must give attribution at the beginning of your code. You are allowed to use large language models such as GPT4 to write parts of the code for you, but you have to include the prompts you used in the code submission. For example, you can put the following comments at the beginning of your code:
% Code attribution: (TA's name)'s P4 example solution.
% Code attribution: (student name)'s P4 solution.
% Code attribution: (student name)'s answer on Piazza: (link to Piazza post).
% Code attribution: (person or account name)'s answer on Stack Overflow: (link to page).
% Code attribution: (large language model name e.g. GPT4): (include the prompts you used).
📗 Sample solution from last year: 2022 P4. The homework is slightly different, so please use it with caution.
📗 You can get help on understanding the algorithm from any of the office hours; to get help with debugging, please go to the TA's office hours. For times and locations see the Home page. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.